SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License

KINC isn't using GPUs. #59

Closed spficklin closed 5 years ago

spficklin commented 5 years ago

I recompiled KINC today to test that the fix for issue #57 was working. With qkinc, everything worked fine. Now I'm trying to run on 4 nodes with 16 GPUs, and KINC is not behaving as expected. Nothing is reported in the log file. It does appear to be running, but only on one of the nodes, and it is not using the GPUs. In my previous experience, KINC started using the GPUs relatively quickly.

spficklin commented 5 years ago

Update: the .cmx and .ccm files are being updated, but the GPUs don't appear to be in use.

4ctrl-alt-del commented 5 years ago

I just finished running your exact command on an exact copy of your EMX file with no errors. It took about 70 minutes to run and used all GPUs the whole time. These are the exact commands I used:

idev --partition=ficklin --account=ficklin --nodes=5 --ntasks-per-node=4 --gres=gpu:tesla:4 --time=96:00:00
srun --mpi=pmi2 -l kinc run similarity --input CVM_CAHNRS_bovine.GEM.FPKM.log2.emx --ccm CVM_CAHNRS_bovine.GEM.FPKM.log2.ccm --cmx CVM_CAHNRS_bovine.GEM.FPKM.log2.cmx --clusmethod gmm --corrmethod spearman --minexpr -inf --minsamp 25 --minclus 1 --maxclus 5 --crit ICL --preout TRUE --postout TRUE --mincorr 0.5 --maxcorr 1 --bsize 32768 --gsize 4096 --lsize 32

Perhaps you need to recompile ACE too?

4ctrl-alt-del commented 5 years ago

@spficklin you messaged me saying it still does not work. Did you recompile ACE too?

Also look at your settings...

$ kinc settings
          OpenCL Device: 0:0
        ACU Thread Size: 4
        MPI Buffer Size: 4
Chunk Working Directory: .
           Chunk Prefix: chunk
        Chunk Extension: abd
                Logging: on

Is your OpenCL device set to "none"? If so, ACE will not use the GPUs.

spficklin commented 5 years ago

I did not recompile ACE because there weren't any code changes when I ran a git pull. I will try it though.

spficklin commented 5 years ago

@4ctrl-alt-del Thanks for looking into this. It does appear that recompiling both ACE and KINC solved the problem. I'm a bit baffled by that, though: I remember checking for updates to ACE between the time I opened this issue and the time Ben said he fixed it, and there were none.

spficklin commented 5 years ago

As an update, my KINC settings were off a bit:

          OpenCL Device: none
        ACU Thread Size: 4
        MPI Buffer Size: 4
Chunk Working Directory: .
           Chunk Prefix: chunk
        Chunk Extension: abd
                Logging: on

Notice the none on the OpenCL Device line. To fix this, I had to log on to a node with a GPU; only there could I change the setting.

After pondering this a bit, I realized the setting was changed while I was running qkinc: I had set the OpenCL device to none so that I could test KINC without it trying to use GPUs. I forgot that this change persists between KINC runs.
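
Since the OpenCL device setting persists between runs, a quick pre-flight check before submitting a multi-node job might catch this. The following is only a minimal sketch: it assumes the output of `kinc settings` keeps the "Key: value" layout shown in this thread, and the helper names `parse_kinc_settings` and `opencl_enabled` are hypothetical, not part of KINC or ACE.

```python
def parse_kinc_settings(text):
    """Parse 'Key: value' lines, as printed by `kinc settings`, into a dict.

    Assumption: the layout matches the listings pasted in this thread.
    """
    settings = {}
    for line in text.splitlines():
        # Split on the first colon only, so a value such as "0:0" stays intact.
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def opencl_enabled(settings):
    """Return False when the OpenCL device is missing or set to 'none'."""
    device = settings.get("OpenCL Device", "none")
    return device.lower() != "none"


# Sample text copied from the settings listing above.
sample = """\
          OpenCL Device: none
        ACU Thread Size: 4
        MPI Buffer Size: 4
Chunk Working Directory: .
           Chunk Prefix: chunk
        Chunk Extension: abd
                Logging: on
"""

settings = parse_kinc_settings(sample)
if not opencl_enabled(settings):
    print("OpenCL device is 'none'; GPUs will not be used.")
```

In practice the text could be captured with `subprocess.run(["kinc", "settings"], capture_output=True, text=True).stdout` rather than a pasted sample, assuming `kinc` is on the PATH of the node where the check runs.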