Closed: spficklin closed this issue 5 years ago
Update: the .cmx and .ccm files are being updated, but the GPUs don't appear to be in use.
I just finished running your exact command on an exact copy of your EMX file with no errors. It took about 70 minutes to run and used all GPUs the whole time. The exact commands I used are these:
idev --partition=ficklin --account=ficklin --nodes=5 --ntasks-per-node=4 --gres=gpu:tesla:4 --time=96:00:00
srun --mpi=pmi2 -l kinc run similarity --input CVM_CAHNRS_bovine.GEM.FPKM.log2.emx --ccm CVM_CAHNRS_bovine.GEM.FPKM.log2.ccm --cmx CVM_CAHNRS_bovine.GEM.FPKM.log2.cmx --clusmethod gmm --corrmethod spearman --minexpr -inf --minsamp 25 --minclus 1 --maxclus 5 --crit ICL --preout TRUE --postout TRUE --mincorr 0.5 --maxcorr 1 --bsize 32768 --gsize 4096 --lsize 32
Perhaps you need to recompile ACE too?
@spficklin you messaged me saying it still does not work. Did you recompile ACE too?
Also look at your settings...
$ kinc settings
OpenCL Device: 0:0
ACU Thread Size: 4
MPI Buffer Size: 4
Chunk Working Directory: .
Chunk Prefix: chunk
Chunk Extension: abd
Logging: on
Is your OpenCL device set to "none"? If so, then ACE will not use the GPUs.
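The check described above can be scripted. The following is only a minimal sketch: `opencl_device_status` is a hypothetical helper name, and it assumes the `kinc settings` output keeps the `OpenCL Device: <value>` line format shown in this thread.

```shell
# Hypothetical helper (not part of KINC or ACE): reads the text printed by
# `kinc settings` on stdin and reports whether an OpenCL device is selected.
# Assumes the "OpenCL Device: <value>" line format shown in this thread.
opencl_device_status() {
    # Extract the value after "OpenCL Device:" from the first matching line.
    device=$(sed -n 's/^OpenCL Device:[[:space:]]*//p' | head -n 1)
    if [ -z "$device" ]; then
        echo "unknown"
    elif [ "$device" = "none" ]; then
        echo "disabled"
    else
        echo "enabled ($device)"
    fi
}
```

Usage would be `kinc settings | opencl_device_status`, run on the node where the job will execute. A result of `disabled` matches the situation where ACE will not use the GPUs.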
I did not recompile ACE because there weren't any code changes when I ran a git pull. I will try it though.
@4ctrl-alt-del Thanks for looking into this. It does appear that recompiling both ACE and KINC solved the problem. But I'm a bit baffled by that, because I remember checking whether there were any updates to ACE between the time I opened this issue and when Ben said he fixed it, and there were none.
So, as an update, my KINC settings were off a bit:
OpenCL Device: none
ACU Thread Size: 4
MPI Buffer Size: 4
Chunk Working Directory: .
Chunk Prefix: chunk
Chunk Extension: abd
Logging: on
Notice the none on the OpenCL line. To fix this, I had to log on to a node with a GPU; only then could I change the setting.
After pondering this a bit, I realized the setting got changed when I was running qkinc: I edited the settings to set the OpenCL device to none so that I could test KINC without it trying to use GPUs. I failed to remember that this change persists between KINC runs.
I recompiled KINC today to test that the fix for issue #57 was working. Using qkinc, all worked just fine. Now I'm trying to run on 4 nodes with 16 GPUs and KINC is not behaving as expected. Nothing is reported in the log file. It does appear to be running, but only on one of the nodes, and it is not using the GPUs. In my previous experience, KINC started using the GPUs relatively quickly.
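One way to confirm whether the GPUs are actually idle is to query their utilization on each compute node. This is a rough sketch, assuming NVIDIA GPUs with `nvidia-smi` available on the nodes; `idle_gpus` is a hypothetical helper name, and it expects the two-column CSV (index, utilization) produced by the query shown in the comment.

```shell
# Hypothetical helper: reads "index, utilization" CSV lines on stdin and
# prints the index of every GPU reporting 0% utilization.
# Expected input comes from a query such as:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
idle_gpus() {
    # Split on a comma followed by optional spaces; treat column 2 as a number.
    awk -F', *' '$2 + 0 == 0 { print $1 }'
}
```

For example, `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | idle_gpus` lists the idle GPUs on the current node. A single sample can be misleading, so repeating the check a few times while the job runs gives a more reliable picture.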