SystemsGenetics / ACE

Accelerated Computational Engine (ACE) is a GPU-enabled framework to simplify creation of GPU-capable applications
http://SystemsGenetics.github.io/ACE
GNU General Public License v2.0

Bad args and MPI #60

Closed spficklin closed 5 years ago

spficklin commented 6 years ago

Please see issue posted on KINC project: https://github.com/SystemsGenetics/KINC/issues/48

spficklin commented 5 years ago

Copied text from KINC issue:

I reported this verbally to @4ctrl-alt-del, but I thought I'd add it here for the record. When running KINC v3 on Kamiak for the first time, I requested 4 nodes with 4 GPUs each and used the following submission script:

#!/bin/sh
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:tesla:4
#SBATCH --time=12:00:00    
#SBATCH --job-name=SC_similarity
#SBATCH --output=logs/02-SC_similarity.log
#SBATCH --mail-type=ALL

module load gcc/6.1.0 MPI/openmpi/3.0.0 cuda/9.1.85 qt/5.10.1 ACE/dev KINC/3.2.1

srun -v --mpi=pmi2 -l kinc run similarity --input "Yeast.emx" --clus "Yeast.ccm" --corr "Yeast.cmx" --clusmethod "gmm" --corrmethod "spearman" --minexpr -inf --minsamp 15 --minclus 1 --maxclus 5 --crit "ICL" --preout TRUE --postout TRUE --mincorr 0.5 --maxcorr 1 --ksize 4096

The job launched fine, but I observed the following incorrect behavior: 1) The processes on 3 of the 4 nodes bound to the GPUs properly, but GPU utilization was 0%. So I had 12 processes assigned to 12 GPUs, but they were doing nothing, not even on the CPU. 2) The processes on the master node (where the MPI master was running) were not bound to a GPU, and one of the processors was periodically busy. An 'strace' showed that it seemed to just be looping and sleeping.

It turns out the problem was caused by my providing old arguments instead of the new, updated ones. When I changed the arguments to the following, it worked fine and finished in 1 hr 13 min:

srun -v --mpi=pmi2 -l kinc run similarity --input Yeast.emx --ccm yeast.ccm --cmx yeast.cmx --clusmethod gmm --corrmethod spearman -minexpr -inf --minsamp 15 --minclus 1 --maxclus 5 --crit ICL --preout TRUE --postout TRUE --mincorr 0.5 --maxcorr 1 --ksize 4096
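To make the difference between the two invocations easier to see, here is a minimal Python sketch that compares the long option names in each command. The `option_names` helper is hypothetical and written only for this illustration; the command strings are copied from the two `srun` lines above, with the `srun` prefix omitted because it is identical in both.

```python
import shlex


def option_names(command):
    """Return the set of tokens that look like long options (start with '--')."""
    return {tok for tok in shlex.split(command) if tok.startswith("--")}


# The kinc portion of the first (failing) invocation, as posted.
old_cmd = (
    'kinc run similarity --input "Yeast.emx" --clus "Yeast.ccm" '
    '--corr "Yeast.cmx" --clusmethod "gmm" --corrmethod "spearman" '
    '--minexpr -inf --minsamp 15 --minclus 1 --maxclus 5 --crit "ICL" '
    "--preout TRUE --postout TRUE --mincorr 0.5 --maxcorr 1 --ksize 4096"
)

# The kinc portion of the second (working) invocation, as posted.
new_cmd = (
    "kinc run similarity --input Yeast.emx --ccm yeast.ccm --cmx yeast.cmx "
    "--clusmethod gmm --corrmethod spearman -minexpr -inf --minsamp 15 "
    "--minclus 1 --maxclus 5 --crit ICL --preout TRUE --postout TRUE "
    "--mincorr 0.5 --maxcorr 1 --ksize 4096"
)

only_old = sorted(option_names(old_cmd) - option_names(new_cmd))
only_new = sorted(option_names(new_cmd) - option_names(old_cmd))

print("only in the first command: ", only_old)
print("only in the second command:", only_new)
```

The comparison shows `--clus` and `--corr` in the first command where the second uses `--ccm` and `--cmx`. It also lists `--minexpr` as appearing only in the first command, because the second command as posted writes it with a single dash (`-minexpr`).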

So KINC or ACE needs to handle the situation gracefully when bad, unknown, or incorrect arguments are provided.

spficklin commented 5 years ago

I'm adding the 3.0.2 milestone to this because it is a really easy mistake for someone to make. Since we're planning to use this version for publication, we should make sure this is fixed.

4ctrl-alt-del commented 5 years ago

The issue here was that ACE did not check for unrecognized options; it silently ignored them and reported nothing to the user. ACE now reports any unrecognized options to the user and halts execution. It is the responsibility of the application using ACE to make sure all required options have been given. Fixed in commit 9f737172b8bf06ab3b9d23aba62f87a0c528f380
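The behavior described above can be sketched roughly as follows. This is an illustration in Python, not ACE's actual C++ implementation: `parse_options`, `check_required`, `UnrecognizedOptionError`, and the `KNOWN` and `REQUIRED` sets are all hypothetical. The option names are borrowed from the commands in this thread and are not a complete or authoritative list of KINC options.

```python
class UnrecognizedOptionError(Exception):
    """Raised when the command line contains options the framework does not know."""


def parse_options(argv, known):
    """Parse '--name value' pairs; report every unknown name instead of ignoring it."""
    options = {}
    unknown = []
    i = 0
    while i < len(argv):
        token = argv[i]
        if not token.startswith("--"):
            raise ValueError(f"expected an option name, got {token!r}")
        if i + 1 >= len(argv):
            raise ValueError(f"option {token} is missing a value")
        name = token[2:]
        if name in known:
            options[name] = argv[i + 1]
        else:
            unknown.append(token)  # collect all offenders so the user sees them at once
        i += 2
    if unknown:
        # Framework-level check: halt rather than silently continue.
        raise UnrecognizedOptionError("unrecognized option(s): " + ", ".join(unknown))
    return options


def check_required(options, required):
    """Application-level check: every required option must be present."""
    missing = sorted(required - options.keys())
    if missing:
        raise ValueError("missing required option(s): " + ", ".join(missing))


# Hypothetical option sets, assumed only for this sketch.
KNOWN = {"input", "ccm", "cmx", "clusmethod", "corrmethod"}
REQUIRED = {"input", "ccm", "cmx"}

if __name__ == "__main__":
    try:
        opts = parse_options(["--input", "Yeast.emx", "--clus", "Yeast.ccm"], KNOWN)
        check_required(opts, REQUIRED)
    except (UnrecognizedOptionError, ValueError) as err:
        print(f"error: {err}")
```

Splitting the two checks mirrors the division of responsibility described in the comment: the framework rejects names it does not recognize, while the application decides which options are mandatory.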