jgphpc opened this issue 8 years ago
No, it's not.
But it should be easy to change the behaviour.
We managed to attach DDT to a running four-node job:
module use /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/modules/all
module load Python/3.5.2-CrayGNU-2016.03
module load h5py/2.5.0-CrayGNU-2016.03-Python-3.5.2-parallel
module load pycuda/2016.1.2-CrayGNU-2016.03-Python-3.5.2-cuda-7.0
module load ddt/6.1
daint01: srun -N 4 -n 4 --ntasks-per-node=1 -c 1 \
python `which pyfr` run -b cuda -p *.pyfrm TGV-4.ini
daint01: ddt &
Attach &
/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/bin/python
* Issue: ddt does not automatically stop inside CUDA kernels
100.0% [==============================> ] 0.30/0.30 ela: 00:06:22 rem: 00:00:00
The call stack of the job, which crashes with "cuStreamSynchronize failed: unknown error",
points into the PyFR source:
X=/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pyfr-1.4.0-py3.5.egg/pyfr/
File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
File "$Xscripts/main.py", line 109, in main
File "$Xscripts/main.py", line 248, in process_restart
File "$Xscripts/main.py", line 225, in _process_common
File "$Xintegrators/base.py", line 197, in run
File "$Xintegrators/std/controllers.py", line 72, in advance_to
File "$Xintegrators/std/steppers.py", line 201, in step
File "$Xsolvers/navstokes/system.py", line 43, in rhs
File "$Xbackends/base/backend.py", line 163, in runall
File "$Xbackends/cuda/types.py", line 133, in runall
    return self
File "$Xbackends/cuda/types.py", line 105, in _wait
pycuda._driver.Error: cuStreamSynchronize failed: unknown error
SRC=/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/
unzip -l $SRC/pyfr-1.4.0-py3.5.egg | grep types.py | grep cuda
==> pyfr/backends/cuda/types.py
@iyer-arvind can you confirm this is the right place to look?
PyFR compiles CUDA kernels just in time, as they are needed. We do not want all 2000 nodes to call nvcc at once, so when a small run has completed, we copy the home directory from temp to scratch, then copy the nidxxxx.home/pycuda/*.cubin files into the image home/pycuda directory. Once that is done, PyFR will not invoke nvcc.
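The cache-staging step described above can be sketched as follows. The directory names (nid00001.home, home/pycuda) are illustrative stand-ins for the real node-local and image paths, and the sketch builds its own fixture tree so it runs anywhere:

```shell
# Stand-in layout: two node-local PyCUDA caches plus the shared
# image home directory that all ranks will read from.
WORK=$(mktemp -d)
cd "$WORK"
mkdir -p nid00001.home/pycuda nid00002.home/pycuda home/pycuda
touch nid00001.home/pycuda/kernel_a.cubin nid00002.home/pycuda/kernel_b.cubin

# Merge every node-local cache into the image home. With the cache
# pre-populated, the JIT step finds the compiled .cubin files and
# never has to invoke nvcc on the full-size job.
for d in nid*.home/pycuda; do
    cp "$d"/*.cubin home/pycuda/
done
ls home/pycuda
```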
Correct.
@jgphpc Regarding "ddt does not automatically stops inside cuda kernel"
Is the pycuda-boost library compiled with "-g -G"? This would very easily explain the problem.
Edit: I read this: "is the code (runtime-)compiled with debugging flags ?" "no its not"
which confirms my suspicions.
@patrick-allinea Not sure pycuda needs recompilation: python -m pycuda.debug
@patrick-allinea here is a four-node test case to exercise DDT with on daint:
cd $SCRATCH/
cp -a /scratch/daint/piccinal/24315/DDT/in . ; cd in/
sbatch 0.slurm
The output file (/scratch/daint/piccinal/24315/DDT/JG1/slurm-498883.out) will look like this:
# --- start
100.0% [===> ] 0.30/0.30 ela: 00:01:37 rem: 00:00:00
STATUS=0
# --- end
Copying over the home directory from root rank
JOB DONE
TGV-4.ini
will set the duration of the job.

@jgphpc It's kind of working for me, although I can't see a thing.
To start, I did:
Interactive slurm session only:
module use /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/modules/all
module load Python/3.5.2-CrayGNU-2016.03
module load h5py/2.5.0-CrayGNU-2016.03-Python-3.5.2-parallel
module load pycuda/2016.1.2-CrayGNU-2016.03-Python-3.5.2-cuda-7.0
module load ddt/6.1
Then:
srun -N 4 -n 4 --ntasks-per-node=1 -c 1 \
allinea-client --ddtsessionfile /path/to/session/daint02-1 \
python `which pyfr` \
run -b cuda -p *.pyfrm TGV-4.ini
I have no access to the source code, so I don't know where to set breakpoints or how to identify the CUDA part of the code where I need to ask DDT to break.
Edit: in the "GPU devices" tab:
Ranks 0 - 3: No device
Running on hosts nid03181, nid04506, nid04507, nid03180.
Is this a symptom of a problem, where nvidia-smi doesn't return? Are the submission/srun commands correct?
@patrick-allinea could you adapt your slurm script to include:
module use /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/modules/all # yes
module use /apps/common/UES/sandbox/jgp/ebforpyfr+ddt/easybuild/modules/all # yes
module load Python/3.5.2-CrayGNU-2016.03+debug
module load h5py/2.5.0-CrayGNU-2016.03-Python-3.5.2-parallel
module load pycuda/2016.1.2-CrayGNU-2016.03-Python-3.5.2-cuda-7.0
@jgphpc there is no pyfr in your package...
patrickw /scratch/daint/patrickw/ALLINEA-904/batch $ module load Python/3.5.2-CrayGNU-2016.03
patrickw /scratch/daint/patrickw/ALLINEA-904/batch $ which pyfr
/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/bin/pyfr
patrickw /scratch/daint/patrickw/ALLINEA-904/batch $ module unload Python
patrickw /scratch/daint/patrickw/ALLINEA-904/batch $ module load Python/3.5.2-CrayGNU-2016.03+debug
patrickw /scratch/daint/patrickw/ALLINEA-904/batch $ which pyfr
which: no pyfr in
Fixed, but now I get:
ImportError: /apps/common/UES/sandbox/jgp/ebforpyfr+ddt/easybuild/software/Python/3.5.2-CrayGNU-2016.03+debug/lib/python3.5/site-packages/mpi4py-2.0.0-py3.5-linux-x86_64.egg/mpi4py/MPI.cpython-35m-x86_64-linux-gnu.so: undefined symbol: MPI_File_iread_at_all
MPI_File_iread_at_all is defined in cray-mpich:
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libfmpich.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libfmpich_gnu_49.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libfmpich_gnu_49_mt.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libmpich.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libmpich_gnu_49.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libmpich_gnu_49_mt.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libmpl.so matches
but mpi4py points to an older /opt/cray/mpt/7.2.2/gni/mpich2-gnu/4.9/lib/libmpich_gnu_49.so.3:
libmpich_gnu_49.so.3 => /opt/cray/lib64/libmpich_gnu_49.so.3 (0x00002b07d4bbf000)
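A quick way to see which MPI library the extension will actually bind at load time is to run ldd on the shared object itself. The sketch below inspects an arbitrary stdlib extension module so it is runnable anywhere; on daint you would instead point ldd at mpi4py's MPI.cpython-35m-x86_64-linux-gnu.so:

```shell
# Locate a compiled extension module (.so) via Python itself, then ask
# the dynamic linker which shared libraries it would resolve. A stale
# library path shows up here as a binding to the wrong version.
EXT=$(python3 -c 'import _ctypes; print(_ctypes.__file__)')
ldd "$EXT"
```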
The fix is to put the current Cray MPI libraries first on the search path:
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
. ./submit.slm.cray 0001 # (srun python pyfr ...)
daint01: ddt & # /apps/common/UES/sandbox/jgp/ebforpyfr+ddt/tmp/Python/3.5.2/CrayGNU-2016.03+debug/numpy/
@iyer-arvind I install PyFR with python setup.py install,
which creates a pyfr-1.4.0-py3.5.egg
file. Would it be possible to cp the source code instead of an egg/zip file?
Just
export PYTHONPATH=
.. should work.
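To make the unzip-plus-PYTHONPATH route concrete, here is a self-contained sketch. It fabricates a tiny stand-in egg (eggs are just zip archives) rather than touching the real pyfr-1.4.0-py3.5.egg, so all names are illustrative only:

```shell
WORK=$(mktemp -d)
# Create a minimal package and zip it up, mimicking an .egg.
mkdir -p "$WORK/pkg/demo"
printf 'X = 42\n' > "$WORK/pkg/demo/__init__.py"
(cd "$WORK/pkg" && python3 -m zipfile -c "$WORK/demo-1.0.egg" demo)

# Unpack the egg and put the unpacked tree (not the egg) on
# PYTHONPATH: the module's __file__ then points at a plain .py file
# that a debugger can open and set breakpoints in.
python3 -m zipfile -e "$WORK/demo-1.0.egg" "$WORK/src"
SRC_FILE=$(PYTHONPATH="$WORK/src" python3 -c 'import demo; print(demo.__file__)')
echo "$SRC_FILE"
```

With the real egg, the same two steps (unzip, then export PYTHONPATH to the unpacked directory) give DDT source files it can display.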
@iyer-arvind is the code (runtime-)compiled with debugging flags?