Attach a debugger to a running job

jgphpc commented 8 years ago

@iyer-arvind is the code (runtime-)compiled with debugging flags ?

iyer-arvind commented 8 years ago

No its not.

iyer-arvind commented 8 years ago

But it should be easy to change the behaviour.

jgphpc commented 8 years ago

We managed to attach DDT to a running 4 nodes job:

/scratch/daint/piccinal/24315/TGV/
The callstack looks like this: stack_4cn.txt

eff

DDT

interactive slurm session only:
- salloc -Ag97 --time=00:10:00 --nodes=4 --ntasks=4 --cpus-per-task=1 --ntasks-per-node=1

module use /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/modules/all
module load Python/3.5.2-CrayGNU-2016.03
module load h5py/2.5.0-CrayGNU-2016.03-Python-3.5.2-parallel
module load pycuda/2016.1.2-CrayGNU-2016.03-Python-3.5.2-cuda-7.0
module load ddt/6.1

daint01: srun -N 4 -n 4 --ntasks-per-node=1 -c 1 \
  python `which pyfr` run -b cuda -p *.pyfrm TGV-4.ini

daint01: ddt &
 Attach &
/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/
    3.5.2-CrayGNU-2016.03/bin/python

eff

* Issue: ddt does not automatically stops inside cuda kernel
 100.0% [==============================> ] 0.30/0.30 ela: 00:06:22 rem: 00:00:00

jgphpc commented 8 years ago

The callstack of the job crashing with cuStreamSynchronize failed: unknown error points to the src of PyFR:

Callstack:

X=/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/
software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/
site-packages/pyfr-1.4.0-py3.5.egg/pyfr/

    File "/apps/common/UES/sandbox/jgp/ebforpyfr/
easybuild/software/Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
    File "$Xscripts/main.py", line 109, in main
    File "$Xscripts/main.py", line 248, in process_restart
    File "$Xscripts/main.py", line 225, in _process_common
    File "$Xintegrators/base.py", line 197, in run
    File "$Xintegrators/std/controllers.py", line 72, in advance_to
    File "$Xintegrators/std/steppers.py", line 201, in step
    File "$Xsolvers/navstokes/system.py", line 43, in rhs
    File "$Xbackends/base/backend.py", line 163, in runall
    File "$Xbackends/cuda/types.py", line 133, in runall       return self
    File "$Xbackends/cuda/types.py", line 105, in _wait 
              "of the metaclasses of all its bases")
  pycuda._driver.Error: cuStreamSynchronize failed: unknown error

PyFR python src:

SRC=/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/
software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/
site-packages/

unzip -l $SRC/pyfr-1.4.0-py3.5.egg|grep types.py |grep cuda
==> pyfr/backends/cuda/types.py

@iyer-arvind can you confirm this is the right place where to look at ?

PyFR-Image.tar.gz

tmp/testing/env/bin/pyfr
tmp/testing/env/bin/python
tmp/testing/env/lib/python3.4/site-packages/Mako-1.0.2-py3.4.egg
- makotemplates
tmp/testing/env/lib/python3.4/site-packages/mpmath
- Python library for arbitrary-precision floating-point arithmetic
tmp/testing/env/lib/python3.4/site-packages/py-1.4.31-py3.4
- library with cross-python path, ini-parsing, io, code, log facilities
tmp/testing/env/lib/python3.4/site-packages/pycuda-2016.1.2
tmp/testing/env/lib/python3.4/site-packages/pytools-2016.2.1
tmp/testing/env/lib/python3.4/site-packages/MarkupSafe-0.23
tmp/testing/env/lib/python3.4/site-packages/decorator-4.0.10
tmp/testing/env/lib/python3.4/site-packages/pytest-2.9.2
tmp/testing/home/pycuda/*.cubin

PyFR compiles CUDA kernels just in time if they are needed. We do not want all the 2000 nodes to call nvcc at once, so when a small run has completed, we copy over the home directory from temp to scratch, then copy over the nidxxxx.home/pycuda/*.cubin to the image home/pycuda directory. Once that is done, PyFR will not invoke nvcc.

iyer-arvind commented 8 years ago

Correct.

patrick-arm commented 8 years ago

@jgphpc Regarding "ddt does not automatically stops inside cuda kernel"

Is the pycuda-boost library compiled with "-g -G"? This would very easily explain the problem.

Edit: I read this: "is the code (runtime-)compiled with debugging flags ?" "no its not"

which confirms my suspicions.

@patrick-allinea Not sure pycuda needs recompilation: python -m pycuda.debug

jgphpc commented 8 years ago

@patrick-allinea here is a 4 nodes testcase to exercise with DDT on daint:

cd $SCRATCH/
cp -a /scratch/daint/piccinal/24315/DDT/in .  ; cd in/
sbatch 0.slurm

A successful output (/scratch/daint/piccinal/24315/DDT/JG1/slurm-498883.out) will look like this:

# --- start
 100.0% [===> ] 0.30/0.30 ela: 00:01:37 rem: 00:00:00
STATUS=0
# --- end
Copying over the home directory from root rank
JOB DONE

tend in TGV-4.ini will set the duration of the job.

patrick-arm commented 8 years ago

@jgphpc It's kind of working for me, although I can't see a thing.

To start, I did:

Interactive slurm session only:

salloc -Ag97 --time=00:10:00 --nodes=4 --ntasks=4 --cpus-per-task=1 --ntasks-per-node=1

module use /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/modules/all
module load Python/3.5.2-CrayGNU-2016.03
module load h5py/2.5.0-CrayGNU-2016.03-Python-3.5.2-parallel
module load pycuda/2016.1.2-CrayGNU-2016.03-Python-3.5.2-cuda-7.0
module load ddt/6.1

Start Allinea DDT in manual launch (listen)

Then:

srun -N 4 -n 4 --ntasks-per-node=1 -c 1 \
allinea-client --ddtsessionfile /path/to/session/daint02-1 \
python `which pyfr` \
run -b cuda -p *.pyfrm TGV-4.ini

Activate CUDA debugging.

I have no access to the source code so I don't know where to set breakpoints and/or identify the CUDA part of the code where I need to ask DDT to break into.

Edit: in the "GPU devices" tab: Ranks 0 - 3: No device Running on hosts nid03181, nid04506, nid04507, nid03180. Is this a symptom of a problem, where nvidia-smi doesn't return? Are the submission command/srun commands correct?

jgphpc commented 8 years ago

@patrick-allinea could you adapt your slurm script to include:

    module use /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/modules/all     # yes
    module use /apps/common/UES/sandbox/jgp/ebforpyfr+ddt/easybuild/modules/all # yes
    module load Python/3.5.2-CrayGNU-2016.03+debug
    module load h5py/2.5.0-CrayGNU-2016.03-Python-3.5.2-parallel
    module load pycuda/2016.1.2-CrayGNU-2016.03-Python-3.5.2-cuda-7.0

I may need to adjust the paths but could you try again ?

patrick-arm commented 8 years ago

@jgphpc there is no pyfr in your package...

patrickw /scratch/daint/patrickw/ALLINEA-904/batch $ module load Python/3.5.2-CrayGNU-2016.03 patrickw /scratch/daint/patrickw/ALLINEA-904/batch $ which pyfr /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/bin/pyfr

patrickw /scratch/daint/patrickw/ALLINEA-904/batch $ module unload Python patrickw /scratch/daint/patrickw/ALLINEA-904/batch $ module load Python/3.5.2-CrayGNU-2016.03+debug patrickw /scratch/daint/patrickw/ALLINEA-904/batch $ which pyfr which: no pyfr in

jgphpc commented 8 years ago

Fixed but now i get:

  ImportError:
/apps/common/UES/sandbox/jgp/ebforpyfr+ddt/easybuild/software/
Python/3.5.2-CrayGNU-2016.03+debug/lib/python3.5/site-packages/
mpi4py-2.0.0-py3.5-linux-x86_64.egg/mpi4py/MPI.cpython-35m-x86_64-linux-gnu.so:
undefined symbol: MPI_File_iread_at_all

MPI_File_iread_at_all is defined in cray-mpich:

grep MPI_File_iread_at_all /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/*.so

Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libfmpich.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libfmpich_gnu_49.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libfmpich_gnu_49_mt.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libmpich.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libmpich_gnu_49.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libmpich_gnu_49_mt.so matches
Binary file /opt/cray/mpt/7.3.2/gni/mpich-gnu/49/lib/libmpl.so matches

but mpi4py points to an older /opt/cray/mpt/7.2.2/gni/mpich2-gnu/4.9/lib/libmpich_gnu_49.so.3:

ldd ...mpi4py/MPI.cpython-35m-x86_64-linux-gnu.so |grep mpich

    libmpich_gnu_49.so.3 => /opt/cray/lib64/libmpich_gnu_49.so.3 (0x00002b07d4bbf000)

Solution is to add export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

jgphpc commented 8 years ago

salloc -Ag97 --time=00:10:00 --nodes=4 --ntasks=4 --cpus-per-task=1 --ntasks-per-node=1

. ./submit.slm.cray 0001 # (srun python pyfr ...)

daint01: ddt & # /apps/common/UES/sandbox/jgp/ebforpyfr+ddt/tmp/Python/3.5.2/CrayGNU-2016.03+debug/numpy/

eff

jgphpc commented 8 years ago

@iyer-arvind i install PyFR with python setup.py install which creates an pyfr-1.4.0-py3.5.egg file. Would it be possible to cp the src code instead of an egg/zip file ?

eff

iyer-arvind commented 8 years ago

Just export PYTHONPATH=:$PYTHON_PATH and export PATH=$PATH:/pyfr/scripts

.. should work.

eth-cscs / pyfr