eth-cscs / pyfr

pyfr@cscs (https://github.com/vincentlab/PyFR)

WIP: T106D_cascade_3d job #8

Open jgphpc opened 8 years ago

jgphpc commented 8 years ago

200-node job

Environment

  module use /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/modules/all
  module load Python/3.5.2-CrayGNU-2016.03
  module load h5py/2.5.0-CrayGNU-2016.03-Python-3.5.2-parallel
  module load pycuda/2016.1.2-CrayGNU-2016.03-Python-3.5.2-cuda-7.0
File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/
software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/
pyfr-1.4.0-py3.5.egg/pyfr/plugins/nancheck.py", line 21, 
in __call__ RuntimeError: NaNs detected at t = 0.15000000000000036

TODO: 4000-node job

iyer-arvind commented 8 years ago

This does not seem to be a GPU-related issue. May need to try again, though!

jgphpc commented 8 years ago

Failed jobs

job464218: uncorrectable ECC error (200 nodes, crashed after 1h06)

 100.0% [++++++++++++++++++++++++++> ] 224.32/224.33 ela: 00:50:41 rem: 00:01:20

  Traceback (most recent call last):
  X=/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pyfr-1.4.0-py3.5.egg/pyfr/
    File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
    File "$X/scripts/main.py", line 109, in main
    File "$X/scripts/main.py", line 248, in process_restart
    File "$X/scripts/main.py", line 225, in _process_common
    File "$X/integrators/base.py", line 197, in run
    File "$X/integrators/std/controllers.py", line 72, in advance_to
    File "$X/integrators/std/steppers.py", line 201, in step
    File "$X/solvers/navstokes/system.py", line 55, in rhs
    File "$X/backends/base/backend.py", line 163, in runall
    File "$X/backends/cuda/types.py", line 133, in runall
      return self
    File "$X/backends/cuda/types.py", line 105, in _wait
      "of the metaclasses of all its bases")
  pycuda._driver.RuntimeError: cuStreamSynchronize failed: uncorrectable ECC error encountered
  Rank 31 [Fri Aug 26 16:54:48 2016] [c7-0c0s15n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 31
  /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
    warn("%s exiting upon apparent death of %s" % (who, partner))
jgphpc commented 8 years ago

200-node job

Summary:

cuStreamSynchronize failed: unknown error (GPU has fallen off the bus)

stacktrace (after 18 min):

  99.9% [++++++++++++++++++++++++++> ] 224.10/224.33 ela: 00:18:34 rem: 00:40:09
Traceback (most recent call last):
    File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/
Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
    File "$X/pyfr/scripts/main.py", line 109, in main
    File "$X/pyfr/scripts/main.py", line 248, in process_restart
    File "$X/pyfr/scripts/main.py", line 225, in _process_common
    File "$X/pyfr/integrators/base.py", line 197, in run
    File "$X/pyfr/integrators/std/controllers.py", line 72, in advance_to
    File "$X/pyfr/integrators/std/steppers.py", line 201, in step
    File "$X/pyfr/solvers/navstokes/system.py", line 43, in rhs
    File "$X/pyfr/backends/base/backend.py", line 163, in runall
    File "$X/pyfr/backends/cuda/types.py", line 133, in runall
      return self
    File "$X/pyfr/backends/cuda/types.py", line 105, in _wait
      "of the metaclasses of all its bases")
  pycuda._driver.Error: cuStreamSynchronize failed: unknown error
  Rank 140 [Fri Aug 26 21:36:29 2016] [c8-1c1s2n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 140
  /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
    warn("%s exiting upon apparent death of %s" % (who, partner))

cuStreamSynchronize failed: illegal memory access

stacktrace:

    File "/apps/common/UES/sandbox/jgp/ebforpyfr/
easybuild/software/Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
    File "$X/pyfr/scripts/main.py", line 109, in main
    File "$X/pyfr/scripts/main.py", line 248, in process_restart
    File "$X/pyfr/scripts/main.py", line 225, in _process_common
    File "$X/pyfr/integrators/base.py", line 197, in run
    File "$X/pyfr/integrators/std/controllers.py", line 72, in advance_to
    File "$X/pyfr/integrators/std/steppers.py", line 201, in step
    File "$X/pyfr/solvers/navstokes/system.py", line 55, in rhs
    File "$X/pyfr/backends/base/backend.py", line 163, in runall
    File "$X/pyfr/backends/cuda/types.py", line 133, in runall return self
    File "$X/pyfr/backends/cuda/types.py", line 105, in _wait
      "of the metaclasses of all its bases")
  pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered <-----
  Rank 45 [Sat Aug 27 01:07:13 2016] [c2-0c1s14n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 45
  /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
    warn("%s exiting upon apparent death of %s" % (who, partner))

cuStreamSynchronize failed: uncorrectable ECC error

stacktrace:

    File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/
Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
    File "$X/pyfr/scripts/main.py", line 109, in main
    File "$X/pyfr/scripts/main.py", line 248, in process_restart
    File "$X/pyfr/scripts/main.py", line 225, in _process_common
    File "$X/pyfr/integrators/base.py", line 197, in run
    File "$X/pyfr/integrators/std/controllers.py", line 72, in advance_to
    File "$X/pyfr/integrators/std/steppers.py", line 201, in step
    File "$X/pyfr/solvers/navstokes/system.py", line 78, in rhs
    File "$X/pyfr/backends/base/backend.py", line 163, in runall
    File "$X/pyfr/backends/cuda/types.py", line 133, in runall
      return self
    File "$X/pyfr/backends/cuda/types.py", line 105, in _wait
      "of the metaclasses of all its bases")
  pycuda._driver.RuntimeError: cuStreamSynchronize failed: uncorrectable ECC error encountered
  Rank 30 [Fri Aug 26 20:05:09 2016] [c7-0c0s15n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 30
  /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
    warn("%s exiting upon apparent death of %s" % (who, partner))

NaNs detected

stacktrace:

  Rank 123 [Fri Aug 26 21:30:06 2016] [c5-1c2s4n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 123
  Traceback (most recent call last):
    File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/
Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
    File "$X/pyfr/scripts/main.py", line 109, in main
    File "$X/pyfr/scripts/main.py", line 248, in process_restart
    File "$X/pyfr/scripts/main.py", line 225, in _process_common
    File "$X/pyfr/integrators/base.py", line 197, in run
    File "$X/pyfr/integrators/std/controllers.py", line 75, in advance_to
    File "$X/pyfr/integrators/std/controllers.py", line 36, in _accept_ste  p
    File "$X/pyfr/util.py", line 48, in __call__
    File "$X/pyfr/util.py", line 48, in <genexpr>
    File "$X/pyfr/plugins/nancheck.py", line 21, in __call__
  RuntimeError: NaNs detected at t = 224.14999999998912

success

CANCELLED DUE TO TIME LIMIT

jgphpc commented 8 years ago

@pmessmer is there a way to increase the debug level regarding the cuStreamSynchronize errors?

Reply from Peter: I am not aware of a way to increase the verbosity of the stream-sync errors, but given the variety of failures I would expect these to be just the effect, not the cause. Can you try running with the environment variable CUDA_LAUNCH_BLOCKING=1 set, so we at least get rid of some of the asynchronicity? Also, any chance to run under cuda-memcheck, maybe even with the racecheck tool?
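
A sketch of what such a debug run could look like (the exact pyfr command line and the mesh/solution file names are placeholders; cuda-memcheck slows things down considerably, so a smaller node count is advisable):

  # serialise kernel launches so a failure is reported at the offending call
  export CUDA_LAUNCH_BLOCKING=1

  # wrap every rank in cuda-memcheck; --tool racecheck hunts shared-memory races
  srun -n $SLURM_NTASKS --ntasks-per-node=1 \
       cuda-memcheck --tool racecheck \
       pyfr restart -b cuda -p mesh.pyfrm soln.pyfrs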

slurm-469074.out: OK

slurm-469077.out: OK

slurm-469075.out: OK

slurm-469076.out: OK

jgphpc commented 8 years ago

What is the number of streams used in PyFR?

I think we use 2 streams, and I doubt we can reduce that further. They represent two lines of operations which can be run asynchronously.

pmessmer commented 8 years ago

see above.

jgphpc commented 8 years ago

PMI_MMAP_SYNC_WAIT_TIME

=> fixed with PMI_MMAP_SYNC_WAIT_TIME=4800

The most critical scaling issue, however, was that the very largest jobs would frequently fail to run. When a job requested a large number of tasks (e.g., more than 60,000), the job step would exit with a "PMI2 failed to initialize" message. In the end, we found that this was due to a difference in behaviour between srun and aprun: by default, aprun copies executables prior to executing them, whereas srun does not. For most small to medium jobs the srun behaviour is probably fine, if not better. However, running a large number of ranks directly from the parallel filesystem (Lustre, DVS, whatever) would fail because the filesystem could not deliver the executable at that level of parallelism within the default ALPS timeout of 60 s. The workaround is to set PMI_MMAP_SYNC_WAIT_TIME=300 in the application environment, which raises the timeout from 60 s to 300 s. The longer-term solution was a feature that SchedMD implemented in later versions of 15.08, which merged the functionality of srun and sbcast (srun --bcast) to automatically copy the executable prior to execution. A further improvement enabling compression is coming in 16.05; that is expected to put srun job startup on the same level as aprun.
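
In job-script terms the two workarounds amount to something like this sketch (the broadcast destination and the launch line are placeholders, and --bcast needs a Slurm release that includes the srun/sbcast merge):

  # give PMI more time to deliver the executable from the parallel filesystem
  # (60 s default; 300 s was suggested above, 4800 s is what we ended up using)
  export PMI_MMAP_SYNC_WAIT_TIME=4800

  # or copy the executable to node-local storage first, mimicking aprun
  srun --bcast=/tmp/pyfr.exe -n $SLURM_NTASKS /path/to/pyfr ...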

jgphpc commented 8 years ago

4000 cnodes

RuntimeError: make_default_context

  Traceback (most recent call last):
    File "/tmp/iyerarv/env/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.3.0', 'console_scripts', 'pyfr')()
    File "$X/scripts/main.py", line 110, in main
    File "$X/scripts/main.py", line 253, in process_restart
    File "$X/scripts/main.py", line 210, in _process_common
    File "$X/backends/__init__.py", line 11, in get_backend
    File "$X/backends/cuda/base.py", line 33, in __init__
    File "$Y/autoinit.py", line 9, in <module>
      context = make_default_context()
    File "/tmp/iyerarv/env/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.3.0', 'console_scripts', 'pyfr')()
    File "$X/scripts/main.py", line 110, in main
    File "$X/scripts/main.py", line 253, in process_restart
    File "$X/scripts/main.py", line 210, in _process_common
    File "$X/backends/__init__.py", line 11, in get_backend
    File "$X/backends/cuda/base.py", line 33, in __init__
    File "$Y/autoinit.py", line 9, in <module>
      context = make_default_context()
    File "$Y/tools.py", line 204, in make_default_context
      "on any of the %d detected devices" % ndevices)
  RuntimeError: make_default_context() wasn't able to create a context on any of the 1 detected devices
  x86_64.egg/pycuda/tools.py", line 204, in make_default_context
      "on any of the %d detected devices" % ndevices)
  RuntimeError: make_default_context() wasn't able to create a context on any of the 1 detected devices
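
make_default_context() failing on the single detected device usually means the GPU on that particular node is unhealthy or unreachable. A cheap pre-flight test over the whole allocation, before committing a 4000-node run, could be (sketch):

  # one task per node: check that every node can enumerate and talk to its GPU
  srun -N $SLURM_NNODES --ntasks-per-node=1 nvidia-smi -L \
      || echo "at least one node failed to see its GPU"
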
jgphpc commented 8 years ago

4000 cnodes

inet_connect failed (1 Sept, just after reboot)

CUDA_LAUNCH_BLOCKING=1
PMI_MMAP_SYNC_WAIT_TIME=4800

  Thu Sep  1 21:08:34 2016: [PE_1341]:inet_connect:inet_connect: connect failed after 301 attempts
  Thu Sep  1 21:08:34 2016: [PE_1341]:_pmi_inet_setup:inet_connect failed
  Thu Sep  1 21:08:34 2016: [PE_1341]:_pmi_init:_pmi_inet_setup (full) returned -1

The following nodes are down: 02856, 03604, 00498, 01456, 04170
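
For the record, the down/drained nodes and the reason recorded by the scheduler can be listed directly (sketch; node names as reported above):

  # all down/drained/failing nodes together with the reason set by Slurm/NHC
  sinfo -R

  # or inspect a single node explicitly
  scontrol show node nid02856 | egrep 'State|Reason'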

jgphpc commented 8 years ago

4000 cnodes

_pmi_inet_setup:inet_connect failed

daint: pmi/5.0.10-1.0000.11050.0.0.ari
brisi: pmi/5.0.10-1.0000.11050.0.0.ari
PMI_MMAP_SYNC_WAIT_TIME=4800

job490156: ticket #24624

[PE_63]:inet_connect:inet_connect: connect failed after 301 attempts
[PE_63]:_pmi_inet_setup:inet_connect failed
[PE_63]:_pmi_init:_pmi_inet_setup (full) returned -1

Those nodes had network issues, [but] at a different time than what your job failure suggests:

2016-09-04T05:58:43.419474+02:00 c1-1c2s11n0 LNet: Quiesce start: hardware quiesce
2016-09-04T05:59:11.432366+02:00 c1-1c2s11n0 LNet: Quiesce complete: hardware quiesce
2016-09-04T15:12:33.482771+02:00 c1-1c2s11n0 WARNING: mem_cgroup_force_empty: 
memory usage of 126976 bytes or ret 0 in cgroup: /slurm/uid_23556/job_493432/step_0
crayadm@daintsmw:/var/opt/cray/log/p0-current> grep c1-1c2s11n0 console-20160905
2016-09-05T01:51:09.182267+02:00 c1-1c2s11n0 Lustre: 24521:0:(client.c:1910:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: 
[sent 1473032974/real 1473032974] req@ffff8807244e8800 x1544286731626040/t0(0) 
o400->snx11026-OST0075-osc-ffff88083cb62800@148.187.5.36@o2ib1013:28/4 lens 224/224 e 0 to 1 dl 1473033069 ref 1 fl Rpc:XNU/0/ffffffff rc 0/-1
2016-09-05T01:51:09.257834+02:00 c1-1c2s11n0 Lustre: 24514:0:(client.c:1910:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: 
[sent 1473032974/real 1473032974] req@ffff8807bfede400 x1544286731626012/t0(0) 
o400->snx11026-OST006e-osc-ffff88083cb62800@148.187.5.34@o2ib1013:28/4 lens 224/224 e 0 to 1 dl 1473033069 ref 1 fl Rpc:XNU/0/ffffffff rc 0/-1
[...]
2016-09-05T02:01:13.960753+02:00 c1-1c2s11n0 Lustre: Skipped 1 previous similar message
2016-09-05T10:02:15.112972+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (xtnhc) FAILURES: The following tests from the 'reservation' set have failed in normal mode:
2016-09-05T10:02:15.138167+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (Reservation_Test) WARNING: Directory /proc/reservations/489959 still exists
2016-09-05T10:02:15.163368+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (Reservation_Test) WARNING: Reservation 489959 has status: rid 489959 flags ENDED jobs 9809705304225
2016-09-05T10:02:15.163411+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (xtnhc) FAILURES: (Admindown) Reservation_Test
2016-09-05T10:02:15.188592+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (xtnhc) FAILURES: End of list of 1 failed test(s)
2016-09-05T10:02:17.774901+02:00 c1-0c0s0n1 <node_health:5.1> RESID:489959 (xtcheckhealth) WARNING: Set node 2284 (c1-1c2s11n0) to suspect because the node failed a health test.

job492643: ticket #24625

[PE_109]:inet_connect:inet_connect: connect failed after 301 attempts
[PE_109]:_pmi_inet_setup:inet_connect failed
[PE_109]:_pmi_init:_pmi_inet_setup (full) returned -1

job490145.4 did not fail (it ran for 23 minutes) but was cancelled by the reservation

jgphpc commented 8 years ago

CUDA_LAUNCH_BLOCKING

(200 nodes * 40 jobs) results

  export PMI_MMAP_SYNC_WAIT_TIME=4800
  job01: export CUDA_LAUNCH_BLOCKING=1
  job02: unset CUDA_LAUNCH_BLOCKING
  job03: export CUDA_LAUNCH_BLOCKING=1
  job04: unset CUDA_LAUNCH_BLOCKING
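
As a sketch, each of the job scripts above boils down to something like the following (the pyfr command line and file names are placeholders; job01/job03 keep the export, job02/job04 drop it):

  export PMI_MMAP_SYNC_WAIT_TIME=4800

  # job01/job03 scripts set this, job02/job04 leave it unset
  export CUDA_LAUNCH_BLOCKING=1

  # four back-to-back runs per job, matching the o_<jobid>-loopNN outputs below
  for loop in 01 02 03 04; do
      srun -n $SLURM_NTASKS pyfr restart -b cuda -p mesh.pyfrm soln.pyfrs \
          > o_${SLURM_JOB_ID}-loop${loop} 2>&1
  done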

_pmi_inet_setup #24645

Tue Sep  6 01:55:21 2016: [PE_98]:_pmi_inet_setup:inet_connect failed
Tue Sep  6 01:55:21 2016: [PE_98]:_pmi_init:_pmi_inet_setup (full) returned -1
[Tue Sep  6 01:55:24 2016] [c9-0c0s3n2] 
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(547):
MPID_Init(203).......: channel initialization failed
MPID_Init(584).......:  PMI2 init failed: 1

slurmctld 4199 p0-20160901t141153 - error: Job 497289 has zero end_time https://bugs.schedmd.com/show_bug.cgi?id=3053

Expired credential #24644

   piccinal@daint01:/scratch/daint/piccinal/24315/TR2/0200cn/06 $ cat o_497285-loop01
    99.9% [+> ] 224.00/224.33 ela: 00:01:51 rem: 11:18:49
  slurmstepd: Munge decode failed: Expired credential
  slurmstepd: Verifying authentication credential: Expired credential
   piccinal@daint01:/scratch/daint/piccinal/24315/TR2/0200cn/07 $ cat o_496033-loop01
    99.9% [+> ] 224.01/224.33 ela: 00:23:31 rem: 09:05:13
  slurmstepd: Munge decode failed: Expired credential
  slurmstepd: Verifying authentication credential: Expired credential

Some nodes seem to have problems reaching back to the sdb and Slurm. It looks like some /var entries are stale. There is not much more we can get out of NHC.

successful jobs

  /scratch/daint/piccinal/24315/TR2/0200cn/10/o_497282-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/09/o_497283-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/08/o_497284-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/05/o_497286-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/04/o_497287-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/03/o_497288-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/01/o_497290-loop*

  o_497282-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:55:47 rem: 00:00:00
  o_497282-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:26 rem: 00:00:00
  o_497282-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
  o_497282-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:22 rem: 00:00:00

  o_497283-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:55:53 rem: 00:00:00
  o_497283-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:21 rem: 00:00:00
  o_497283-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:23 rem: 00:00:00
  o_497283-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00

  o_497284-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:54:00 rem: 00:00:00
  o_497284-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:22 rem: 00:00:00
  o_497284-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:16 rem: 00:00:00
  o_497284-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:28 rem: 00:00:00

  o_497286-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:54:09 rem: 00:00:00
  o_497286-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:17 rem: 00:00:00
  o_497286-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:19 rem: 00:00:00
  o_497286-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:19 rem: 00:00:00

  o_497287-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:58:34 rem: 00:00:00
  o_497287-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:31 rem: 00:00:00
  o_497287-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:24 rem: 00:00:00
  o_497287-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00

  o_497288-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:58:42 rem: 00:00:00
  o_497288-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:22 rem: 00:00:00
  o_497288-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:24 rem: 00:00:00
  o_497288-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:19 rem: 00:00:00

  o_497290-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:58:21 rem: 00:00:00
  o_497290-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:18 rem: 00:00:00
  o_497290-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:20 rem: 00:00:00
  o_497290-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00
pmessmer commented 8 years ago

Were the failing jobs the ones with CUDA_LAUNCH_BLOCKING set or unset?

And so there were no more errors of the form: cudaMemcpyHostToDevice Async error :: ret code: an illegal memory access was encountered?

jgphpc commented 8 years ago

@pmessmer

Were the failing jobs the ones with CUDA_LAUNCH_BLOCKING set or unset?

  • _pmi_inet_setup: job497289/o_497289-loop0[1-4]: this failure happens with $CUDA_LAUNCH_BLOCKING=1 (job01 and job03) and without $CUDA_LAUNCH_BLOCKING=1 (job02 and job04).
  • Expired credential: job01 ($CUDA_LAUNCH_BLOCKING=1) hangs, but this looks more like a Slurm issue.

an illegal memory access was encountered

  • We did not get the error, probably because I was lucky enough to get good GPUs. It is not deterministic.
pmessmer commented 8 years ago

So the pmi_inet_setup problem is more an MPI issue than a GPU-related one. Do we know what this code does before MPI_Init()?

But it seems like even without CUDA_LAUNCH_BLOCKING we are no longer getting the cudaMemcpyAsync problem. Maybe that one was only the manifestation of a different issue?

jgphpc commented 8 years ago

CUDA_LAUNCH_BLOCKING (6 Sept.)

200-node job results:

  export PMI_MMAP_SYNC_WAIT_TIME=4800
  job01: export CUDA_LAUNCH_BLOCKING=1
  job02: unset CUDA_LAUNCH_BLOCKING
  job03: export CUDA_LAUNCH_BLOCKING=1
  job04: unset CUDA_LAUNCH_BLOCKING

1 cuStreamSynchronize failed: unknown error #24661

99.9% [++++++++++++++++++++++++++> ] 224.02/224.33 ela: 00:11:14 rem: 03:22:05
Traceback (most recent call last):
  File " pyfr-1.4.0-py3.5.egg/pyfr/backends/cuda/types.py", line 105, in 
        _wait "of the metaclasses of all its bases")
pycuda._driver.Error: cuStreamSynchronize failed: unknown error
Rank 160 [Tue Sep  6 17:32:09 2016] [c9-1c1s0n2] 
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 160

c9-1c1s0n2/nid03714

NodeName=nid03714
State=IDLE+DRAIN     => not usable by user jobs
Reason=admindown by NHC [root@Ystday 17:52]
2016-09-06T18:07:28.263027+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) GPU warm reset failed after accelerator test failure
2016-09-06T18:07:28.288164+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (Plugin) WARNING: Process (/opt/cray/nodehealth/default/bin/gat.sh -m 10% -r) returned with exit code 1
2016-09-06T18:07:58.311791+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GPU_TEST) WARNING: ACC: nvidia_test.c:199 ERROR - CUDA_ERROR_NO_DEVICE
2016-09-06T18:07:58.343542+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) Reset output: Unable to determine the PCI bus id for the target device: GPU is lost
2016-09-06T18:07:58.343560+02:00 c9-1c1s0n2 Error executing real-nvidia-smi -r -i 0
2016-09-06T18:07:58.343576+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) GPU warm reset failed after accelerator test failure
2016-09-06T18:07:58.406152+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (Plugin) WARNING: Process (/opt/cray/nodehealth/default/bin/gat.sh -m 10% -r) returned with exit code 1
2016-09-06T18:08:24.764769+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (xtnhc) FAILURES: The following tests from the 'application' set have failed in suspect mode:
2016-09-06T18:08:24.790020+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GPU_TEST) WARNING: ACC: nvidia_test.c:199 ERROR - CUDA_ERROR_NO_DEVICE
2016-09-06T18:08:24.815219+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) Reset output: Unable to determine the PCI bus id for the target device: GPU is lost
2016-09-06T18:08:24.815247+02:00 c9-1c1s0n2 Error executing real-nvidia-smi -r -i 0
2016-09-06T18:08:24.840448+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) GPU warm reset failed after accelerator test failure
2016-09-06T18:08:24.877830+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (Plugin) WARNING: Process (/opt/cray/nodehealth/default/bin/gat.sh -m 10% -r) returned with exit code 1
2016-09-06T18:08:24.877860+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (xtnhc) FAILURES: (Admindown) Plugin /opt/cray/nodehealth/default/bin/gat.sh -m 10% -r
2016-09-06T18:08:24.877873+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (xtnhc) FAILURES: End of list of 1 failed test(s)
2016-09-06T18:08:24.934588+02:00 c1-0c0s0n1 <node_health:5.1> APID:50000498868 (xtcheckhealth) WARNING: Could not set node 3714 (c9-1c1s0n2) to admindown because its state is admindown.

4 _pmi_inet_setup:inet_connect failed #24645

inet_connect:inet_connect: connect failed after 301 attempts
_pmi_inet_setup:inet_connect failed
_pmi_init:_pmi_inet_setup (full) returned -1
       JobID               Start                 End    Elapsed    JobName 
------------ ------------------- ------------------- ---------- ---------- 
498870.7     2016/09/06-21:02:04 2016/09/06-21:15:04   00:13:00     python 
    Tue Sep  6 21:13:49 2016: [PE_80]:inet_connect:inet_connect: connect failed after 301 attempts

498870.8     2016/09/06-21:15:05 2016/09/06-21:27:17   00:12:12     python 
    Tue Sep  6 21:26:01 2016: [PE_80]:inet_connect:inet_connect: connect failed after 301 attempts

498870.9     2016/09/06-21:27:19 2016/09/06-21:39:27   00:12:08     python 
    Tue Sep  6 21:38:10 2016: [PE_81]:inet_connect:inet_connect: connect failed after 301 attempts

498870.10    2016/09/06-21:39:29 2016/09/06-21:51:47   00:12:18     python 
    Tue Sep  6 21:50:32 2016: [PE_80]:inet_connect:inet_connect: connect failed after 301 attempts
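
The per-step timings above come from the accounting database; a query of roughly this form reproduces the table (sketch):

  # start/end/elapsed time of every step of the failing job
  sacct -j 498870 --format=JobID,Start,End,Elapsed,JobName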

32 successful jobs

/scratch/daint/piccinal/24315/TR2/0200cn/01/o_498865-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/02/o_498866-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/03/o_498867-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/05/o_498869-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/07/o_498871-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/08/o_498872-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/09/o_498873-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/10/o_498874-loop0*

 100.0% [+>] 224.33/224.33 ela: 00:57:24 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 01:02:27 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:23 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 01:01:00 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:16 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:24 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 01:00:04 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:23 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 00:55:16 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 00:54:05 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:22 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 00:54:23 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:28 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 00:54:11 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:30 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:28 rem: 00:00:00
jgphpc commented 8 years ago

4-node job

100.0% [++++++++++++++++++++++++++> ] 224.10/224.10 ela: 02:50:43 rem: 00:00:00

eff