eth-cscs / pyfr

pyfr@cscs (https://github.com/vincentlab/PyFR)

WIP: T106D_cascade_3d job #8

Open jgphpc opened 8 years ago

jgphpc commented 8 years ago

200-node job

Environment

  module use /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/modules/all
  module load Python/3.5.2-CrayGNU-2016.03
  module load h5py/2.5.0-CrayGNU-2016.03-Python-3.5.2-parallel
  module load pycuda/2016.1.2-CrayGNU-2016.03-Python-3.5.2-cuda-7.0
File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/
software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/
pyfr-1.4.0-py3.5.egg/pyfr/plugins/nancheck.py", line 21, 
in __call__ RuntimeError: NaNs detected at t = 0.15000000000000036

TODO: 4000-node job

iyer-arvind commented 8 years ago

This does not seem to be a GPU-related issue. May need to try again, though!

jgphpc commented 8 years ago

Failed jobs

job464218: uncorrectable ECC error (200 nodes, crashed after 1h06)

 100.0% [++++++++++++++++++++++++++> ] 224.32/224.33 ela: 00:50:41 rem: 00:01:20

  Traceback (most recent call last):
  X=/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pyfr-1.4.0-py3.5.egg/pyfr/
    File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
    File "$X/scripts/main.py", line 109, in main
    File "$X/scripts/main.py", line 248, in process_restart
    File "$X/scripts/main.py", line 225, in _process_common
    File "$X/integrators/base.py", line 197, in run
    File "$X/integrators/std/controllers.py", line 72, in advance_to
    File "$X/integrators/std/steppers.py", line 201, in step
    File "$X/solvers/navstokes/system.py", line 55, in rhs
    File "$X/backends/base/backend.py", line 163, in runall
    File "$X/backends/cuda/types.py", line 133, in runall
      return self
    File "$X/backends/cuda/types.py", line 105, in _wait
      "of the metaclasses of all its bases")
  pycuda._driver.RuntimeError: cuStreamSynchronize failed: uncorrectable ECC error encountered
  Rank 31 [Fri Aug 26 16:54:48 2016] [c7-0c0s15n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 31
  /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
    warn("%s exiting upon apparent death of %s" % (who, partner))
jgphpc commented 8 years ago

200-node job

Summary:

cuStreamSynchronize failed: unknown error (GPU has fallen off the bus)

stacktrace (after 18 min):

  99.9% [++++++++++++++++++++++++++> ] 224.10/224.33 ela: 00:18:34 rem: 00:40:09
Traceback (most recent call last):
    File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/
Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
    File "$X/pyfr/scripts/main.py", line 109, in main
    File "$X/pyfr/scripts/main.py", line 248, in process_restart
    File "$X/pyfr/scripts/main.py", line 225, in _process_common
    File "$X/pyfr/integrators/base.py", line 197, in run
    File "$X/pyfr/integrators/std/controllers.py", line 72, in advance_to
    File "$X/pyfr/integrators/std/steppers.py", line 201, in step
    File "$X/pyfr/solvers/navstokes/system.py", line 43, in rhs
    File "$X/pyfr/backends/base/backend.py", line 163, in runall
    File "$X/pyfr/backends/cuda/types.py", line 133, in runall
      return self
    File "$X/pyfr/backends/cuda/types.py", line 105, in _wait
      "of the metaclasses of all its bases")
  pycuda._driver.Error: cuStreamSynchronize failed: unknown error
  Rank 140 [Fri Aug 26 21:36:29 2016] [c8-1c1s2n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 140
  /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
    warn("%s exiting upon apparent death of %s" % (who, partner))

cuStreamSynchronize failed: illegal memory access

stacktrace:

    File "/apps/common/UES/sandbox/jgp/ebforpyfr/
easybuild/software/Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
    File "$X/pyfr/scripts/main.py", line 109, in main
    File "$X/pyfr/scripts/main.py", line 248, in process_restart
    File "$X/pyfr/scripts/main.py", line 225, in _process_common
    File "$X/pyfr/integrators/base.py", line 197, in run
    File "$X/pyfr/integrators/std/controllers.py", line 72, in advance_to
    File "$X/pyfr/integrators/std/steppers.py", line 201, in step
    File "$X/pyfr/solvers/navstokes/system.py", line 55, in rhs
    File "$X/pyfr/backends/base/backend.py", line 163, in runall
    File "$X/pyfr/backends/cuda/types.py", line 133, in runall return self
    File "$X/pyfr/backends/cuda/types.py", line 105, in _wait
      "of the metaclasses of all its bases")
  pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered <-----
  Rank 45 [Sat Aug 27 01:07:13 2016] [c2-0c1s14n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 45
  /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
    warn("%s exiting upon apparent death of %s" % (who, partner))

cuStreamSynchronize failed: uncorrectable ECC error

stacktrace:

    File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/
Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
    File "$X/pyfr/scripts/main.py", line 109, in main
    File "$X/pyfr/scripts/main.py", line 248, in process_restart
    File "$X/pyfr/scripts/main.py", line 225, in _process_common
    File "$X/pyfr/integrators/base.py", line 197, in run
    File "$X/pyfr/integrators/std/controllers.py", line 72, in advance_to
    File "$X/pyfr/integrators/std/steppers.py", line 201, in step
    File "$X/pyfr/solvers/navstokes/system.py", line 78, in rhs
    File "$X/pyfr/backends/base/backend.py", line 163, in runall
    File "$X/pyfr/backends/cuda/types.py", line 133, in runall
      return self
    File "$X/pyfr/backends/cuda/types.py", line 105, in _wait
      "of the metaclasses of all its bases")
  pycuda._driver.RuntimeError: cuStreamSynchronize failed: uncorrectable ECC error encountered
  Rank 30 [Fri Aug 26 20:05:09 2016] [c7-0c0s15n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 30
  /apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
    warn("%s exiting upon apparent death of %s" % (who, partner))

NaNs detected

stacktrace:

  Rank 123 [Fri Aug 26 21:30:06 2016] [c5-1c2s4n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 123
  Traceback (most recent call last):
    File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/
Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
    File "$X/pyfr/scripts/main.py", line 109, in main
    File "$X/pyfr/scripts/main.py", line 248, in process_restart
    File "$X/pyfr/scripts/main.py", line 225, in _process_common
    File "$X/pyfr/integrators/base.py", line 197, in run
    File "$X/pyfr/integrators/std/controllers.py", line 75, in advance_to
    File "$X/pyfr/integrators/std/controllers.py", line 36, in _accept_ste  p
    File "$X/pyfr/util.py", line 48, in __call__
    File "$X/pyfr/util.py", line 48, in <genexpr>
    File "$X/pyfr/plugins/nancheck.py", line 21, in __call__
  RuntimeError: NaNs detected at t = 224.14999999998912

success

CANCELLED DUE TO TIME LIMIT

jgphpc commented 8 years ago

@pmessmer is there a way to increase the debug level regarding the cuStreamSynchronize errors?

Reply from Peter: I am not aware of a way to increase the verbosity of the stream-sync errors, but given the variety of failures I would expect these to be just the effect, not the cause. Can you try running with the environment variable CUDA_LAUNCH_BLOCKING=1 set, so we at least get rid of some of the asynchronicity? Also, any chance to run under cuda-memcheck, maybe even with the racecheck tool?
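
A sketch of what such a debug run could look like (the exact pyfr command line and the mesh/solution file names are placeholders; cuda-memcheck slows things down considerably, so a smaller node count is advisable):

  # serialise kernel launches so a failure is reported at the offending call
  export CUDA_LAUNCH_BLOCKING=1

  # wrap every rank in cuda-memcheck; --tool racecheck hunts shared-memory races
  srun -n $SLURM_NTASKS --ntasks-per-node=1 \
       cuda-memcheck --tool racecheck \
       pyfr restart -b cuda -p mesh.pyfrm soln.pyfrs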

slurm-469074.out: OK

slurm-469077.out: OK

slurm-469075.out: OK

slurm-469076.out: OK

jgphpc commented 8 years ago

What is the number of streams used in PyFR?

I think we use 2 streams, and I doubt we can reduce that further. They represent two lines of operations which can be run asynchronously.

pmessmer commented 8 years ago

see above.

jgphpc commented 8 years ago

PMI_MMAP_SYNC_WAIT_TIME

=> fixed with PMI_MMAP_SYNC_WAIT_TIME=4800

The most critical scaling issue, however, was that the very largest jobs would frequently fail to run. When a job requested a large number of tasks (e.g., more than 60,000), the job step would exit with a "PMI2 failed to initialize" message. In the end, we found that this was due to a difference in behaviour between srun and aprun: by default, aprun copies executables prior to executing them, whereas srun does not. For most small to medium jobs the srun behaviour is probably fine, if not better. However, running a large number of ranks directly from the parallel filesystem (Lustre, DVS, whatever) would fail because the filesystem could not deliver the executable at that level of parallelism within the default ALPS timeout of 60 s. The workaround is to set PMI_MMAP_SYNC_WAIT_TIME=300 in the application environment, which raises the timeout from 60 s to 300 s. The longer-term solution was a feature that SchedMD implemented in later versions of 15.08, which merged the functionality of srun and sbcast (srun --bcast) to automatically copy the executable prior to execution. A further improvement enabling compression is coming in 16.05; that is expected to put srun job startup on the same level as aprun.
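
In job-script terms the two workarounds amount to something like this sketch (the broadcast destination and the launch line are placeholders, and --bcast needs a Slurm release that includes the srun/sbcast merge):

  # give PMI more time to deliver the executable from the parallel filesystem
  # (60 s default; 300 s was suggested above, 4800 s is what we ended up using)
  export PMI_MMAP_SYNC_WAIT_TIME=4800

  # or copy the executable to node-local storage first, mimicking aprun
  srun --bcast=/tmp/pyfr.exe -n $SLURM_NTASKS /path/to/pyfr ...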

jgphpc commented 8 years ago

4000 cnodes

RuntimeError: make_default_context

  Traceback (most recent call last):
    File "/tmp/iyerarv/env/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.3.0', 'console_scripts', 'pyfr')()
    File "$X/scripts/main.py", line 110, in main
    File "$X/scripts/main.py", line 253, in process_restart
    File "$X/scripts/main.py", line 210, in _process_common
    File "$X/backends/__init__.py", line 11, in get_backend
    File "$X/backends/cuda/base.py", line 33, in __init__
    File "$Y/autoinit.py", line 9, in <module>
      context = make_default_context()
    File "/tmp/iyerarv/env/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.3.0', 'console_scripts', 'pyfr')()
    File "$X/scripts/main.py", line 110, in main
    File "$X/scripts/main.py", line 253, in process_restart
    File "$X/scripts/main.py", line 210, in _process_common
    File "$X/backends/__init__.py", line 11, in get_backend
    File "$X/backends/cuda/base.py", line 33, in __init__
    File "$Y/autoinit.py", line 9, in <module>
      context = make_default_context()
    File "$Y/tools.py", line 204, in make_default_context
      "on any of the %d detected devices" % ndevices)
  RuntimeError: make_default_context() wasn't able to create a context on any of the 1 detected devices
  x86_64.egg/pycuda/tools.py", line 204, in make_default_context
      "on any of the %d detected devices" % ndevices)
  RuntimeError: make_default_context() wasn't able to create a context on any of the 1 detected devices
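
make_default_context() failing on the single detected device usually means the GPU on that particular node is unhealthy or unreachable. A cheap pre-flight test over the whole allocation, before committing a 4000-node run, could be (sketch):

  # one task per node: check that every node can enumerate and talk to its GPU
  srun -N $SLURM_NNODES --ntasks-per-node=1 nvidia-smi -L \
      || echo "at least one node failed to see its GPU"
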
jgphpc commented 8 years ago

4000 cnodes

inet_connect failed (1 Sept, just after reboot)

CUDA_LAUNCH_BLOCKING=1
PMI_MMAP_SYNC_WAIT_TIME=4800

  Thu Sep  1 21:08:34 2016: [PE_1341]:inet_connect:inet_connect: connect failed after 301 attempts
  Thu Sep  1 21:08:34 2016: [PE_1341]:_pmi_inet_setup:inet_connect failed
  Thu Sep  1 21:08:34 2016: [PE_1341]:_pmi_init:_pmi_inet_setup (full) returned -1

The following nodes are down: 02856, 03604, 00498, 01456, 04170
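
For the record, the down/drained nodes and the reason recorded by the scheduler can be listed directly (sketch; node names as reported above):

  # all down/drained/failing nodes together with the reason set by Slurm/NHC
  sinfo -R

  # or inspect a single node explicitly
  scontrol show node nid02856 | egrep 'State|Reason'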

jgphpc commented 8 years ago

4000 cnodes

_pmi_inet_setup:inet_connect failed

daint: pmi/5.0.10-1.0000.11050.0.0.ari
brisi: pmi/5.0.10-1.0000.11050.0.0.ari
PMI_MMAP_SYNC_WAIT_TIME=4800

job490156: ticket #24624

[PE_63]:inet_connect:inet_connect: connect failed after 301 attempts
[PE_63]:_pmi_inet_setup:inet_connect failed
[PE_63]:_pmi_init:_pmi_inet_setup (full) returned -1

Those nodes had network issues, [but] at a different time than what your job failure suggests:

2016-09-04T05:58:43.419474+02:00 c1-1c2s11n0 LNet: Quiesce start: hardware quiesce
2016-09-04T05:59:11.432366+02:00 c1-1c2s11n0 LNet: Quiesce complete: hardware quiesce
2016-09-04T15:12:33.482771+02:00 c1-1c2s11n0 WARNING: mem_cgroup_force_empty: 
memory usage of 126976 bytes or ret 0 in cgroup: /slurm/uid_23556/job_493432/step_0
crayadm@daintsmw:/var/opt/cray/log/p0-current> grep c1-1c2s11n0 console-20160905
2016-09-05T01:51:09.182267+02:00 c1-1c2s11n0 Lustre: 24521:0:(client.c:1910:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: 
[sent 1473032974/real 1473032974] req@ffff8807244e8800 x1544286731626040/t0(0) 
o400->snx11026-OST0075-osc-ffff88083cb62800@148.187.5.36@o2ib1013:28/4 lens 224/224 e 0 to 1 dl 1473033069 ref 1 fl Rpc:XNU/0/ffffffff rc 0/-1
2016-09-05T01:51:09.257834+02:00 c1-1c2s11n0 Lustre: 24514:0:(client.c:1910:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: 
[sent 1473032974/real 1473032974] req@ffff8807bfede400 x1544286731626012/t0(0) 
o400->snx11026-OST006e-osc-ffff88083cb62800@148.187.5.34@o2ib1013:28/4 lens 224/224 e 0 to 1 dl 1473033069 ref 1 fl Rpc:XNU/0/ffffffff rc 0/-1
[...]
2016-09-05T02:01:13.960753+02:00 c1-1c2s11n0 Lustre: Skipped 1 previous similar message
2016-09-05T10:02:15.112972+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (xtnhc) FAILURES: The following tests from the 'reservation' set have failed in normal mode:
2016-09-05T10:02:15.138167+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (Reservation_Test) WARNING: Directory /proc/reservations/489959 still exists
2016-09-05T10:02:15.163368+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (Reservation_Test) WARNING: Reservation 489959 has status: rid 489959 flags ENDED jobs 9809705304225
2016-09-05T10:02:15.163411+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (xtnhc) FAILURES: (Admindown) Reservation_Test
2016-09-05T10:02:15.188592+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (xtnhc) FAILURES: End of list of 1 failed test(s)
2016-09-05T10:02:17.774901+02:00 c1-0c0s0n1 <node_health:5.1> RESID:489959 (xtcheckhealth) WARNING: Set node 2284 (c1-1c2s11n0) to suspect because the node failed a health test.

job492643: ticket #24625

[PE_109]:inet_connect:inet_connect: connect failed after 301 attempts
[PE_109]:_pmi_inet_setup:inet_connect failed
[PE_109]:_pmi_init:_pmi_inet_setup (full) returned -1

job490145.4 did not fail (it ran for 23 minutes) but was cancelled by the reservation

jgphpc commented 8 years ago

CUDA_LAUNCH_BLOCKING

(200 nodes * 40 jobs) results

  export PMI_MMAP_SYNC_WAIT_TIME=4800
  job01: export CUDA_LAUNCH_BLOCKING=1
  job02: unset CUDA_LAUNCH_BLOCKING
  job03: export CUDA_LAUNCH_BLOCKING=1
  job04: unset CUDA_LAUNCH_BLOCKING
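
As a sketch, each of the job scripts above boils down to something like the following (the pyfr command line and file names are placeholders; job01/job03 keep the export, job02/job04 drop it):

  export PMI_MMAP_SYNC_WAIT_TIME=4800

  # job01/job03 scripts set this, job02/job04 leave it unset
  export CUDA_LAUNCH_BLOCKING=1

  # four back-to-back runs per job, matching the o_<jobid>-loopNN outputs below
  for loop in 01 02 03 04; do
      srun -n $SLURM_NTASKS pyfr restart -b cuda -p mesh.pyfrm soln.pyfrs \
          > o_${SLURM_JOB_ID}-loop${loop} 2>&1
  done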

_pmi_inet_setup #24645

Tue Sep  6 01:55:21 2016: [PE_98]:_pmi_inet_setup:inet_connect failed
Tue Sep  6 01:55:21 2016: [PE_98]:_pmi_init:_pmi_inet_setup (full) returned -1
[Tue Sep  6 01:55:24 2016] [c9-0c0s3n2] 
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(547):
MPID_Init(203).......: channel initialization failed
MPID_Init(584).......:  PMI2 init failed: 1

slurmctld 4199 p0-20160901t141153 - error: Job 497289 has zero end_time https://bugs.schedmd.com/show_bug.cgi?id=3053

Expired credential #24644

   piccinal@daint01:/scratch/daint/piccinal/24315/TR2/0200cn/06 $ cat o_497285-loop01
    99.9% [+> ] 224.00/224.33 ela: 00:01:51 rem: 11:18:49
  slurmstepd: Munge decode failed: Expired credential
  slurmstepd: Verifying authentication credential: Expired credential
   piccinal@daint01:/scratch/daint/piccinal/24315/TR2/0200cn/07 $ cat o_496033-loop01
    99.9% [+> ] 224.01/224.33 ela: 00:23:31 rem: 09:05:13
  slurmstepd: Munge decode failed: Expired credential
  slurmstepd: Verifying authentication credential: Expired credential

Some nodes seem to have problems reaching back to the sdb and Slurm. It looks like some /var entries are stale. There is not much more we can get out of NHC.

successful jobs

  /scratch/daint/piccinal/24315/TR2/0200cn/10/o_497282-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/09/o_497283-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/08/o_497284-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/05/o_497286-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/04/o_497287-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/03/o_497288-loop*
  /scratch/daint/piccinal/24315/TR2/0200cn/01/o_497290-loop*

  o_497282-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:55:47 rem: 00:00:00
  o_497282-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:26 rem: 00:00:00
  o_497282-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
  o_497282-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:22 rem: 00:00:00

  o_497283-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:55:53 rem: 00:00:00
  o_497283-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:21 rem: 00:00:00
  o_497283-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:23 rem: 00:00:00
  o_497283-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00

  o_497284-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:54:00 rem: 00:00:00
  o_497284-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:22 rem: 00:00:00
  o_497284-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:16 rem: 00:00:00
  o_497284-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:28 rem: 00:00:00

  o_497286-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:54:09 rem: 00:00:00
  o_497286-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:17 rem: 00:00:00
  o_497286-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:19 rem: 00:00:00
  o_497286-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:19 rem: 00:00:00

  o_497287-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:58:34 rem: 00:00:00
  o_497287-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:31 rem: 00:00:00
  o_497287-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:24 rem: 00:00:00
  o_497287-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00

  o_497288-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:58:42 rem: 00:00:00
  o_497288-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:22 rem: 00:00:00
  o_497288-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:24 rem: 00:00:00
  o_497288-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:19 rem: 00:00:00

  o_497290-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:58:21 rem: 00:00:00
  o_497290-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:18 rem: 00:00:00
  o_497290-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:20 rem: 00:00:00
  o_497290-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00
pmessmer commented 8 years ago

Were the failing jobs the ones with CUDA_LAUNCH_BLOCKING set or unset?

And so there were no more errors of the form: cudaMemcpyHostToDevice Async error :: ret code: an illegal memory access was encountered?

jgphpc commented 8 years ago

@pmessmer

Were the failing jobs the ones with CUDA_LAUNCH_BLOCKING set or unset?

  • _pmi_inet_setup: job497289/o_497289-loop0[1-4]: this failure happens with $CUDA_LAUNCH_BLOCKING=1 (job01 and job03) and without $CUDA_LAUNCH_BLOCKING=1 (job02 and job04).
  • Expired credential: job01 ($CUDA_LAUNCH_BLOCKING=1) hangs, but this looks more like a Slurm issue.

an illegal memory access was encountered

  • We did not get the error, probably because I was lucky enough to get good GPUs. It is not deterministic.
pmessmer commented 8 years ago

So the pmi_inet_setup problem is more an MPI issue than a GPU-related one. Do we know what this code does before MPI_Init()?

But it seems like even without CUDA_LAUNCH_BLOCKING we are no longer getting the cudaMemcpyAsync problem. Maybe that one was only the manifestation of a different issue?

jgphpc commented 8 years ago

CUDA_LAUNCH_BLOCKING (6 Sept.)

200-node job results:

  export PMI_MMAP_SYNC_WAIT_TIME=4800
  job01: export CUDA_LAUNCH_BLOCKING=1
  job02: unset CUDA_LAUNCH_BLOCKING
  job03: export CUDA_LAUNCH_BLOCKING=1
  job04: unset CUDA_LAUNCH_BLOCKING

1 cuStreamSynchronize failed: unknown error #24661

99.9% [++++++++++++++++++++++++++> ] 224.02/224.33 ela: 00:11:14 rem: 03:22:05
Traceback (most recent call last):
  File " pyfr-1.4.0-py3.5.egg/pyfr/backends/cuda/types.py", line 105, in 
        _wait "of the metaclasses of all its bases")
pycuda._driver.Error: cuStreamSynchronize failed: unknown error
Rank 160 [Tue Sep  6 17:32:09 2016] [c9-1c1s0n2] 
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 160

c9-1c1s0n2/nid03714

NodeName=nid03714
State=IDLE+DRAIN     => not usable by user jobs
Reason=admindown by NHC [root@Ystday 17:52]
2016-09-06T18:07:28.263027+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) GPU warm reset failed after accelerator test failure
2016-09-06T18:07:28.288164+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (Plugin) WARNING: Process (/opt/cray/nodehealth/default/bin/gat.sh -m 10% -r) returned with exit code 1
2016-09-06T18:07:58.311791+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GPU_TEST) WARNING: ACC: nvidia_test.c:199 ERROR - CUDA_ERROR_NO_DEVICE
2016-09-06T18:07:58.343542+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) Reset output: Unable to determine the PCI bus id for the target device: GPU is lost
2016-09-06T18:07:58.343560+02:00 c9-1c1s0n2 Error executing real-nvidia-smi -r -i 0
2016-09-06T18:07:58.343576+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) GPU warm reset failed after accelerator test failure
2016-09-06T18:07:58.406152+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (Plugin) WARNING: Process (/opt/cray/nodehealth/default/bin/gat.sh -m 10% -r) returned with exit code 1
2016-09-06T18:08:24.764769+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (xtnhc) FAILURES: The following tests from the 'application' set have failed in suspect mode:
2016-09-06T18:08:24.790020+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GPU_TEST) WARNING: ACC: nvidia_test.c:199 ERROR - CUDA_ERROR_NO_DEVICE
2016-09-06T18:08:24.815219+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) Reset output: Unable to determine the PCI bus id for the target device: GPU is lost
2016-09-06T18:08:24.815247+02:00 c9-1c1s0n2 Error executing real-nvidia-smi -r -i 0
2016-09-06T18:08:24.840448+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) GPU warm reset failed after accelerator test failure
2016-09-06T18:08:24.877830+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (Plugin) WARNING: Process (/opt/cray/nodehealth/default/bin/gat.sh -m 10% -r) returned with exit code 1
2016-09-06T18:08:24.877860+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (xtnhc) FAILURES: (Admindown) Plugin /opt/cray/nodehealth/default/bin/gat.sh -m 10% -r
2016-09-06T18:08:24.877873+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (xtnhc) FAILURES: End of list of 1 failed test(s)
2016-09-06T18:08:24.934588+02:00 c1-0c0s0n1 <node_health:5.1> APID:50000498868 (xtcheckhealth) WARNING: Could not set node 3714 (c9-1c1s0n2) to admindown because its state is admindown.

4 _pmi_inet_setup:inet_connect failed #24645

inet_connect:inet_connect: connect failed after 301 attempts
_pmi_inet_setup:inet_connect failed
_pmi_init:_pmi_inet_setup (full) returned -1
       JobID               Start                 End    Elapsed    JobName 
------------ ------------------- ------------------- ---------- ---------- 
498870.7     2016/09/06-21:02:04 2016/09/06-21:15:04   00:13:00     python 
    Tue Sep  6 21:13:49 2016: [PE_80]:inet_connect:inet_connect: connect failed after 301 attempts

498870.8     2016/09/06-21:15:05 2016/09/06-21:27:17   00:12:12     python 
    Tue Sep  6 21:26:01 2016: [PE_80]:inet_connect:inet_connect: connect failed after 301 attempts

498870.9     2016/09/06-21:27:19 2016/09/06-21:39:27   00:12:08     python 
    Tue Sep  6 21:38:10 2016: [PE_81]:inet_connect:inet_connect: connect failed after 301 attempts

498870.10    2016/09/06-21:39:29 2016/09/06-21:51:47   00:12:18     python 
    Tue Sep  6 21:50:32 2016: [PE_80]:inet_connect:inet_connect: connect failed after 301 attempts
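
The per-step timings above come from the accounting database; a query of roughly this form reproduces the table (sketch):

  # start/end/elapsed time of every step of the failing job
  sacct -j 498870 --format=JobID,Start,End,Elapsed,JobName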

32 successful jobs

/scratch/daint/piccinal/24315/TR2/0200cn/01/o_498865-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/02/o_498866-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/03/o_498867-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/05/o_498869-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/07/o_498871-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/08/o_498872-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/09/o_498873-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/10/o_498874-loop0*

 100.0% [+>] 224.33/224.33 ela: 00:57:24 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 01:02:27 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:23 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 01:01:00 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:16 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:24 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 01:00:04 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:23 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 00:55:16 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 00:54:05 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:22 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 00:54:23 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:28 rem: 00:00:00

 100.0% [+>] 224.33/224.33 ela: 00:54:11 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:21 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:52:30 rem: 00:00:00
 100.0% [+>] 224.33/224.33 ela: 00:50:28 rem: 00:00:00
jgphpc commented 8 years ago

4-node job

100.0% [++++++++++++++++++++++++++> ] 224.10/224.10 ela: 02:50:43 rem: 00:00:00

eff