This does not seem to be a GPU-related issue. May need to try again though!
100.0% [++++++++++++++++++++++++++> ] 224.32/224.33 ela: 00:50:41 rem: 00:01:20
X=/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pyfr-1.4.0-py3.5.egg/pyfr/
Traceback (most recent call last):
File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/
software/Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
File "$X/scripts/main.py", line 109, in main
File "$X/scripts/main.py", line 248, in process_restart
File "$X/scripts/main.py", line 225, in _process_common
File "$X/integrators/base.py", line 197, in run
File "$X/integrators/std/controllers.py", line 72, in advance_to
File "$X/integrators/std/steppers.py", line 201, in step
File "$X/solvers/navstokes/system.py", line 55, in rhs
File "$X/backends/base/backend.py", line 163, in runall
File "$X/backends/cuda/types.py", line 133, in runall
return self
File "$X/backends/cuda/types.py", line 105, in _wait
"of the metaclasses of all its bases")
pycuda._driver.RuntimeError: cuStreamSynchronize failed: uncorrectable ECC error encountered
Rank 31 [Fri Aug 26 16:54:48 2016] [c7-0c0s15n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 31
/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
warn("%s exiting upon apparent death of %s" % (who, partner))
fallen off the bus
99.9% [++++++++++++++++++++++++++> ] 224.10/224.33 ela: 00:18:34 rem: 00:40:09
Traceback (most recent call last):
File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/
Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
File "$X/pyfr/scripts/main.py", line 109, in main
File "$X/pyfr/scripts/main.py", line 248, in process_restart
File "$X/pyfr/scripts/main.py", line 225, in _process_common
File "$X/pyfr/integrators/base.py", line 197, in run
File "$X/pyfr/integrators/std/controllers.py", line 72, in advance_to
File "$X/pyfr/integrators/std/steppers.py", line 201, in step
File "$X/pyfr/solvers/navstokes/system.py", line 43, in rhs
File "$X/pyfr/backends/base/backend.py", line 163, in runall
File "$X/pyfr/backends/cuda/types.py", line 133, in runall
return self
File "$X/pyfr/backends/cuda/types.py", line 105, in _wait
"of the metaclasses of all its bases")
pycuda._driver.Error: cuStreamSynchronize failed: unknown error
Rank 140 [Fri Aug 26 21:36:29 2016] [c8-1c1s2n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 140
/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
warn("%s exiting upon apparent death of %s" % (who, partner))
File "/apps/common/UES/sandbox/jgp/ebforpyfr/
easybuild/software/Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
File "$X/pyfr/scripts/main.py", line 109, in main
File "$X/pyfr/scripts/main.py", line 248, in process_restart
File "$X/pyfr/scripts/main.py", line 225, in _process_common
File "$X/pyfr/integrators/base.py", line 197, in run
File "$X/pyfr/integrators/std/controllers.py", line 72, in advance_to
File "$X/pyfr/integrators/std/steppers.py", line 201, in step
File "$X/pyfr/solvers/navstokes/system.py", line 55, in rhs
File "$X/pyfr/backends/base/backend.py", line 163, in runall
File "$X/pyfr/backends/cuda/types.py", line 133, in runall return self
File "$X/pyfr/backends/cuda/types.py", line 105, in _wait
"of the metaclasses of all its bases")
pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered <-----
Rank 45 [Sat Aug 27 01:07:13 2016] [c2-0c1s14n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 45
/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
warn("%s exiting upon apparent death of %s" % (who, partner))
File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/
Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
File "$X/pyfr/scripts/main.py", line 109, in main
File "$X/pyfr/scripts/main.py", line 248, in process_restart
File "$X/pyfr/scripts/main.py", line 225, in _process_common
File "$X/pyfr/integrators/base.py", line 197, in run
File "$X/pyfr/integrators/std/controllers.py", line 72, in advance_to
File "$X/pyfr/integrators/std/steppers.py", line 201, in step
File "$X/pyfr/solvers/navstokes/system.py", line 78, in rhs
File "$X/pyfr/backends/base/backend.py", line 163, in runall
File "$X/pyfr/backends/cuda/types.py", line 133, in runall
return self
File "$X/pyfr/backends/cuda/types.py", line 105, in _wait
"of the metaclasses of all its bases")
pycuda._driver.RuntimeError: cuStreamSynchronize failed: uncorrectable ECC error encountered
Rank 30 [Fri Aug 26 20:05:09 2016] [c7-0c0s15n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 30
/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/Python/3.5.2-CrayGNU-2016.03/lib/python3.5/site-packages/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
warn("%s exiting upon apparent death of %s" % (who, partner))
Rank 123 [Fri Aug 26 21:30:06 2016] [c5-1c2s4n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 123
Traceback (most recent call last):
File "/apps/common/UES/sandbox/jgp/ebforpyfr/easybuild/software/
Python/3.5.2-CrayGNU-2016.03/bin/pyfr", line 9, in <module>
load_entry_point('pyfr==1.4.0', 'console_scripts', 'pyfr')()
File "$X/pyfr/scripts/main.py", line 109, in main
File "$X/pyfr/scripts/main.py", line 248, in process_restart
File "$X/pyfr/scripts/main.py", line 225, in _process_common
File "$X/pyfr/integrators/base.py", line 197, in run
File "$X/pyfr/integrators/std/controllers.py", line 75, in advance_to
File "$X/pyfr/integrators/std/controllers.py", line 36, in _accept_ste p
File "$X/pyfr/util.py", line 48, in __call__
File "$X/pyfr/util.py", line 48, in <genexpr>
File "$X/pyfr/plugins/nancheck.py", line 21, in __call__
RuntimeError: NaNs detected at t = 224.14999999998912
@pmessmer is there a way to increase the debug level regarding the cuStreamSynchronize errors?
Reply from Peter:
I'm not aware of a way to increase the verbosity of the stream sync errors, but given the variety of failures I would expect these to be just the effect, not the cause. Can you try running with the env variable:
CUDA_LAUNCH_BLOCKING=1
so that we at least get rid of some of the asynchronicity.
Also, any chance to run under cuda-memcheck, maybe even with the racecheck tool?
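For reference, a minimal sketch of how these suggestions could be applied to a small reproduction run (the single-node launch and the mesh/solution file names are placeholders, not taken from the failing jobs):

```bash
# Make kernel launches synchronous so errors surface at the offending call.
export CUDA_LAUNCH_BLOCKING=1

# Run one rank under cuda-memcheck; --tool racecheck looks for shared-memory
# data races. This is far too slow for the full 200-node case, so the idea is
# to reproduce on a small case first.
srun -N 1 -n 1 cuda-memcheck --tool racecheck \
    pyfr restart -b cuda mesh.pyfrm solution.pyfrs
```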
- CUDA_LAUNCH_BLOCKING=yes + PMI_MMAP_SYNC_WAIT_TIME=4800 + 200cn:
- CUDA_LAUNCH_BLOCKING=no + PMI_MMAP_SYNC_WAIT_TIME=4800 + 200cn:
- CUDA_LAUNCH_BLOCKING=no + PMI_MMAP_SYNC_WAIT_TIME=4800 + 200cn:
- CUDA_LAUNCH_BLOCKING=no + PMI_MMAP_SYNC_WAIT_TIME=4800 + 200cn:
I think we use two streams, and I doubt we can reduce that any further. They represent two sequences of operations which can run asynchronously.
See above.
=> fixed with PMI_MMAP_SYNC_WAIT_TIME=4800
The most critical scaling issue, however, was that the very largest jobs would frequently fail to run. When jobs requested a large number of tasks (e.g., > 60,000), the job step would exit with a PMI2 failure-to-initialize message. In the end, we found that this was due to a difference in behavior between srun and aprun. By default, aprun copies executables prior to executing them, whereas srun does not. For most small to medium jobs the srun behavior is probably fine, if not better. However, running a large number of ranks directly from the parallel filesystem (Lustre, DVS, whatever) would fail because the filesystem could not deliver the executable at that level of parallelism within the default ALPS timeout of 60 s. The workaround is to set PMI_MMAP_SYNC_WAIT_TIME=300 in the application environment, which increases the timeout from 60 s to 300 s. The longer-term solution was a feature that SchedMD implemented in later versions of 15.08, which merged the functionality of srun and sbcast (srun --bcast) so that the executable is automatically copied prior to execution. A further improvement that enables compression is coming in 16.05; that is expected to put srun job startup on the same level as aprun.
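For reference, a minimal sketch of how the two workarounds described above could look in a Slurm batch script (the node counts, the /tmp staging path and the input file names are illustrative, not taken from the jobs in this issue):

```bash
#!/bin/bash -l
#SBATCH --nodes=200
#SBATCH --ntasks-per-node=1

# Workaround 1: give PMI more time to set up at job start (value in seconds).
export PMI_MMAP_SYNC_WAIT_TIME=300

# Workaround 2 (Slurm >= 15.08 with the srun/sbcast merge): copy the executable
# to node-local storage before running it, instead of having every rank load it
# from the parallel filesystem.
srun --bcast=/tmp/pyfr-bin -n "$SLURM_NTASKS" \
    pyfr restart -b cuda mesh.pyfrm solution.pyfrs
```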
Traceback (most recent call last):
File "/tmp/iyerarv/env/bin/pyfr", line 9, in <module>
load_entry_point('pyfr==1.3.0', 'console_scripts', 'pyfr')()
File "$X/scripts/main.py", line 110, in main
File "$X/scripts/main.py", line 253, in process_restart
File "$X/scripts/main.py", line 210, in _process_common
File "$X/backends/__init__.py", line 11, in get_backend
File "$X/backends/cuda/base.py", line 33, in __init__
File "$Y/autoinit.py", line 9, in <module>
context = make_default_context()
File "/tmp/iyerarv/env/bin/pyfr", line 9, in <module>
load_entry_point('pyfr==1.3.0', 'console_scripts', 'pyfr')()
File "$X/scripts/main.py", line 110, in main
File "$X/scripts/main.py", line 253, in process_restart
File "$X/scripts/main.py", line 210, in _process_common
File "$X/backends/__init__.py", line 11, in get_backend
File "$X/backends/cuda/base.py", line 33, in __init__
File "$Y/autoinit.py", line 9, in <module>
context = make_default_context()
File "$Y/tools.py", line 204, in make_default_context
"on any of the %d detected devices" % ndevices)
RuntimeError: make_default_context() wasn't able to create a context on any of the 1 detected devices
CUDA_LAUNCH_BLOCKING=1
PMI_MMAP_SYNC_WAIT_TIME=4800
Thu Sep 1 21:08:34 2016: [PE_1341]:inet_connect:inet_connect: connect failed after 301 attempts
Thu Sep 1 21:08:34 2016: [PE_1341]:_pmi_inet_setup:inet_connect failed
Thu Sep 1 21:08:34 2016: [PE_1341]:_pmi_init:_pmi_inet_setup (full) returned -1
The following nodes are down: 02856, 03604, 00498, 01456, 04170
daint: pmi/5.0.10-1.0000.11050.0.0.ari
brisi: pmi/5.0.10-1.0000.11050.0.0.ari
PMI_MMAP_SYNC_WAIT_TIME=4800
[PE_63]:inet_connect:inet_connect: connect failed after 301 attempts
[PE_63]:_pmi_inet_setup:inet_connect failed
[PE_63]:_pmi_init:_pmi_inet_setup (full) returned -1
nid02219, nid02284, nid03220, nid03331, nid03775, nid04035, nid05022
Those nodes had network issues, [but] at a different time than what your job failure suggests:
2016-09-04T05:58:43.419474+02:00 c1-1c2s11n0 LNet: Quiesce start: hardware quiesce
2016-09-04T05:59:11.432366+02:00 c1-1c2s11n0 LNet: Quiesce complete: hardware quiesce
2016-09-04T15:12:33.482771+02:00 c1-1c2s11n0 WARNING: mem_cgroup_force_empty: memory usage of 126976 bytes or ret 0 in cgroup: /slurm/uid_23556/job_493432/step_0
crayadm@daintsmw:/var/opt/cray/log/p0-current> grep c1-1c2s11n0 console-20160905
2016-09-05T01:51:09.182267+02:00 c1-1c2s11n0 Lustre: 24521:0:(client.c:1910:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1473032974/real 1473032974] req@ffff8807244e8800 x1544286731626040/t0(0) o400->snx11026-OST0075-osc-ffff88083cb62800@148.187.5.36@o2ib1013:28/4 lens 224/224 e 0 to 1 dl 1473033069 ref 1 fl Rpc:XNU/0/ffffffff rc 0/-1
2016-09-05T01:51:09.257834+02:00 c1-1c2s11n0 Lustre: 24514:0:(client.c:1910:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1473032974/real 1473032974] req@ffff8807bfede400 x1544286731626012/t0(0) o400->snx11026-OST006e-osc-ffff88083cb62800@148.187.5.34@o2ib1013:28/4 lens 224/224 e 0 to 1 dl 1473033069 ref 1 fl Rpc:XNU/0/ffffffff rc 0/-1
[...]
2016-09-05T02:01:13.960753+02:00 c1-1c2s11n0 Lustre: Skipped 1 previous similar message
2016-09-05T10:02:15.112972+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (xtnhc) FAILURES: The following tests from the 'reservation' set have failed in normal mode:
2016-09-05T10:02:15.138167+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (Reservation_Test) WARNING: Directory /proc/reservations/489959 still exists
2016-09-05T10:02:15.163368+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (Reservation_Test) WARNING: Reservation 489959 has status: rid 489959 flags ENDED jobs 9809705304225
2016-09-05T10:02:15.163411+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (xtnhc) FAILURES: (Admindown) Reservation_Test
2016-09-05T10:02:15.188592+02:00 c1-1c2s11n0 <node_health:5.1> RESID:489959 (xtnhc) FAILURES: End of list of 1 failed test(s)
2016-09-05T10:02:17.774901+02:00 c1-0c0s0n1 <node_health:5.1> RESID:489959 (xtcheckhealth) WARNING: Set node 2284 (c1-1c2s11n0) to suspect because the node failed a health test.
[PE_109]:inet_connect:inet_connect: connect failed after 301 attempts
[PE_109]:_pmi_inet_setup:inet_connect failed
[PE_109]:_pmi_init:_pmi_inet_setup (full) returned -1
nid02015, nid02023, nid02219, nid02284, nid03220, nid03331, nid03775, nid04035, nid05022
PMI 2.1.4: 775317, 777046 inet_connect: connect failed after 31 attempts
/project/csstaff/inputs/pyfr/logs/0200cn/
export PMI_MMAP_SYNC_WAIT_TIME=4800
job01: export CUDA_LAUNCH_BLOCKING=1
job02: unset CUDA_LAUNCH_BLOCKING
job03: export CUDA_LAUNCH_BLOCKING=1
job04: unset CUDA_LAUNCH_BLOCKING
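A minimal sketch of how the four-job matrix above was presumably driven (the script names and the submission loop are assumptions; only the exported variables come from the listing):

```bash
# Common to all four jobs: generous PMI startup timeout.
export PMI_MMAP_SYNC_WAIT_TIME=4800

for job in 01 02 03 04; do
    case "$job" in
        01|03) export CUDA_LAUNCH_BLOCKING=1 ;;   # jobs with blocking launches
        02|04) unset CUDA_LAUNCH_BLOCKING ;;      # jobs without
    esac
    # Hypothetical submission of the 200-node PyFR case for this configuration;
    # sbatch propagates the exported environment to the job by default.
    sbatch --nodes=200 "job${job}.sbatch"
done
```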
Tue Sep 6 01:55:21 2016: [PE_98]:_pmi_inet_setup:inet_connect failed
Tue Sep 6 01:55:21 2016: [PE_98]:_pmi_init:_pmi_inet_setup (full) returned -1
[Tue Sep 6 01:55:24 2016] [c9-0c0s3n2] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(547):
MPID_Init(203).......: channel initialization failed
MPID_Init(584).......: PMI2 init failed: 1
slurmctld 4199 p0-20160901t141153 - error: Job 497289 has zero end_time https://bugs.schedmd.com/show_bug.cgi?id=3053
piccinal@daint01:/scratch/daint/piccinal/24315/TR2/0200cn/06 $
cat o_497285-loop01
99.9% [+> ] 224.00/224.33 ela: 00:01:51 rem: 11:18:49
slurmstepd: Munge decode failed: Expired credential
slurmstepd: Verifying authentication credential: Expired credential
piccinal@daint01:/scratch/daint/piccinal/24315/TR2/0200cn/07 $
cat o_496033-loop01
99.9% [+> ] 224.01/224.33 ela: 00:23:31 rem: 09:05:13
slurmstepd: Munge decode failed: Expired credential
slurmstepd: Verifying authentication credential: Expired credential
Some nodes seem to have problems reaching back to the sdb and Slurm. It looks like some /var entries are stale. There is not much more we can get from NHC.
/scratch/daint/piccinal/24315/TR2/0200cn/10/o_497282-loop*
/scratch/daint/piccinal/24315/TR2/0200cn/09/o_497283-loop*
/scratch/daint/piccinal/24315/TR2/0200cn/08/o_497284-loop*
/scratch/daint/piccinal/24315/TR2/0200cn/05/o_497286-loop*
/scratch/daint/piccinal/24315/TR2/0200cn/04/o_497287-loop*
/scratch/daint/piccinal/24315/TR2/0200cn/03/o_497288-loop*
/scratch/daint/piccinal/24315/TR2/0200cn/01/o_497290-loop*
o_497282-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:55:47 rem: 00:00:00
o_497282-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:26 rem: 00:00:00
o_497282-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
o_497282-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:22 rem: 00:00:00
o_497283-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:55:53 rem: 00:00:00
o_497283-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:21 rem: 00:00:00
o_497283-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:23 rem: 00:00:00
o_497283-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00
o_497284-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:54:00 rem: 00:00:00
o_497284-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:22 rem: 00:00:00
o_497284-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:16 rem: 00:00:00
o_497284-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:28 rem: 00:00:00
o_497286-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:54:09 rem: 00:00:00
o_497286-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:17 rem: 00:00:00
o_497286-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:19 rem: 00:00:00
o_497286-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:19 rem: 00:00:00
o_497287-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:58:34 rem: 00:00:00
o_497287-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:31 rem: 00:00:00
o_497287-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:24 rem: 00:00:00
o_497287-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00
o_497288-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:58:42 rem: 00:00:00
o_497288-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:22 rem: 00:00:00
o_497288-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:24 rem: 00:00:00
o_497288-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:19 rem: 00:00:00
o_497290-loop01 == 100.0% [+>] 224.33/224.33 ela: 00:58:21 rem: 00:00:00
o_497290-loop02 == 100.0% [+>] 224.33/224.33 ela: 00:50:18 rem: 00:00:00
o_497290-loop03 == 100.0% [+>] 224.33/224.33 ela: 00:52:20 rem: 00:00:00
o_497290-loop04 == 100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00
Were the failing jobs the ones with CUDA_LAUNCH_BLOCKING set or unset?
And so there were no more errors of the form: cudaMemcpyHostToDevice Async error :: ret code: an illegal memory access was encountered
@pmessmer
Were the failing jobs the ones with CUDA_LAUNCH_BLOCKING set or unset?
- _pmi_inet_setup (job497289/o_497289-loop0[1-4]): this failure happens with $CUDA_LAUNCH_BLOCKING=1 (job01 and job03) and without it (job02 and job04).
- Expired credential: job01 ($CUDA_LAUNCH_BLOCKING=1) hangs, but this looks more like a Slurm issue.
- An illegal memory access was encountered: we did not get this error, probably because I was lucky enough to get good GPUs. It is not deterministic.
So the _pmi_inet_setup problem is more of an MPI issue than a GPU-related one. Do we know what this code does before MPI_Init()?
But it seems like even without CUDA_LAUNCH_BLOCKING we are no longer getting the cudaMemcpyAsync problem. Maybe that one was only the manifestation of a different issue?
Half of the jobs ran with $CUDA_LAUNCH_BLOCKING=1 and half without. The errors seen were:
- cuStreamSynchronize failed: unknown error
- cuInit failed
- _pmi_inet_setup:inet_connect failed
/project/csstaff/inputs/pyfr/logs/0200cn/
export PMI_MMAP_SYNC_WAIT_TIME=4800
job01: export CUDA_LAUNCH_BLOCKING=1
job02: unset CUDA_LAUNCH_BLOCKING
job03: export CUDA_LAUNCH_BLOCKING=1
job04: unset CUDA_LAUNCH_BLOCKING
99.9% [++++++++++++++++++++++++++> ] 224.02/224.33 ela: 00:11:14 rem: 03:22:05
Traceback (most recent call last):
File " pyfr-1.4.0-py3.5.egg/pyfr/backends/cuda/types.py", line 105, in
_wait "of the metaclasses of all its bases")
pycuda._driver.Error: cuStreamSynchronize failed: unknown error
Rank 160 [Tue Sep 6 17:32:09 2016] [c9-1c1s0n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 160
cuInit failed: no CUDA-capable device is detected
NodeName=nid03714
State=IDLE+DRAIN => not usable by user jobs
Reason=admindown by NHC [root@Ystday 17:52]
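For completeness, the drain state above can be queried with standard Slurm commands; a small sketch (the node name comes from the log above, output formatting varies by Slurm version):

```bash
# Show drained/down nodes together with the reason recorded by NHC.
sinfo -R --nodes=nid03714

# Full record for the node, including the State and Reason fields.
scontrol show node nid03714 | grep -E 'State|Reason'
```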
2016-09-06T18:07:28.263027+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) GPU warm reset failed after accelerator test failure
2016-09-06T18:07:28.288164+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (Plugin) WARNING: Process (/opt/cray/nodehealth/default/bin/gat.sh -m 10% -r) returned with exit code 1
2016-09-06T18:07:58.311791+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GPU_TEST) WARNING: ACC: nvidia_test.c:199 ERROR - CUDA_ERROR_NO_DEVICE
2016-09-06T18:07:58.343542+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) Reset output: Unable to determine the PCI bus id for the target device: GPU is lost
2016-09-06T18:07:58.343560+02:00 c9-1c1s0n2 Error executing real-nvidia-smi -r -i 0
2016-09-06T18:07:58.343576+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) GPU warm reset failed after accelerator test failure
2016-09-06T18:07:58.406152+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (Plugin) WARNING: Process (/opt/cray/nodehealth/default/bin/gat.sh -m 10% -r) returned with exit code 1
2016-09-06T18:08:24.764769+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (xtnhc) FAILURES: The following tests from the 'application' set have failed in suspect mode:
2016-09-06T18:08:24.790020+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GPU_TEST) WARNING: ACC: nvidia_test.c:199 ERROR - CUDA_ERROR_NO_DEVICE
2016-09-06T18:08:24.815219+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) Reset output: Unable to determine the PCI bus id for the target device: GPU is lost
2016-09-06T18:08:24.815247+02:00 c9-1c1s0n2 Error executing real-nvidia-smi -r -i 0
2016-09-06T18:08:24.840448+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (GAT) GPU warm reset failed after accelerator test failure
2016-09-06T18:08:24.877830+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (Plugin) WARNING: Process (/opt/cray/nodehealth/default/bin/gat.sh -m 10% -r) returned with exit code 1
2016-09-06T18:08:24.877860+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (xtnhc) FAILURES: (Admindown) Plugin /opt/cray/nodehealth/default/bin/gat.sh -m 10% -r
2016-09-06T18:08:24.877873+02:00 c9-1c1s0n2 <node_health:5.1> APID:50000498868 (xtnhc) FAILURES: End of list of 1 failed test(s)
2016-09-06T18:08:24.934588+02:00 c1-0c0s0n1 <node_health:5.1> APID:50000498868 (xtcheckhealth) WARNING: Could not set node 3714 (c9-1c1s0n2) to admindown because its state is admindown.
inet_connect:inet_connect: connect failed after 301 attempts
_pmi_inet_setup:inet_connect failed
_pmi_init:_pmi_inet_setup (full) returned -1
JobID Start End Elapsed JobName
------------ ------------------- ------------------- ---------- ----------
498870.7 2016/09/06-21:02:04 2016/09/06-21:15:04 00:13:00 python
Tue Sep 6 21:13:49 2016:
[PE_80]:inet_connect:inet_connect: connect failed after 301 attempts
498870.8 2016/09/06-21:15:05 2016/09/06-21:27:17 00:12:12 python
Tue Sep 6 21:26:01 2016:
[PE_80]:inet_connect:inet_connect: connect failed after 301 attempts
498870.9 2016/09/06-21:27:19 2016/09/06-21:39:27 00:12:08 python
Tue Sep 6 21:38:10 2016:
[PE_81]:inet_connect:inet_connect: connect failed after 301 attempts
498870.10 2016/09/06-21:39:29 2016/09/06-21:51:47 00:12:18 python
Tue Sep 6 21:50:32 2016:
[PE_80]:inet_connect:inet_connect: connect failed after 301 attempts
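For reference, a table like the one above can presumably be regenerated with sacct (the date format shown suggests a site-specific SLURM_TIME_FORMAT; the fields themselves are standard):

```bash
# Per-step start/end/elapsed times and step names for the job in question.
sacct -j 498870 --format=JobID,Start,End,Elapsed,JobName
```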
/scratch/daint/piccinal/24315/TR2/0200cn/01/o_498865-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/02/o_498866-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/03/o_498867-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/05/o_498869-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/07/o_498871-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/08/o_498872-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/09/o_498873-loop0*
/scratch/daint/piccinal/24315/TR2/0200cn/10/o_498874-loop0*
100.0% [+>] 224.33/224.33 ela: 00:57:24 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 01:02:27 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:23 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 01:01:00 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:52:16 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:24 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 01:00:04 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:21 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:52:23 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:55:16 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:54:05 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:20 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:52:22 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:54:23 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:25 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:52:21 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:28 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:54:11 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:21 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:52:30 rem: 00:00:00
100.0% [+>] 224.33/224.33 ela: 00:50:28 rem: 00:00:00
100.0% [++++++++++++++++++++++++++> ] 224.10/224.10 ela: 02:50:43 rem: 00:00:00
200-node job
Environment
Inputs
Run
TODO: 4000-node job