Two types of error occur (logs below):
cuStreamSynchronize failed: uncorrectable ECC error encountered
and
cuStreamSynchronize failed: unknown error (usually leads to a "GPU has fallen off the bus" error).
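To tally how often each of the two messages occurs in a job's error file, a small script along these lines may help. It is only a sketch: the function name `count_cuda_errors` and the labels `ecc` / `unknown` are made up here, and the pattern assumes the messages appear exactly as quoted above (possibly wrapped after the colon).

```python
import re
from collections import Counter

# Matches the two messages quoted above. \s* also spans a line break,
# because some logs wrap right after "failed:".
ERROR_RE = re.compile(
    r"cuStreamSynchronize failed:\s*"
    r"(uncorrectable ECC error encountered|unknown error)"
)


def count_cuda_errors(text):
    """Count occurrences of each error type in a block of log text.

    The labels "ecc" and "unknown" are arbitrary names for this sketch.
    """
    counts = Counter()
    for match in ERROR_RE.finditer(text):
        kind = "ecc" if match.group(1).startswith("uncorrectable") else "unknown"
        counts[kind] += 1
    return counts
```

Typical use would be `count_cuda_errors(open(path).read())` on one `.error` file per job.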
pycuda._driver.Error: cuStreamSynchronize failed: unknown error
DBG: nid01361 [Mon Aug 8 21:31:06 2016]: building axnpby
Rank 1900 [Mon Aug 8 21:33:04 2016] [c2-1c2s3n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1900
DBG: nid01361 [Mon Aug 8 21:31:06 2016]: built axnpby
/tmp/iyerarv/env/lib/python3.4/site-packages/pytools-2016.2.1-py3.4.egg/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
  warn("%s exiting upon apparent death of %s" % (who, partner))
Traceback (most recent call last):
  File "/tmp/iyerarv/env/bin/pyfr", line 9, in <module>
    load_entry_point('pyfr==1.3.0', 'console_scripts', 'pyfr')()
  File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/scripts/main.py", line 110, in main
  File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/scripts/main.py", line 253, in process_restart
  File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/scripts/main.py", line 230, in _process_common
  File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/integrators/base.py", line 122, in run
  File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/integrators/controllers.py", line 125, in advance_to
  File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/integrators/steppers.py", line 227, in step
  File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/solvers/navstokes/system.py", line 68, in rhs
  File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/backends/base/backend.py", line 163, in runall
  File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/backends/cuda/types.py", line 133, in runall
  File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/backends/cuda/types.py", line 105, in _wait
DBG: nid01361 [Mon Aug 8 21:31:05 2016]: Start time: 1470684665.8907437
DBG: nid00271 [Mon Aug 8 21:29:51 2016]: built mpiconu
vim T106D_cascade_3d-2-176.000PCC-001RCPLDG-TR1.0004.error
pycuda._driver.Error: cuStreamSynchronize failed: unknown error
Rank 1108 [Wed Aug 3 20:42:49 2016] [c2-1c2s3n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1108
/tmp/iyerarv/env/lib/python3.4/site-packages/pytools-2016.2.1-py3.4.egg/pytools/prefork.py:93: UserWarning: Prefork server exiting upon apparent death of parent
  warn("%s exiting upon apparent death of %s" % (who, partner))
job324125
2000 compute nodes (<1h)
cd /scratch/daint/iyerarv/T106D-3D-Scaling-Daint/
cd T106D_cascade_3d-1-035.200PCC-010RCPLDG/TR1/jobs/324125/
scontrol show node nid03939 |grep State
# => State=MAINT*
xtprocadmin |grep c0-2c1s8n3
# => 3939 0xf63 c0-2c1s8n3 compute down batch
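The same node-state check can be scripted when several suspect nodes need to be queried. The sketch below wraps the `scontrol show node` command used above. The helper names `parse_node_state` and `node_state` are hypothetical, and the parser assumes only that the output contains a `State=...` field, as the grep above shows.

```python
import re
import subprocess

# \b keeps the pattern from matching inside a longer field name ending in "State".
STATE_RE = re.compile(r"\bState=(\S+)")


def parse_node_state(scontrol_output):
    """Return the value of the State= field, or None if it is absent."""
    match = STATE_RE.search(scontrol_output)
    return match.group(1) if match else None


def node_state(nid):
    """Run `scontrol show node <nid>` and return its state string.

    Only works on a machine where scontrol is on the PATH.
    """
    result = subprocess.run(
        ["scontrol", "show", "node", nid],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_node_state(result.stdout)
```

For example, `node_state("nid03939")` would be expected to return `MAINT*` for the node shown above while it is still in maintenance.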
job341733.4
2000 nodes
# pycuda._driver.Error: cuStreamSynchronize failed: unknown error
# Rank 1108 [Wed Aug 3 20:42:49 2016] [c2-1c2s3n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1108
# /tmp/iyerarv/env/lib/python3.4/site-packages/pytools-2016.2.1-py3.4.egg/pytools/prefork.py:93:
# UserWarning: Prefork server exiting upon apparent death of parent
job343408.4
2000 nodes
# pycuda._driver.Error: cuStreamSynchronize failed: unknown error
# Rank 1455 [Wed Aug 3 22:39:12 2016] [c8-1c1s2n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1455
# /tmp/iyerarv/env/lib/python3.4/site-packages/pytools-2016.2.1-py3.4.egg/pytools/prefork.py:93:
# UserWarning: Prefork server exiting upon apparent death of parent
job?
3000 nodes
# pycuda._driver.Error: cuStreamSynchronize failed: unknown error
# Rank 2915 [Thu Aug 4 16:54:20 2016] [c7-2c0s14n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2915
# /tmp/iyerarv/env/lib/python3.4/site-packages/pytools-2016.2.1-py3.4.egg/pytools/prefork.py:93:
# UserWarning: Prefork server exiting upon apparent death of parent
# srun: error: nid05242: task 2915: Aborted
job?
# pycuda._driver.RuntimeError: cuStreamSynchronize failed: uncorrectable ECC error encountered
# Rank 1568 [Fri Aug 5 14:06:00 2016] [c1-1c0s8n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1568
# /tmp/iyerarv/env/lib/python3.4/site-packages/pytools-2016.2.1-py3.4.egg/pytools/prefork.py:93:
# UserWarning: Prefork server exiting upon apparent death of parent
# warn("%s exiting upon apparent death of %s" % (who, partner))
# srun: error: Node failure on nid02146
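Each failure above names the aborting rank and the node it ran on in the `Rank N [timestamp] [node] application called MPI_Abort` line. To check whether the same node keeps turning up across jobs, those lines can be pulled out of the error files with a regular expression. This is a minimal sketch: the names `find_aborts` and `abort_counts_by_node` are invented here, and the pattern assumes the line format seen in these logs.

```python
import re
from collections import Counter

# Rank, timestamp and node name as printed in the MPI_Abort lines above.
# \s+ tolerates the line being wrapped before or after "application called".
ABORT_RE = re.compile(
    r"Rank (\d+) \[([^\]]+)\] \[([^\]]+)\]\s+application called\s+MPI_Abort"
)


def find_aborts(text):
    """Return a list of (rank, timestamp, node) tuples found in text."""
    return [(int(rank), stamp, node) for rank, stamp, node in ABORT_RE.findall(text)]


def abort_counts_by_node(text):
    """Count MPI_Abort reports per node name."""
    return Counter(node for _, _, node in find_aborts(text))
```

In the excerpts above, node c2-1c2s3n2 appears in abort reports from both Aug 3 and Aug 8, which is the kind of recurrence this count is meant to surface.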
job357235
job341733
job324125
job341733.4
job343408.4
job?
job?