eth-cscs / pyfr

pyfr@cscs (https://github.com/vincentlab/PyFR)

pycuda._driver.Error: cuStreamSynchronize failed #1

Open jgphpc opened 8 years ago

jgphpc commented 8 years ago

Two types of error (see below): "cuStreamSynchronize failed: uncorrectable ECC error encountered" and "cuStreamSynchronize failed: unknown error" (which usually leads to a "GPU has fallen off the bus" error).

job357235

  pycuda._driver.Error: cuStreamSynchronize failed: unknown error
  DBG: nid01361 [Mon Aug  8 21:31:06 2016]: building axnpby
  Rank 1900 [Mon Aug  8 21:33:04 2016] [c2-1c2s3n2] application called 
MPI_Abort(MPI_COMM_WORLD, 1) - process 1900
  DBG: nid01361 [Mon Aug  8 21:31:06 2016]: built axnpby
  /tmp/iyerarv/env/lib/python3.4/site-packages/pytools-2016.2.1-py3.4.egg/pytools/prefork.py:93: 
UserWarning: Prefork server exiting upon apparent death of parent
    warn("%s exiting upon apparent death of %s" % (who, partner))

  Traceback (most recent call last):
    File "/tmp/iyerarv/env/bin/pyfr", line 9, in <module>
      load_entry_point('pyfr==1.3.0', 'console_scripts', 'pyfr')()
    File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/scripts/main.py", line 110, in main
    File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/scripts/main.py", line 253, in process_restart
    File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/scripts/main.py", line 230, in _process_common
    File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/integrators/base.py", line 122, in run
    File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/integrators/controllers.py", line 125, in advance_to
    File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/integrators/steppers.py", line 227, in step
    File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/solvers/navstokes/system.py", line 68, in rhs
    File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/backends/base/backend.py", line 163, in runall
    File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/backends/cuda/types.py", line 133, in runall
    File "/tmp/iyerarv/env/lib/python3.4/site-packages/pyfr-1.3.0-py3.4.egg/pyfr/backends/cuda/types.py", line 105, in _wait

DBG: nid01361 [Mon Aug  8 21:31:05 2016]: Start time: 1470684665.8907437
DBG: nid00271 [Mon Aug  8 21:29:51 2016]: built mpiconu

job341733

  pycuda._driver.Error: cuStreamSynchronize failed: unknown error
  Rank 1108 [Wed Aug  3 20:42:49 2016] [c2-1c2s3n2] 
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1108
  /tmp/iyerarv/env/lib/python3.4/site-packages/
pytools-2016.2.1-py3.4.egg/pytools/prefork.py:93: UserWarning: 
Prefork server exiting upon apparent   death of parent
    warn("%s exiting upon apparent death of %s" % (who, partner))

job324125

      ## L1469087 pycuda._driver.RuntimeError: cuStreamSynchronize failed:
 uncorrectable ECC error encountered
      ## 1468999:  99.9% [++++++++++++++++++++++++++> ]
224.03/224.33 ela: 00:08:25 rem: 01:28:04 nfevals: 960 wclocktime: 505.286358
      # pycuda._driver.RuntimeError: cuStreamSynchronize failed: 
uncorrectable ECC error encountered                                              
      # Rank 1456 [Fri Jul 29 22:35:48 2016] [c0-2c1s8n3] 
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1456
  scontrol show node nid03939 |grep State 
# => State=MAINT*
  xtprocadmin |grep c0-2c1s8n3            
# => 3939    0xf63  c0-2c1s8n3  compute      down       batch

job341733.4

pycuda._driver.Error: cuStreamSynchronize failed: unknown error
      # Rank 1108 [Wed Aug  3 20:42:49 2016] [c2-1c2s3n2] 
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1108
      # /tmp/iyerarv/env/lib/python3.4/site-packages/pytools-2016.2.1-py3.4.egg/
pytools/prefork.py:93: 
      # UserWarning: Prefork server exiting upon apparent death of parent

job343408.4

      # pycuda._driver.Error: cuStreamSynchronize failed: unknown error
      # Rank 1455 [Wed Aug  3 22:39:12 2016] [c8-1c1s2n1] application called 
MPI_Abort(MPI_COMM_WORLD, 1) - process 1455
      # /tmp/iyerarv/env/lib/python3.4/site-packages/
pytools-2016.2.1-py3.4.egg/pytools/prefork.py:93: 
      # UserWarning: Prefork server exiting upon apparent death of parent

job?

      # pycuda._driver.Error: cuStreamSynchronize failed: unknown error
      # Rank 2915 [Thu Aug  4 16:54:20 2016] [c7-2c0s14n2] application called 
MPI_Abort(MPI_COMM_WORLD, 1) - process 2915
      # /tmp/iyerarv/env/lib/python3.4/site-packages/
pytools-2016.2.1-py3.4.egg/pytools/prefork.py:93: 
      # UserWarning: Prefork server exiting upon apparent death of parent
      # srun: error: nid05242: task 2915: Aborted

job?

      # pycuda._driver.RuntimeError: cuStreamSynchronize failed: 
uncorrectable ECC error encountered
      # Rank 1568 [Fri Aug  5 14:06:00 2016] [c1-1c0s8n2] application called 
MPI_Abort(MPI_COMM_WORLD, 1) - process 1568
      # /tmp/iyerarv/env/lib/python3.4/site-packages/
pytools-2016.2.1-py3.4.egg/pytools/prefork.py:93: 
      # UserWarning: Prefork server exiting upon apparent death of parent
      #  warn("%s exiting upon apparent death of %s" % (who, partner))
      # srun: error: Node failure on nid02146
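
To make it easier to see which error type and which node each failure belongs to, the logs above could be tallied with a small script. The following is only a sketch, not part of PyFR or PyCUDA: the function name `summarize` and both regular expressions are hypothetical and assume the message layout seen in the excerpts above (a "Rank N [date] [cname]" line and one of the two "cuStreamSynchronize failed" reasons). Whitespace is collapsed first because the reason text is sometimes wrapped onto the next line.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts above:
#   Rank 1900 [Mon Aug  8 21:33:04 2016] [c2-1c2s3n2] application called ...
RANK_RE = re.compile(r"Rank (\d+) \[([^\]]+)\] \[(c\d+-\d+c\d+s\d+n\d+)\]")

# The two failure reasons reported in this issue.
ERR_RE = re.compile(
    r"cuStreamSynchronize failed:\s*"
    r"(unknown error|uncorrectable ECC error encountered)"
)


def summarize(log_text):
    """Count error reasons and failing node names (cnames) in raw log text."""
    # Collapse all runs of whitespace so that wrapped lines are rejoined.
    flat = " ".join(log_text.split())
    errors = Counter(ERR_RE.findall(flat))
    nodes = Counter(m.group(3) for m in RANK_RE.finditer(flat))
    return errors, nodes


# Two excerpts copied from the logs above.
sample = """
pycuda._driver.Error: cuStreamSynchronize failed: unknown error
Rank 1900 [Mon Aug  8 21:33:04 2016] [c2-1c2s3n2] application called
MPI_Abort(MPI_COMM_WORLD, 1) - process 1900
# pycuda._driver.RuntimeError: cuStreamSynchronize failed:
uncorrectable ECC error encountered
# Rank 1568 [Fri Aug  5 14:06:00 2016] [c1-1c0s8n2] application called
MPI_Abort(MPI_COMM_WORLD, 1) - process 1568
"""

errors, nodes = summarize(sample)
print(dict(errors))
print(dict(nodes))
```

A cname that shows up in more than one job (c2-1c2s3n2 appears in both job357235 and job341733 above) may be worth checking with scontrol or xtprocadmin, as was done for c0-2c1s8n3.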