IDSIA / brainstorm

Fast, flexible and fun neural networks.

PyCudaHandler: Illegal memory access (cuMemcpyDtoD failed) #84

Open Qwlouse opened 8 years ago

Qwlouse commented 8 years ago

I encountered a weird crash. I'm not really sure what happened, because training ran fine for 29 epochs before it failed in epoch 30. I can't reproduce it, but I wanted to keep the log here for future reference.

 - - - - - - - - - - - -  Epoch 30  - - - - - - - - - - - -
[====1====2====3=validation
  total_loss                            : 122.73372146606445
  Accuracy                              : 0.6446704
===4===PyCUDA WARNING: a clean-up operation failed (dead context maybe?)

cuMemFree failed: an illegal memory access was encountered
ERROR - diag_hutter - Failed after 2 days, 7:11:11!
/home/greff/Programming/sacred/sacred/observers/mongo.py:110: DeprecationWarning: save is deprecated. Use insert_one or replace_one instead
  self.runs.save(self.run_entry)
Traceback (most recent calls WITHOUT Sacred internals):
  File "diag_hutter.py", line 87, in run
    trainer.train(network, getter_tr, valid_getter=getter_va)
  File "/home/greff/Programming/brainstorm/brainstorm/training/trainer.py", line 99, in train
    self.stepper.run()
  File "/home/greff/Programming/brainstorm/brainstorm/training/steppers.py", line 103, in run
    self.net.forward_pass(training_pass=True)
  File "/home/greff/Programming/brainstorm/brainstorm/structure/network.py", line 431, in forward_pass
    layer.forward_pass(self.buffer[layer_name], training_pass)
  File "/home/greff/Programming/brainstorm/brainstorm/layers/loss_layer.py", line 46, in forward_pass
    buffers.outputs.loss.reshape(tuple()))
  File "/home/greff/Programming/brainstorm/brainstorm/handlers/pycuda_handler.py", line 378, in sum_t
    self.copy_to(cumisc.sum(a), out)
  File "/home/greff/Programming/brainstorm/brainstorm/handlers/pycuda_handler.py", line 84, in copy_to
    pycuda.driver.memcpy_dtod(dest.gpudata, src.gpudata, dest.nbytes)
Traceback (most recent call last):
  File "/home/greff/Programming/sacred/sacred/experiment.py", line 165, in run_commandline
    args)
  File "/home/greff/Programming/sacred/sacred/experiment.py", line 138, in run_command
    run()
  File "/home/greff/Programming/sacred/sacred/run.py", line 144, in __call__
    self.result = self.main_function(*args)
  File "/home/greff/Programming/sacred/sacred/config/captured_function.py", line 47, in captured_function
    result = wrapped(*args, **kwargs)
  File "diag_hutter.py", line 87, in run
    trainer.train(network, getter_tr, valid_getter=getter_va)
  File "/home/greff/Programming/brainstorm/brainstorm/training/trainer.py", line 99, in train
    self.stepper.run()
  File "/home/greff/Programming/brainstorm/brainstorm/training/steppers.py", line 103, in run
    self.net.forward_pass(training_pass=True)
  File "/home/greff/Programming/brainstorm/brainstorm/structure/network.py", line 431, in forward_pass
    layer.forward_pass(self.buffer[layer_name], training_pass)
  File "/home/greff/Programming/brainstorm/brainstorm/layers/loss_layer.py", line 46, in forward_pass
    buffers.outputs.loss.reshape(tuple()))
  File "/home/greff/Programming/brainstorm/brainstorm/handlers/pycuda_handler.py", line 378, in sum_t
    self.copy_to(cumisc.sum(a), out)
  File "/home/greff/Programming/brainstorm/brainstorm/handlers/pycuda_handler.py", line 84, in copy_to
    pycuda.driver.memcpy_dtod(dest.gpudata, src.gpudata, dest.nbytes)
pycuda._driver.LogicError: cuMemcpyDtoD failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: an illegal memory access was encountered
sys:1: ResourceWarning: unclosed <socket.socket fd=23, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 35871), raddr=('127.0.0.1', 5006)>
/home/greff/venv/py3/lib/python3.4/importlib/_bootstrap.py:2150: ImportWarning: sys.meta_path is empty

flukeskywalker commented 8 years ago

Our usage of skcuda.sum() (the cumisc.sum(a) call in sum_t in the traceback) was rather weird, because it worked around a bug in scikit-cuda. The fix above will hopefully avoid such crashes. Let's keep this open for a while to see if the bug reappears.
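
The thread does not establish the exact failure mode, so the following is only a defensive sketch, not the fix that was applied. It assumes that one way a device-to-device copy like the one in copy_to can go wrong is a size or dtype mismatch between the reduction result and the output buffer. check_copy_compatible and FakeArray are hypothetical names introduced for illustration. The check relies only on the dtype and nbytes attributes, which both NumPy arrays and PyCUDA GPUArray objects expose.

```python
from collections import namedtuple

# Hypothetical stand-in for an array object; real code would pass
# pycuda.gpuarray.GPUArray instances, which also expose dtype and nbytes.
FakeArray = namedtuple("FakeArray", ["dtype", "nbytes"])


def check_copy_compatible(dest, src):
    """Raise ValueError unless src can be copied byte-for-byte into dest."""
    if src.dtype != dest.dtype:
        raise ValueError(
            "dtype mismatch: src=%s, dest=%s" % (src.dtype, dest.dtype))
    if src.nbytes != dest.nbytes:
        raise ValueError(
            "size mismatch: src has %d bytes, dest has %d bytes"
            % (src.nbytes, dest.nbytes))


# Hypothetical use inside a handler's copy_to (needs PyCUDA and a GPU):
#     check_copy_compatible(dest, src)
#     pycuda.driver.memcpy_dtod(dest.gpudata, src.gpudata, dest.nbytes)

# A matching pair passes silently; a mismatched pair raises ValueError.
check_copy_compatible(FakeArray("float32", 4), FakeArray("float32", 4))
```

A check like this turns a silent out-of-bounds copy into a Python exception at the call site. It would not catch an illegal access caused by an earlier kernel, because CUDA reports such errors asynchronously at a later API call.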