Single GPU training script error

emushtaq commented 6 years ago

Hello

I am trying to run the training script on the SVHN dataset with the following command: python chainer/train_svhn.py curriculum.json /logs --char-map datasets/svhn/svhn_char_map.json --blank-label 0 -b 10 -g 0

I am running it on a single GPU and followed the steps for single-GPU training described in https://github.com/Bartzi/see/issues/6.
I am using CUDA 9.0 and the matching cupy-cuda90 library. Chainer shows True for chainer.cuda.available and chainer.cuda.cudnn_enabled.

I get the following error

/usr/local/lib/python3.5/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "chainer/train_svhn.py", line 147, in <module>
    updater = StandardUpdater(iterator=train_iterators, optimizer=optimizer, device=args.gpus)
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 144, in __init__
    if device is not None and device >= 0:
TypeError: unorderable types: list() >= int()
Exception ignored in: <bound method MultiprocessIterator.__del__ of <chainer.iterators.multiprocess_iterator.MultiprocessIterator object at 0x7fbddd666c50>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/chainer/iterators/multiprocess_iterator.py", line 117, in __del__
  File "/usr/local/lib/python3.5/dist-packages/chainer/iterators/multiprocess_iterator.py", line 242, in terminate
AttributeError: 'NoneType' object has no attribute 'STATUS_TERMINATE'
Exception ignored in: <bound method MultiprocessIterator.__del__ of <chainer.iterators.multiprocess_iterator.MultiprocessIterator object at 0x7fbddd666d68>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/chainer/iterators/multiprocess_iterator.py", line 117, in __del__
  File "/usr/local/lib/python3.5/dist-packages/chainer/iterators/multiprocess_iterator.py", line 242, in terminate
AttributeError: 'NoneType' object has no attribute 'STATUS_TERMINATE'

Please help resolve. Thanks!

Bartzi commented 6 years ago

This is your problem: TypeError: unorderable types: list() >= int(). It means that you gave a list with one element to the StandardUpdater. You have to provide a single integer instead, e.g. by writing StandardUpdater(iterator=train_iterators, optimizer=optimizer, device=args.gpus[0])
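
A minimal sketch of the failure and the fix described above. It assumes the `-g` option is declared so that `args.gpus` is always a list; the parser below is a hypothetical stand-in, not the script's actual argument definitions.

```python
import argparse

# Hypothetical stand-in for the script's parser: "-g 0" yields the list [0].
parser = argparse.ArgumentParser()
parser.add_argument("-g", "--gpus", type=int, nargs="*", default=[])
args = parser.parse_args(["-g", "0"])

# Python 3 refuses to order a list against an int, which is the reported error.
comparison_error = None
try:
    args.gpus >= 0
except TypeError as error:
    comparison_error = str(error)

# A single-device updater wants a plain integer, so take the first entry.
device = args.gpus[0]
print(comparison_error)
print(device)
```

Indexing with `[0]` only makes sense when exactly one GPU id is passed; with an empty list it would raise an IndexError.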

emushtaq commented 6 years ago

Thanks!

I made a few more changes to get it working on one GPU: I changed the lines https://github.com/Bartzi/see/blob/edcde78993dfde0f79d120252b7edfd440944a9b/chainer/train_svhn.py#L193 and https://github.com/Bartzi/see/blob/edcde78993dfde0f79d120252b7edfd440944a9b/chainer/train_svhn.py#L206 to use updater.device.

I now have a Cupy NVRTC error.

Exception in main training loop: nvrtc: error: failed to load builtins
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 299, in run
    update()
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 223, in update
    self.update_core()
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 234, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/usr/local/lib/python3.5/dist-packages/chainer/optimizer.py", line 534, in update
    loss = lossfun(*args, **kwds)
  File "/workdir/workspace/see/chainer/utils/multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "/workdir/workspace/see/chainer/models/svhn.py", line 209, in __call__
    h = self.localization_net(images)
  File "/workdir/workspace/see/chainer/models/svhn.py", line 41, in __call__
    h = self.bn0(self.conv0(images))
  File "/usr/local/lib/python3.5/dist-packages/chainer/links/connection/convolution_2d.py", line 154, in __call__
    self._initialize_params(x.shape[1])
  File "/usr/local/lib/python3.5/dist-packages/chainer/links/connection/convolution_2d.py", line 141, in _initialize_params
    self.W.initialize(W_shape)
  File "/usr/local/lib/python3.5/dist-packages/chainer/variable.py", line 1250, in initialize
    data = initializers.generate_array(self.initializer, shape, xp)
  File "/usr/local/lib/python3.5/dist-packages/chainer/initializers/__init__.py", line 46, in generate_array
    initializer(array)
  File "/usr/local/lib/python3.5/dist-packages/chainer/initializers/normal.py", line 68, in __call__
    Normal(s)(array)
  File "/usr/local/lib/python3.5/dist-packages/chainer/initializers/normal.py", line 36, in __call__
    array[...] = xp.random.normal(**args)
  File "/usr/local/lib/python3.5/dist-packages/cupy/random/distributions.py", line 94, in normal
    cupy.multiply(x, scale, out=x)
  File "/usr/local/lib/python3.5/dist-packages/cupy/core/fusion.py", line 713, in __call__
    return self._cupy_op(*args, **kwargs)
  File "cupy/core/elementwise.pxi", line 826, in cupy.core.core.ufunc.__call__
  File "cupy/util.pyx", line 39, in cupy.util.memoize.decorator.ret
  File "cupy/core/elementwise.pxi", line 625, in cupy.core.core._get_ufunc_kernel
  File "cupy/core/elementwise.pxi", line 33, in cupy.core.core._get_simple_elementwise_kernel
  File "cupy/core/carray.pxi", line 146, in cupy.core.core.compile_with_cache
  File "/usr/local/lib/python3.5/dist-packages/cupy/cuda/compiler.py", line 135, in compile_with_cache
    base = _preprocess('', options, arch)
  File "/usr/local/lib/python3.5/dist-packages/cupy/cuda/compiler.py", line 98, in _preprocess
    result = prog.compile(options)
  File "/usr/local/lib/python3.5/dist-packages/cupy/cuda/compiler.py", line 245, in compile
    raise CompileException(log, self.src, self.name, options)
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/cupy/cuda/compiler.py", line 241, in compile
    nvrtc.compileProgram(self.ptr, options)
  File "cupy/cuda/nvrtc.pyx", line 98, in cupy.cuda.nvrtc.compileProgram
  File "cupy/cuda/nvrtc.pyx", line 108, in cupy.cuda.nvrtc.compileProgram
  File "cupy/cuda/nvrtc.pyx", line 53, in cupy.cuda.nvrtc.check_status
cupy.cuda.nvrtc.NVRTCError: NVRTC_ERROR unknown (7)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "chainer/train_svhn.py", line 258, in <module>
    trainer.run()
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 313, in run
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 299, in run
    update()
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 223, in update
    self.update_core()
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 234, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/usr/local/lib/python3.5/dist-packages/chainer/optimizer.py", line 534, in update
    loss = lossfun(*args, **kwds)
  File "/workdir/workspace/see/chainer/utils/multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "/workdir/workspace/see/chainer/models/svhn.py", line 209, in __call__
    h = self.localization_net(images)
  File "/workdir/workspace/see/chainer/models/svhn.py", line 41, in __call__
    h = self.bn0(self.conv0(images))
  File "/usr/local/lib/python3.5/dist-packages/chainer/links/connection/convolution_2d.py", line 154, in __call__
    self._initialize_params(x.shape[1])
  File "/usr/local/lib/python3.5/dist-packages/chainer/links/connection/convolution_2d.py", line 141, in _initialize_params
    self.W.initialize(W_shape)
  File "/usr/local/lib/python3.5/dist-packages/chainer/variable.py", line 1250, in initialize
    data = initializers.generate_array(self.initializer, shape, xp)
  File "/usr/local/lib/python3.5/dist-packages/chainer/initializers/__init__.py", line 46, in generate_array
    initializer(array)
  File "/usr/local/lib/python3.5/dist-packages/chainer/initializers/normal.py", line 68, in __call__
    Normal(s)(array)
  File "/usr/local/lib/python3.5/dist-packages/chainer/initializers/normal.py", line 36, in __call__
    array[...] = xp.random.normal(**args)
  File "/usr/local/lib/python3.5/dist-packages/cupy/random/distributions.py", line 94, in normal
    cupy.multiply(x, scale, out=x)
  File "/usr/local/lib/python3.5/dist-packages/cupy/core/fusion.py", line 713, in __call__
    return self._cupy_op(*args, **kwargs)
  File "cupy/core/elementwise.pxi", line 826, in cupy.core.core.ufunc.__call__
  File "cupy/util.pyx", line 39, in cupy.util.memoize.decorator.ret
  File "cupy/core/elementwise.pxi", line 625, in cupy.core.core._get_ufunc_kernel
  File "cupy/core/elementwise.pxi", line 33, in cupy.core.core._get_simple_elementwise_kernel
  File "cupy/core/carray.pxi", line 146, in cupy.core.core.compile_with_cache
  File "/usr/local/lib/python3.5/dist-packages/cupy/cuda/compiler.py", line 135, in compile_with_cache
    base = _preprocess('', options, arch)
  File "/usr/local/lib/python3.5/dist-packages/cupy/cuda/compiler.py", line 98, in _preprocess
    result = prog.compile(options)
  File "/usr/local/lib/python3.5/dist-packages/cupy/cuda/compiler.py", line 245, in compile
    raise CompileException(log, self.src, self.name, options)
cupy.cuda.compiler.CompileException: nvrtc: error: failed to load builtins

Bartzi commented 6 years ago

Hmm, seems like your CUDA environment is either not correctly installed, or your paths to the CUDA toolkit are not set correctly... but that is just a guess. It's definitely a problem with your development environment.
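
One way to start checking the paths mentioned above is to print the settings that usually decide which CUDA toolkit gets picked up. This is only a sketch using the standard library; the function name `describe_cuda_environment` is made up for illustration, and it cannot prove that the installation itself is consistent.

```python
import os
import shutil


def describe_cuda_environment(environ=os.environ):
    """Collect settings that commonly influence which CUDA toolkit is used."""
    return {
        # CuPy consults CUDA_PATH when locating the toolkit.
        "CUDA_PATH": environ.get("CUDA_PATH"),
        # The dynamic loader searches these directories for shared libraries.
        "LD_LIBRARY_PATH": environ.get("LD_LIBRARY_PATH"),
        # The nvcc found on PATH hints at which toolkit is the default one.
        "nvcc": shutil.which("nvcc", path=environ.get("PATH")),
    }


for name, value in describe_cuda_environment().items():
    print("{}: {}".format(name, value or "<not set>"))
```

If the reported locations point at a different CUDA version than the one the installed cupy package was built for, that mismatch would be a plausible first thing to fix.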

emushtaq commented 6 years ago

Ok. I’ll try setting it up again and giving it another go. Closing till then. Thanks for your help.

emushtaq commented 6 years ago

Tried a completely new setup.

I changed the following to make it run on a single GPU: updater = StandardUpdater(iterator=train_iterators, optimizer=optimizer, device=args.gpus[0]). It seems to be causing a segmentation fault. Any idea why this may be happening?

emushtaq commented 6 years ago

I was running the training script with the flag -g 0 in the single GPU case. This seems to be the reason for the above error.

emushtaq commented 6 years ago

After resolving a few environment issues, I stumbled into this error. Help appreciated.

CMD: python chainer/train_svhn.py curriculum.json /logs --char-map datasets/svhn/svhn_char_map.json --blank-label 0 -b 10

Exception in main training loop: list indices must be integers or slices, not str
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 296, in run
    while not stop_trigger(self):
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/triggers/interval_trigger.py", line 51, in __call__
    epoch_detail = updater.epoch_detail
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 159, in epoch_detail
    return self._iterators['main'].epoch_detail
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 313, in run
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 296, in run
    while not stop_trigger(self):
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/triggers/interval_trigger.py", line 51, in __call__
    epoch_detail = updater.epoch_detail
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 159, in epoch_detail
    return self._iterators['main'].epoch_detail
TypeError: list indices must be integers or slices, not str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "chainer/train_svhn.py", line 258, in <module>
    trainer.run()
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 319, in run
    self.updater.finalize()
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 177, in finalize
    for iterator in six.itervalues(self._iterators):
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 584, in itervalues
    return iter(d.values(**kw))
AttributeError: 'list' object has no attribute 'values'

Bartzi commented 6 years ago

Yeah, I see the problem. You changed the Updater and I did not tell you that you'll also need to change this line to train_iterators = chainer.iterators.MultiprocessIterator(gpu_datasets[0], args.batch_size). The StandardUpdater cannot handle a list of iterators; it needs exactly one.
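
The traceback above can be reproduced without Chainer. The sketch below is a simplified imitation of the behaviour described here, not Chainer's actual code: `FakeIterator` and `wrap_iterators` are hypothetical names, and the only assumption is that a single iterator ends up stored under the key 'main' while anything else is kept as passed.

```python
class FakeIterator:
    """Hypothetical stand-in for a Chainer iterator."""
    epoch_detail = 0.0


def wrap_iterators(iterator):
    # A single iterator is stored under the key 'main';
    # any other object (such as a list) is kept unchanged.
    if isinstance(iterator, FakeIterator):
        return {"main": iterator}
    return iterator


# Passing one iterator works: the lookup by 'main' succeeds.
single = wrap_iterators(FakeIterator())
print(single["main"].epoch_detail)

# Passing a list fails on the same lookup, as in the reported TypeError.
as_list = wrap_iterators([FakeIterator()])
message = None
try:
    as_list["main"]
except TypeError as error:
    message = str(error)
print(message)
```

The later AttributeError about 'values' follows from the same cause, since a list has no values() method either.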

emushtaq commented 6 years ago

Ah. But this leads to a segmentation fault in the trainer.run() call. Not sure what's happening.

Bartzi commented 6 years ago

interesting, maybe it works better with a Docker Container?

emushtaq commented 6 years ago

I'll try the docker container and get back.

emushtaq commented 6 years ago

Tried the docker file to start fresh, still having the same error. This is the gdb backtrace of the segmentation fault.

[New Thread 0x7fff015e5700 (LWP 1807)]
[New Thread 0x7fff00de4700 (LWP 1808)]
[New Thread 0x7fff005e3700 (LWP 1809)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff7de373c in elf_machine_rela (skip_ifunc=0, reloc_addr_arg=0x7fffd590ef40, version=0x48, sym=0x7fffd56034c0, reloc=0x7fffd5618640, map=0xeab900) at ../sysdeps/x86_64/dl-machine.h:301
301 ../sysdeps/x86_64/dl-machine.h: No such file or directory.

Bartzi commented 6 years ago

:man_shrugging: I don't know... have you tried googling the error?

emushtaq commented 6 years ago

Hmm, OK. That hadn't really proved fruitful. Just updating with more logs from the thread that had the segmentation fault, this time using faulthandler:


Current thread 0x00007f9d8c79e700 (most recent call first):
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraypad.py", line 142 in _append_const
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraypad.py", line 1371 in pad
  File "/usr/local/lib/python3.5/dist-packages/chainer/utils/conv.py", line 76 in im2col_cpu
  File "/usr/local/lib/python3.5/dist-packages/chainer/functions/pooling/max_pooling_2d.py", line 20 in forward_cpu
  File "/usr/local/lib/python3.5/dist-packages/chainer/function_node.py", line 338 in forward
  File "/usr/local/lib/python3.5/dist-packages/chainer/function_node.py", line 245 in apply
  File "/usr/local/lib/python3.5/dist-packages/chainer/functions/pooling/max_pooling_2d.py", line 303 in max_pooling_2d
  File "/workdir/see/chainer/models/svhn.py", line 45 in __call__
  File "/workdir/see/chainer/models/svhn.py", line 209 in __call__
  File "/workdir/see/chainer/utils/multi_accuracy_classifier.py", line 44 in __call__
  File "/usr/local/lib/python3.5/dist-packages/chainer/optimizer.py", line 534 in update
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 234 in update_core
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 223 in update
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 299 in run
  File "chainer/train_svhn.py", line 262 in <module>
Segmentation fault (core dumped)
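
For readers who want to collect a similar dump: Python's standard `faulthandler` module prints the Python-level stack when the interpreter receives a fatal signal such as SIGSEGV. The sketch below writes the dump to a log file instead of the terminal; the file name is a hypothetical choice, and how the handler was enabled in the run above is not stated in the thread.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; the file must stay open while the handler is active.
crash_log_path = os.path.join(tempfile.gettempdir(), "segfault_traceback.log")
crash_log = open(crash_log_path, "w")

# On a fatal signal, dump the Python traceback of every thread to the log file.
faulthandler.enable(file=crash_log, all_threads=True)
print(faulthandler.is_enabled())
```

Placing these lines at the top of the training script covers everything that runs afterwards. Without editing the script, starting the interpreter with `python -X faulthandler` or setting the `PYTHONFAULTHANDLER` environment variable enables the same handler with output on stderr.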

Bartzi commented 6 years ago

There is something going wrong in numpy; maybe your installation is faulty? Is something on your machine causing the trouble?

emushtaq commented 6 years ago

Yes, it's getting pretty hard to debug these environmental issues. Anyways, I tried to uninstall numpy and installed it again. This is the new error stack

Current thread 0x00007eff1e418700 (most recent call first):
  File "/usr/local/lib/python3.5/dist-packages/chainer/functions/normalization/batch_normalization.py", line 178 in forward
  File "/usr/local/lib/python3.5/dist-packages/chainer/function.py", line 135 in forward
  File "/usr/local/lib/python3.5/dist-packages/chainer/function_node.py", line 245 in apply
  File "/usr/local/lib/python3.5/dist-packages/chainer/function.py", line 235 in __call__
  File "/usr/local/lib/python3.5/dist-packages/chainer/functions/normalization/batch_normalization.py", line 128 in backward
  File "/usr/local/lib/python3.5/dist-packages/chainer/function_node.py", line 514 in backward_accumulate
  File "/usr/local/lib/python3.5/dist-packages/chainer/variable.py", line 981 in _backward_main
  File "/usr/local/lib/python3.5/dist-packages/chainer/variable.py", line 880 in backward
  File "/usr/local/lib/python3.5/dist-packages/chainer/optimizer.py", line 539 in update
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 234 in update_core
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/updater.py", line 223 in update
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 299 in run
  File "chainer/train_svhn.py", line 262 in <module>
Segmentation fault (core dumped)

Guessing that it might be chainer, I tried with the newer 4.0.0 version. Was not fruitful, same error.

emushtaq commented 6 years ago

After a lot of attempts at working in a fresh environment (using the included Dockerfile), I have made some progress: it is starting to train. But now I am getting OOM exceptions even with small batch sizes.

My Command:

python3 chainer/train_svhn.py curriculum.json /logs --char-map datasets/svhn/svhn_char_map.json --blank-label 0 -b 8 -g 5

The Error:


  format(optimizer.eps))
epoch       iteration   main/loss   main/accuracy  lr          fast_validation/main/loss  fast_validation/main/accuracy  validation/main/loss  validation/main/accuracy
Exception in main training loop: cudaErrorMemoryAllocation: out of memory
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 302, in run
    entry.extension(self)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/chainer/reporter.py", line 98, in scope
    yield
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 302, in run
    entry.extension(self)
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/extensions/log_report.py", line 83, in __call__
    stats_cpu[name] = float(value)  # copy to CPU
  File "cupy/core/core.pyx", line 1642, in cupy.core.core.ndarray.__float__
  File "cupy/core/core.pyx", line 1698, in cupy.core.core.ndarray.get
  File "cupy/cuda/memory.pyx", line 329, in cupy.cuda.memory.MemoryPointer.copy_to_host
  File "cupy/cuda/runtime.pyx", line 257, in cupy.cuda.runtime.memcpy
  File "cupy/cuda/runtime.pyx", line 137, in cupy.cuda.runtime.check_status
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "chainer/train_svhn.py", line 257, in <module>
    trainer.run()
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 313, in run
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 302, in run
    entry.extension(self)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/chainer/reporter.py", line 98, in scope
    yield
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/trainer.py", line 302, in run
    entry.extension(self)
  File "/usr/local/lib/python3.5/dist-packages/chainer/training/extensions/log_report.py", line 83, in __call__
    stats_cpu[name] = float(value)  # copy to CPU
  File "cupy/core/core.pyx", line 1642, in cupy.core.core.ndarray.__float__
  File "cupy/core/core.pyx", line 1698, in cupy.core.core.ndarray.get
  File "cupy/cuda/memory.pyx", line 329, in cupy.cuda.memory.MemoryPointer.copy_to_host
  File "cupy/cuda/runtime.pyx", line 257, in cupy.cuda.runtime.memcpy
  File "cupy/cuda/runtime.pyx", line 137, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory

SMI output for ref:

+-------------------------------+----------------------+----------------------+
|   5  Tesla M40           Off  | 00000000:88:00.0 Off |                    0 |
| N/A   30C    P8    17W / 250W |      0MiB / 11443MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Bartzi commented 6 years ago

So far so good. Did you make any changes in the code? You should not run out of memory if you are using the original code, the provided SVHN data and a batch size like the one you use.

emushtaq commented 6 years ago

Not really. I made a fresh clone of the repo and just changed the file paths to load the images properly (my Docker-specific settings were causing a 'could not load file' error).

This is the only change made:

in file_dataset.py

def load_image(self, file_name):
    file_name = os.path.basename(file_name)  # new line to correct the file path
    with Image.open(os.path.join(self.base_dir, file_name)) as the_image:
        ...  # remainder of the method is unchanged
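
To illustrate what the added line changes, here is a small standalone sketch. The paths are made up, and the assumption that the ground-truth file stored absolute paths from another machine is a guess, not something stated in the thread. `posixpath` is used so the result is the same on every platform.

```python
import posixpath


def resolve_image_path(base_dir, file_name):
    # Drop any directory part stored alongside the file name ...
    file_name = posixpath.basename(file_name)
    # ... and look the image up directly inside base_dir instead.
    return posixpath.join(base_dir, file_name)


# Hypothetical paths, for illustration only.
stored_name = "/other/machine/crops/1.png"

# Without basename, joining with an absolute path discards base_dir entirely.
print(posixpath.join("/data/svhn", stored_name))

# With basename, the image is expected directly inside base_dir.
print(resolve_image_path("/data/svhn", stored_name))
```

The trade-off is that images with the same file name in different subdirectories can no longer be told apart, so this only works when all images sit directly in the base directory.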

Bartzi commented 6 years ago

Hmm, the only thing I can think of is that some part of the code is keeping a reference to some GPU data... it might help to have a look at the memory usage of the GPU with watch -n 0.5 nvidia-smi and see whether the network seems to be trained for more than one iteration. If that is the case your problem is related to something like that.

Otherwise I don't really know what is causing your problem...
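
As an alternative to watching nvidia-smi by eye, the memory figures can be logged from Python. This is a sketch under assumptions: it relies on nvidia-smi being on the PATH and on its CSV query output, and the helper names `parse_memory_report` and `read_gpu_memory` are made up for illustration.

```python
import subprocess

# Ask nvidia-smi for plain CSV: GPU index, used memory and total memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_report(report):
    """Turn the CSV text into {gpu_index: (used_mib, total_mib)}."""
    usage = {}
    for line in report.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        usage[index] = (used, total)
    return usage


def read_gpu_memory():
    # Requires nvidia-smi on PATH; raises if the call fails.
    report = subprocess.check_output(QUERY).decode()
    return parse_memory_report(report)


# Example with a line shaped like the query output for the GPU shown earlier.
print(parse_memory_report("5, 0, 11443\n"))
```

Calling `read_gpu_memory()` every half second from a second terminal while training runs gives the same information as the watch command. Usage that keeps climbing from one iteration to the next would point towards references to GPU data being kept alive.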

Bartzi commented 6 years ago

You could try to debug it and run each layer of the network in the debugger and examine the memory usage in order to identify the part where you get that problem...

emushtaq commented 6 years ago

OK, thanks, I will give it a shot. But before that, I will try an alternate GPU. Just to double check.

emushtaq commented 6 years ago

Finally got a trained network with a diff GPU! 🎉 Could be that I had issues with GPU references like you suggested. Unsure though. Thanks for all your time. Next Step, Evaluation and visualizing the results :)

Bartzi commented 6 years ago

I hope all goes well!

kartherion commented 6 years ago

I also run into the same problem on a single GPU. I have modified the script (train_svhn.py) as mentioned above, but the terminal outputs this: cupy.cuda.driver.CUDADriverError: CUDA_ERROR_UNKNOWN: unknown error. Do you have any idea?

python ../../chainer/train_svhn.py --char-map ./svhn_char_map.json -b 4 ./crops/curriculum.json  ./log/ --blank-label 0 -g 0
Exception in main training loop: CUDA_ERROR_UNKNOWN: unknown error
Traceback (most recent call last):
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 160, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/optimizer.py", line 640, in update
    loss = lossfun(*args, **kwds)
  File "/home/klwang/Data2/SEE/see/chainer/utils/multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "/home/klwang/Data2/SEE/see/chainer/models/svhn.py", line 209, in __call__
    h = self.localization_net(images)
  File "/home/klwang/Data2/SEE/see/chainer/models/svhn.py", line 41, in __call__
    h = self.bn0(self.conv0(images))
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/links/connection/convolution_2d.py", line 172, in __call__
    self._initialize_params(x.shape[1])
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/links/connection/convolution_2d.py", line 159, in _initialize_params
    self.W.initialize(W_shape)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/variable.py", line 1411, in initialize
    data = initializers.generate_array(self.initializer, shape, xp)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/initializers/__init__.py", line 46, in generate_array
    initializer(array)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/initializers/normal.py", line 68, in __call__
    Normal(s)(array)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/initializers/normal.py", line 36, in __call__
    array[...] = xp.random.normal(**args)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/cupy/random/distributions.py", line 94, in normal
    cupy.multiply(x, scale, out=x)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/cupy/core/fusion.py", line 717, in __call__
    return self._cupy_op(*args, **kwargs)
  File "cupy/core/elementwise.pxi", line 839, in cupy.core.core.ufunc.__call__
  File "cupy/util.pyx", line 39, in cupy.util.memoize.decorator.ret
  File "cupy/core/elementwise.pxi", line 638, in cupy.core.core._get_ufunc_kernel
  File "cupy/core/elementwise.pxi", line 33, in cupy.core.core._get_simple_elementwise_kernel
  File "cupy/core/carray.pxi", line 146, in cupy.core.core.compile_with_cache
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/cupy/cuda/compiler.py", line 166, in compile_with_cache
    ls.add_ptr_data(ptx, six.u('cupy.ptx'))
  File "cupy/cuda/function.pyx", line 203, in cupy.cuda.function.LinkState.add_ptr_data
  File "cupy/cuda/function.pyx", line 205, in cupy.cuda.function.LinkState.add_ptr_data
  File "cupy/cuda/driver.pyx", line 119, in cupy.cuda.driver.linkAddData
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "../../chainer/train_svhn.py", line 257, in <module>
    trainer.run()
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/home/klwang/.local/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/training/updaters/standard_updater.py", line 160, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/optimizer.py", line 640, in update
    loss = lossfun(*args, **kwds)
  File "/home/klwang/Data2/SEE/see/chainer/utils/multi_accuracy_classifier.py", line 44, in __call__
    self.y = self.predictor(*x)
  File "/home/klwang/Data2/SEE/see/chainer/models/svhn.py", line 209, in __call__
    h = self.localization_net(images)
  File "/home/klwang/Data2/SEE/see/chainer/models/svhn.py", line 41, in __call__
    h = self.bn0(self.conv0(images))
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/links/connection/convolution_2d.py", line 172, in __call__
    self._initialize_params(x.shape[1])
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/links/connection/convolution_2d.py", line 159, in _initialize_params
    self.W.initialize(W_shape)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/variable.py", line 1411, in initialize
    data = initializers.generate_array(self.initializer, shape, xp)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/initializers/__init__.py", line 46, in generate_array
    initializer(array)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/initializers/normal.py", line 68, in __call__
    Normal(s)(array)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/chainer/initializers/normal.py", line 36, in __call__
    array[...] = xp.random.normal(**args)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/cupy/random/distributions.py", line 94, in normal
    cupy.multiply(x, scale, out=x)
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/cupy/core/fusion.py", line 717, in __call__
    return self._cupy_op(*args, **kwargs)
  File "cupy/core/elementwise.pxi", line 839, in cupy.core.core.ufunc.__call__
  File "cupy/util.pyx", line 39, in cupy.util.memoize.decorator.ret
  File "cupy/core/elementwise.pxi", line 638, in cupy.core.core._get_ufunc_kernel
  File "cupy/core/elementwise.pxi", line 33, in cupy.core.core._get_simple_elementwise_kernel
  File "cupy/core/carray.pxi", line 146, in cupy.core.core.compile_with_cache
  File "/home/klwang/Software/anaconda2/envs/MXNET3/lib/python3.5/site-packages/cupy/cuda/compiler.py", line 166, in compile_with_cache
    ls.add_ptr_data(ptx, six.u('cupy.ptx'))
  File "cupy/cuda/function.pyx", line 203, in cupy.cuda.function.LinkState.add_ptr_data
  File "cupy/cuda/function.pyx", line 205, in cupy.cuda.function.LinkState.add_ptr_data
  File "cupy/cuda/driver.pyx", line 119, in cupy.cuda.driver.linkAddData
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_UNKNOWN: unknown error

Bartzi commented 6 years ago

Can you run nvidia-smi on your machine? Do any of the CUDA examples work?

kartherion commented 6 years ago

CUDA works fine for me with Caffe, MXNet and other frameworks, and the CUDA examples also run without problems. (Screenshot attached: 2018-05-25 16-21-34)

Bartzi commented 6 years ago

Hmm, good question then. I think it is because of your environment. Did you check that you have the most recent driver and the cudnn version for this driver installed? You could try to reinstall cupy with verbose output and check for anything that seems odd. But other than that I cannot tell you what the problem is.

Bartzi / see

Single GPU training script error #30