NVIDIA-AI-IOT / jetbot

An educational AI robot based on NVIDIA Jetson Nano.
MIT License

Collision Avoidance - Train_Model.ipynb - RuntimeError: CUDA error: device-side assert triggered #111

Closed: AngelCrusher closed this issue 2 years ago

AngelCrusher commented 5 years ago

We keep getting the same error when following the Collision Avoidance - Train_Model.ipynb notebook, at the last step, "train the neural network for 30 epochs". We have tried deleting the unzipped dataset and the downloaded AlexNet model (cd ~/.torch/models, rm alexnet*.pth) and starting again, but the error is the same. The AlexNet download seemed to fail at least once, and the last few attempts showed 244418560.0 bytes at the end.


RuntimeError Traceback (most recent call last)

in
     13 outputs = model(images)
     14 loss = F.cross_entropy(outputs, labels)
---> 15 loss.backward()
     16 optimizer.step()
     17

/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    100 products. Defaults to ``False``.
    101 """
--> 102 torch.autograd.backward(self, gradient, retain_graph, create_graph)
    103
    104 def register_hook(self, hook):

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     88 Variable._execution_engine.run_backward(
     89 tensors, grad_tensors, retain_graph, create_graph,
---> 90 allow_unreachable=True) # allow_unreachable flag
     91
     92

RuntimeError: cuda runtime error (59) : device-side assert triggered at /home/nvidia/Downloads/tmp/pytorch/aten/src/THC/generic/THCTensorMath.cu:24

HannahZhangVW commented 5 years ago

I had the same error. It turned out that there were hidden files (".ipyxxxx") under the dataset folder and its sub-folders. These hidden files can't be seen in the Jupyter notebook file browser. I downloaded the dataset via the Firefox browser and deleted all the hidden files under dataset on my PC. I also deleted the dataset folder and dataset.zip on the JetBot, then uploaded the cleaned dataset.zip to the JetBot. The error is gone.
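A quick way to confirm this diagnosis before re-uploading anything is to list the directories that a folder-per-class loader would treat as classes. The sketch below assumes the notebook loads data with a loader like torchvision's ImageFolder, which derives one label per subdirectory in sorted order, and that the only intended classes are `blocked` and `free`. Under that assumption, a stray hidden directory adds an extra class and shifts the label indices, so a label can fall outside the range of a two-output model. The function names and the `dataset` path are illustrative, not taken from the notebook.

```python
import os


def list_class_dirs(root):
    """Return the subdirectory names of `root` in sorted order.

    A folder-per-class loader would typically assign label 0, 1, 2, ...
    to these names in this order (assumption about the loader).
    """
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


def unexpected_class_dirs(root, expected=("blocked", "free")):
    """Return subdirectories of `root` that are not expected classes."""
    return [name for name in list_class_dirs(root) if name not in expected]


if __name__ == "__main__" and os.path.isdir("dataset"):
    # "dataset" is the assumed location of the unzipped data.
    print("class directories:", list_class_dirs("dataset"))
    print("unexpected:", unexpected_class_dirs("dataset"))
```

If the second list is not empty, the training labels probably do not match the two outputs of the model, which would be consistent with the assert in the traceback.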

dkokron commented 4 years ago

Removing those .ipynb_checkpoints directories fixed this issue for me. My free/blocked directories also had some zero-length .jpg files.
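The cleanup described in this thread can also be done directly on the JetBot instead of downloading and re-uploading the archive. This is a minimal sketch, assuming the dataset lives in a local `dataset` directory and that anything whose name starts with a dot, plus any zero-length .jpg, is safe to delete. `clean_dataset` is a hypothetical helper, not part of the JetBot notebooks. It deletes files permanently, so keep a copy of the dataset first.

```python
import os
import shutil


def clean_dataset(root):
    """Delete hidden files/directories and zero-length .jpg files under `root`.

    Returns a sorted list of the paths that were removed.
    """
    removed = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # Remove hidden directories such as .ipynb_checkpoints and
        # drop them from `dirnames` so os.walk does not descend into them.
        for name in list(dirnames):
            if name.startswith("."):
                path = os.path.join(dirpath, name)
                shutil.rmtree(path)
                dirnames.remove(name)
                removed.append(path)
        for name in filenames:
            path = os.path.join(dirpath, name)
            is_hidden = name.startswith(".")
            is_empty_jpg = (
                name.lower().endswith(".jpg") and os.path.getsize(path) == 0
            )
            if is_hidden or is_empty_jpg:
                os.remove(path)
                removed.append(path)
    return sorted(removed)


if __name__ == "__main__" and os.path.isdir("dataset"):
    # "dataset" is the assumed location of the unzipped data.
    for path in clean_dataset("dataset"):
        print("removed", path)
```

After running it, restart the notebook kernel before training again, since a device-side assert usually leaves the CUDA context in an unusable state.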

greenfan commented 4 years ago

Removing all hidden files from within the dataset directory resolved the runtime error I was getting while training on the collision avoidance dataset on my JetBot:

---> 15 loss.backward()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /home/nvidia/Downloads/tmp/pytorch/aten/src/THC/generic/THCTensorMath.cu:24

Thanks, dkokron and Hannah.