deep-learning-with-pytorch / dlwpt-code

Code for the book Deep Learning with PyTorch by Eli Stevens, Luca Antiga, and Thomas Viehmann.
https://www.manning.com/books/deep-learning-with-pytorch

p2ch12 (training.py)- Training stops without error #50

Open ghost opened 3 years ago

ghost commented 3 years ago

When I run `python -m training --balanced --epochs 11`, the training process shuts down automatically on epoch 3 without an error message. I have tried many times and get the same result, which is confusing because there is no error message. Environment: Conda 1.9.12, PyTorch 1.7.0, CUDA 10.2, RAM 32 GB, GPU RTX 2080 Ti. I think I am hitting the same issue as #17.

melhzy commented 3 years ago

Same issue for me. I ran the code in a Jupyter notebook and it reported that the kernel died, but when I run the training from the command line there are no error messages at all. This looks like the same issue, unanswered earlier: https://github.com/deep-learning-with-pytorch/dlwpt-code/issues/17

I set the training epochs to 20, but training always stops at epoch 2.

navpreetnp7 commented 3 years ago

@Russell-Chang @melhzy Hello, I created issue #17. I was able to resolve it by reducing the number of workers and the batch size. The issue is caused by running out of memory. It doesn't show up exactly in Task Manager, but that is what's happening. To check, go to the Performance tab in Task Manager, click the drop-down above one of the GPU graphs (e.g. "3D" or "Copy"), and select "Cuda". You can experiment with the batch size and number of workers to go as high as your system allows without crashing. I used a batch size of 32 with 4 workers and the training completed for me, although it took some hours.
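The fix above can be sketched with a generic `TensorDataset` standing in for the book's `LunaDataset` — the dataset class itself doesn't matter, since `batch_size` and `num_workers` are plain `DataLoader` arguments:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data; the book's LunaDataset would be passed the same way.
ds = TensorDataset(torch.randn(256, 1, 16, 16),
                   torch.zeros(256, dtype=torch.long))

# Smaller batch_size and num_workers lower peak RAM/VRAM use.
# batch_size=32 with num_workers=4 worked in the comment above;
# num_workers=0 loads in the main process and uses the least memory.
loader = DataLoader(ds, batch_size=32, num_workers=0, shuffle=True)

xb, yb = next(iter(loader))
print(xb.shape)  # torch.Size([32, 1, 16, 16])
```

Each worker process keeps its own copy of prefetched batches in RAM, so workers and batch size multiply together into the peak footprint — lowering either one helps.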

melhzy commented 3 years ago

Thanks, @navpreetnp7. I see my CUDA utilization stays at 99% while training. Do you mean that if the algorithm pushes CUDA past its limit, the Python kernel will be forced to stop?
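Note that the "Cuda" graph in Task Manager shows utilization, and 99% utilization is normal and harmless — it is running out of memory that kills the process. You can watch PyTorch's own view of CUDA memory from Python instead of Task Manager; here is a small hypothetical helper (not from the book's code) built on the `torch.cuda` memory counters:

```python
import torch

def log_cuda_mem(tag):
    """Hypothetical helper: print PyTorch's CUDA memory counters in MB.
    Call it before/after a batch to see how close you are to the
    card's VRAM limit (e.g. 11 GB on an RTX 2080 Ti)."""
    if not torch.cuda.is_available():
        print(f"{tag}: CUDA not available")
        return
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    total = torch.cuda.get_device_properties(0).total_memory / 2**20
    print(f"{tag}: allocated={alloc:.0f} MB, "
          f"reserved={reserved:.0f} MB of {total:.0f} MB")

log_cuda_mem("before batch")
```

If VRAM is exhausted, PyTorch usually raises a visible `CUDA out of memory` error; a silent shutdown like the one reported here more often points to system RAM being exhausted by the dataloader workers, which the OS kills without a Python traceback.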

EliottGDFY commented 1 year ago

You can test whether the error comes from your dataloader picking up hidden files such as `.ipynb_checkpoints`: write a script that loops over the dataloader and see if it crashes.
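That suggestion can be sketched as a short checker; `check_loader` is a hypothetical name, and the synthetic loader at the bottom is only a smoke test — point it at the real training loader you are debugging:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def check_loader(loader):
    """Iterate once through a DataLoader, reporting the batch index at
    which loading fails (e.g. an unreadable file swept in from a stray
    .ipynb_checkpoints directory)."""
    it = iter(loader)
    n = 0
    while True:
        try:
            batch = next(it)  # failures surface here, during loading
        except StopIteration:
            break
        except Exception as exc:
            print(f"loading batch {n} failed: {exc!r}")
            raise
        n += 1
    print(f"all {n} batches loaded cleanly")

# Smoke test on synthetic data; substitute your real loader here.
check_loader(DataLoader(TensorDataset(torch.arange(10.0)), batch_size=4))
# -> all 3 batches loaded cleanly
```

If a batch fails, dividing the dataset indices in half and re-running narrows it down to the offending sample quickly.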