NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

RuntimeError: CUDA error: an illegal memory access was encountered #682

Open ghost opened 4 years ago

ghost commented 4 years ago

When I run my own program without using mixed precision training, it works well. But when I run it with mixed precision training, I get this message:

Traceback (most recent call last):
  File "train_Selective_Net_GoPro.py", line 118, in <module>
    main(args)
  File "train_Selective_Net_GoPro.py", line 77, in main
    scale_loss.backward()
  File "/home/grd/miniconda3/envs/torch1.1/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/grd/miniconda3/envs/torch1.1/lib/python3.7/site-packages/apex/amp/handle.py", line 127, in scale_loss
    should_skip = False if delay_overflow_check else loss_scaler.update_scale()
  File "/home/grd/miniconda3/envs/torch1.1/lib/python3.7/site-packages/apex/amp/scaler.py", line 200, in update_scale
    self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered
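
A note on reading this traceback: CUDA kernels are launched asynchronously, so an illegal memory access is usually reported at the next point where the host waits on the GPU. `self._overflow_buf.item()` is such a point, which means the faulty kernel may have been launched earlier than `update_scale`. Below is a minimal sketch of re-running the script with synchronous launches so the traceback lands closer to the real failure. It assumes the standard `CUDA_LAUNCH_BLOCKING` environment variable; the helper names `blocking_env` and `rerun_with_blocking` are hypothetical and not part of apex or PyTorch.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    # Copy so the caller's mapping (or os.environ) is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    # With CUDA_LAUNCH_BLOCKING=1 each kernel launch blocks until the kernel
    # finishes, so an error is reported at the launch that caused it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_with_blocking(script, *script_args):
    """Re-run a training script in a child process (hypothetical helper)."""
    cmd = [sys.executable, script, *script_args]
    # Returns the child's exit code; its traceback is printed to stderr as usual.
    return subprocess.run(cmd, env=blocking_env()).returncode
```

For example, `rerun_with_blocking("train_Selective_Net_GoPro.py")` would repeat the failing run; any command-line arguments the script expects are not shown in the report and would need to be passed as extra parameters. Synchronous launches slow training down noticeably, so this is only meant for locating the error.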

YuhaoYeSteve commented 4 years ago

I have the same problem.

tstandley commented 4 years ago

Me too. It happens when I try to train mnasnet with O1 and --channels_last.