I ran the following command:

python tools/run_net.py \
  --cfg configs/Kinetics/TimeSformer_divST_8x32_224_4gpus.yaml \
  DATA.PATH_TO_DATA_DIR /home/ubuntu/vit/kinetics-dataset/k400/videos_resized \
  NUM_GPUS 4 \
  TRAIN.BATCH_SIZE 16
During training in epoch 1, I observed the following error:

[06/30 01:54:32][INFO] train_net.py: 446: Start epoch: 1
[06/30 01:54:46][INFO] distributed.py: 995: Reducer buckets have been rebuilt in this iteration.
[06/30 01:54:59][INFO] logging.py: 95: json_stats: {"_type": "train_iter", "dt": 1.46034, "dt_data": 0.00346, "dt_net": 1.45688, "epoch": "1/15", "eta": "7:33:55", "gpu_mem": "7.68G", "iter": "10/1244", "loss": 6.05343, "lr": 0.00500, "top1_err": 100.00000, "top5_err": 100.00000}
[06/30 01:55:14][INFO] logging.py: 95: json_stats: {"_type": "train_iter", "dt": 1.50371, "dt_data": 0.00334, "dt_net": 1.50036, "epoch": "1/15", "eta": "7:47:09", "gpu_mem": "7.68G", "iter": "20/1244", "loss": 6.16927, "lr": 0.00500, "top1_err": 100.00000, "top5_err": 100.00000}
../aten/src/ATen/native/cuda/Loss.cu:271: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed.
terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: device-side assert triggered

This is followed by a lengthy exception trace:

Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:166 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f10084d2612 in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xea8e4a (0x7f1009892e4a in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x33a968 (0x7f1051d51968 in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
.....
My instance has four Tesla T4 GPUs, with Driver Version 510.47.03 and CUDA Version 11.6.
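
Reading the assertion, my guess is that t is a target label and the kernel checks that every label lies in [0, n_classes); if that is right, some label in my data must fall outside the number of classes the model was built with. The following minimal sketch is my own construction (not taken from the TimeSformer code) and, as far as I understand, triggers the same device-side assert:

import torch
import torch.nn as nn

# Logits for a batch of 4 samples over 10 classes.
logits = torch.randn(4, 10, device="cuda")

# Valid labels are 0..9; the 10 below is out of range, which as far as I can
# tell trips the same kernel check (t >= 0 && t < n_classes) inside nll_loss,
# the backend of CrossEntropyLoss.
targets = torch.tensor([1, 3, 10, 2], device="cuda")

loss = nn.CrossEntropyLoss()(logits, targets)
print(loss.item())  # the asynchronous assert surfaces at this sync point

I also understand that running with CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches, so the traceback would point at the failing op rather than at the later createEvent call shown above.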
What does this error mean, and how do I fix it?
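
In case an out-of-range label really is the cause, this is how I plan to sanity-check my annotations. It is a rough sketch under two assumptions: the loader reads space-separated "path label" rows (the file name below is hypothetical), and MODEL.NUM_CLASSES is 400 for this Kinetics-400 config:

import csv

NUM_CLASSES = 400  # assumption: MODEL.NUM_CLASSES for Kinetics-400
ANNOTATIONS = "/home/ubuntu/vit/kinetics-dataset/k400/videos_resized/train.csv"  # hypothetical path

bad_rows = []
with open(ANNOTATIONS) as f:
    # Assumption: each row is "<path_to_video> <integer_label>", space-separated.
    for row in csv.reader(f, delimiter=" "):
        label = int(row[-1])
        if not 0 <= label < NUM_CLASSES:
            bad_rows.append(row)

print(f"{len(bad_rows)} rows with labels outside [0, {NUM_CLASSES})")

If that turns up bad rows, it would also explain why training survives a couple of dozen iterations before dying: the assert would fire only once a bad sample is drawn.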