CUDA error: an illegal memory access was encountered terminate called after throwing an instance of 'c10::CUDAError' what(): CUDA error: an illegal memory access was encountered #69
(slt4) muhiddin@xvoice:~/SLRT/TwoStreamNetwork$ python -m torch.distributed.launch --nproc_per_node 2 --use_env training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml
/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Overwrite cfg.model.RecognitionNetwork.keypoint_s3d.in_channel -> 79
Overwrite cfg.model.RecognitionNetwork.keypoint_s3d.in_channel -> 79
2024-08-22 09:53:15,237 Train S3D backbone from scratch
2024-08-22 09:53:15,334 Train S3D backbone from scratch
2024-08-22 09:53:16,479 Load visual_backbone_twostream.rgb_stream and visual_head for rgb from results/phoenix-2014_video/ckpts/best.ckpt
2024-08-22 09:53:16,617 Load visual_backbone_twostream.pose_stream and visual_head for pose from results/phoenix-2014_keypoint/ckpts/best.ckpt
2024-08-22 09:53:16,786 # Total parameters = 105264709
2024-08-22 09:53:16,786 # Total trainable parameters = 105226373
2024-08-22 09:53:19,828 Total #=79
2024-08-22 09:53:21,661 Total #=79
2024-08-22 09:53:23,145 learning rate recognition_network=0.001
2024-08-22 09:53:23,150 Epoch 0, Training examples 5672
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[Training] 2/2836 [..............................] - ETA: 6:06:56
Traceback (most recent call last):
File "training.py", line 175, in <module>
output = model(is_train=True, **batch)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/model.py", line 83, in forward
model_outputs = self.recognition_network(is_train=is_train, **recognition_inputs)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/recognition.py", line 354, in forward
s3d_outputs = self.visual_backbone_twostream(x_rgb=sgn_videos, x_pose=sgn_heatmaps, sgn_lengths=sgn_lengths)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/two_stream.py", line 62, in forward
x_pose = pose_layer(x_pose)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/models_3d/S3D/model.py", line 83, in forward
x = self.conv_s(x)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 590, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 586, in _conv_forward
input, weight, bias, self.stride, self.padding, self.dilation, self.groups
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1640811805959/work/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7d0f3a039d62 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c613 (0x7d0f7fc1c613 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7d0f7fc1d022 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7d0f3a023314 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x298c09 (0x7d0fd5298c09 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xae13a9 (0x7d0fd5ae13a9 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7d0fd5ae16c9 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x121ecb (0x5e5962119ecb in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #8: <unknown function> + 0x121ecb (0x5e5962119ecb in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #9: <unknown function> + 0x122218 (0x5e596211a218 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #10: <unknown function> + 0x121677 (0x5e5962119677 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #11: <unknown function> + 0x121328 (0x5e5962119328 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #12: <unknown function> + 0x12136e (0x5e596211936e in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #13: <unknown function> + 0x12136e (0x5e596211936e in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #14: <unknown function> + 0x13f85c (0x5e596213785c in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #15: PyDict_SetItemString + 0x89 (0x5e596213df39 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #16: PyImport_Cleanup + 0xa4 (0x5e596218bfc4 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #17: Py_FinalizeEx + 0x5e (0x5e59621d2d8e in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #18: Py_Main + 0x351 (0x5e59621d5811 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #19: main + 0xe7 (0x5e596210f197 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #20: <unknown function> + 0x29d90 (0x7d1013829d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7d1013829e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x1a733e (0x5e596219f33e in /home/muhiddin/miniconda3/envs/slt4/bin/python)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1387007 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 1387006) of binary: /home/muhiddin/miniconda3/envs/slt4/bin/python
Traceback (most recent call last):
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
training.py FAILED
Failures:
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-22_09:53:42
host : xvoice
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 1387006)
error_file:
traceback : Signal 6 (SIGABRT) received by PID 1387006
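(Side note: following the deprecation warning at the top of the log, I assume the equivalent torchrun invocation would be the line below, since torchrun exports LOCAL_RANK and enables --use_env behaviour by default; this is just my adaptation, not a command from the README:)
torchrun --nproc_per_node 2 training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml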
Machine spec: two NVIDIA GeForce RTX 3090 GPUs (24 GB each), Ubuntu Linux, batch_size=1.
I hope I can get the code running with your guidance.
Hello! Can you help me with this issue? It occurred while I was following the "TwoStream Training" step of the README, which says: "To load two pretrained encoders and train the dual visual encoder, run:
python -m torch.distributed.launch --nproc_per_node 8 --use_env training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml"
I only changed --nproc_per_node to 2 to match my two GPUs, as shown in the command above.
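(If it helps with debugging, I can rerun with synchronous CUDA kernel launches so the stack trace points more precisely at the failing op; this is just a sketch of what I would try, with nothing else changed:)
CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node 2 --use_env training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml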