FangyunWei / SLRT


CUDA error: an illegal memory access was encountered terminate called after throwing an instance of 'c10::CUDAError' what(): CUDA error: an illegal memory access was encountered #69

Closed muxiddin19 closed 1 month ago

muxiddin19 commented 2 months ago

Hello! Can you help me with an issue that occurs when I run TwoStream training? The instructions say, "To load two pretrained encoders and train the dual visual encoder, run:"

```
python -m torch.distributed.launch --nproc_per_node 8 --use_env training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml
```
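(For completeness: `${dataset}` is a shell variable that has to be set first; judging from the checkpoint paths in my log below, in my case it is:)

```
dataset=phoenix-2014
```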

Here is what I actually ran (with `--nproc_per_node 2` for my two GPUs) and the warnings it printed:

```
(slt4) muhiddin@xvoice:~/SLRT/TwoStreamNetwork$ python -m torch.distributed.launch --nproc_per_node 2 --use_env training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml
/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
```
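As an aside, the FutureWarning above says to migrate to torchrun. If I understand it correctly, the equivalent launch would be the following (torchrun enables the --use_env behavior by default):

```
torchrun --nproc_per_node 2 training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml
```

and the script would then read its rank from the environment instead of a `--local_rank` argument:

```python
import os

# torchrun exports LOCAL_RANK (and RANK, WORLD_SIZE) for every worker process.
local_rank = int(os.environ["LOCAL_RANK"])
```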


Training then starts normally but crashes after the second batch:

```
Overwrite cfg.model.RecognitionNetwork.keypoint_s3d.in_channel -> 79
Overwrite cfg.model.RecognitionNetwork.keypoint_s3d.in_channel -> 79
2024-08-22 09:53:15,237 Train S3D backbone from scratch
2024-08-22 09:53:15,334 Train S3D backbone from scratch
2024-08-22 09:53:16,479 Load visual_backbone_twostream.rgb_stream and visual_head for rgb from results/phoenix-2014_video/ckpts/best.ckpt
2024-08-22 09:53:16,617 Load visual_backbone_twostream.pose_stream and visual_head for pose from results/phoenix-2014_keypoint/ckpts/best.ckpt
2024-08-22 09:53:16,786 # Total parameters = 105264709
2024-08-22 09:53:16,786 # Total trainable parameters = 105226373
2024-08-22 09:53:19,828 Total #=79
2024-08-22 09:53:21,661 Total #=79
2024-08-22 09:53:23,145 learning rate recognition_network=0.001
2024-08-22 09:53:23,150 Epoch 0, Training examples 5672
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[Training] 2/2836 [..............................] - ETA: 6:06:56
```
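Before the traceback below, one aside on the reducer warning: if the model really never has unused parameters, the warning suggests constructing the DDP wrapper with `find_unused_parameters=False`. A minimal sketch of what that would look like (the function name and variables here are mine, not the repo's exact code in training.py):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module, local_rank: int) -> DDP:
    """Wrap a model for DDP without the extra autograd-graph traversal."""
    return DDP(
        model.cuda(local_rank),
        device_ids=[local_rank],
        output_device=local_rank,
        find_unused_parameters=False,  # the failing run had this set to True
    )
```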
```
Traceback (most recent call last):
  File "training.py", line 175, in <module>
    output = model(is_train=True, **batch)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/model.py", line 83, in forward
    model_outputs = self.recognition_network(is_train=is_train, **recognition_inputs)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/recognition.py", line 354, in forward
    s3d_outputs = self.visual_backbone_twostream(x_rgb=sgn_videos, x_pose=sgn_heatmaps, sgn_lengths=sgn_lengths)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/two_stream.py", line 62, in forward
    x_pose = pose_layer(x_pose)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/models_3d/S3D/model.py", line 83, in forward
    x = self.conv_s(x)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 590, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 586, in _conv_forward
    input, weight, bias, self.stride, self.padding, self.dilation, self.groups
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1640811805959/work/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7d0f3a039d62 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c613 (0x7d0f7fc1c613 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7d0f7fc1d022 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7d0f3a023314 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x298c09 (0x7d0fd5298c09 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xae13a9 (0x7d0fd5ae13a9 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7d0fd5ae16c9 in /home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x121ecb (0x5e5962119ecb in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #8: <unknown function> + 0x121ecb (0x5e5962119ecb in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #9: <unknown function> + 0x122218 (0x5e596211a218 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #10: <unknown function> + 0x121677 (0x5e5962119677 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #11: <unknown function> + 0x121328 (0x5e5962119328 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #12: <unknown function> + 0x12136e (0x5e596211936e in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #13: <unknown function> + 0x12136e (0x5e596211936e in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #14: <unknown function> + 0x13f85c (0x5e596213785c in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #15: PyDict_SetItemString + 0x89 (0x5e596213df39 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #16: PyImport_Cleanup + 0xa4 (0x5e596218bfc4 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #17: Py_FinalizeEx + 0x5e (0x5e59621d2d8e in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #18: Py_Main + 0x351 (0x5e59621d5811 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #19: main + 0xe7 (0x5e596210f197 in /home/muhiddin/miniconda3/envs/slt4/bin/python)
frame #20: <unknown function> + 0x29d90 (0x7d1013829d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7d1013829e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x1a733e (0x5e596219f33e in /home/muhiddin/miniconda3/envs/slt4/bin/python)
```
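From what I understand, CUDA errors are reported asynchronously, so the conv frame above may not be the operation that actually faulted. To get an accurate stack I can re-run with blocking kernel launches (same command as before, with one extra environment variable):

```
CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node 2 --use_env training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml
```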

The launcher then shuts everything down:

```
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1387007 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 1387006) of binary: /home/muhiddin/miniconda3/envs/slt4/bin/python
Traceback (most recent call last):
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
training.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-22_09:53:42
  host      : xvoice
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 1387006)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1387006
============================================================
```

My setup: two NVIDIA GeForce RTX 3090 GPUs (24GB each), Ubuntu Linux, batch_size=1. I hope I can get the code running with your guidance.
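One thing I can check on my side: an illegal memory access in the first pose-stream conv could come from a mismatch between the heatmap tensor and the configured keypoint_s3d.in_channel=79. Here is a small debugging sketch I could drop in before the visual_backbone_twostream call in recognition.py (variable names follow the traceback above; the helper itself is hypothetical, not part of the repo, and the 5D layout is my assumption for a 3D CNN like S3D):

```python
import torch

def check_pose_input(sgn_heatmaps: torch.Tensor, expected_channels: int = 79) -> None:
    """Hypothetical pre-flight check for the pose-stream input."""
    # Assumed layout for a 3D CNN input: (B, C, T, H, W).
    assert sgn_heatmaps.dim() == 5, f"expected a 5D tensor, got {tuple(sgn_heatmaps.shape)}"
    # The config overwrite in the log sets keypoint_s3d.in_channel -> 79,
    # so the channel dimension must agree with it.
    assert sgn_heatmaps.shape[1] == expected_channels, tuple(sgn_heatmaps.shape)
    # A CPU tensor or NaN/Inf values sneaking in can also surface later
    # as opaque CUDA errors.
    assert sgn_heatmaps.is_cuda, "pose input is not on the GPU"
    assert torch.isfinite(sgn_heatmaps).all(), "pose input contains NaN/Inf"
```

If that passes, I will also try a single-process run to rule out a DDP/NCCL interaction.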