FangyunWei / SLRT


RuntimeError: Unable to find a valid cuDNN algorithm to run convolution #68

Closed muxiddin19 closed 1 month ago

muxiddin19 commented 2 months ago

Hi! I am trying to run TwoStreamNetwork for TwoStream-SLR and need some help. I have finished all the data downloading and the related preprocessing tasks, and even ran the SingleStream pretraining successfully:

dataset=phoenix-2014t # phoenix-2014t / phoenix-2014 / csl-daily
python -m torch.distributed.launch --nproc_per_node 8 --use_env training.py --config experiments/configs/TwoStream/${dataset}_video.yaml # for videos
python -m torch.distributed.launch --nproc_per_node 8 --use_env training.py --config experiments/configs/TwoStream/${dataset}_keypoint.yaml # for keypoints

However, when I started the TwoStream training:

python -m torch.distributed.launch --nproc_per_node 8 --use_env training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml

I ran into this issue. I tried creating several different virtual environments but hit the same problem every time. I am running on a Linux server with two RTX 3090 GPUs (24 GB each). I would appreciate your support in getting the code to run. Here is the error in full detail:

(slt4) muhiddin@xvoice:~/SLRT/TwoStreamNetwork$ python -m torch.distributed.launch --nproc_per_node 2 --use_env training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml
/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

  FutureWarning,
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
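As the FutureWarning above indicates, torch.distributed.launch is deprecated in favor of torchrun, which sets --use_env by default, so the launched script is expected to read its rank from the environment rather than from a --local_rank argument. A minimal sketch of that pattern (not the repo's actual code; the variable names are only illustrative):

```python
import os
import torch

# torchrun (and torch.distributed.launch --use_env) exports LOCAL_RANK / RANK /
# WORLD_SIZE into the environment of every worker process it spawns.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)          # pin this process to its own GPU
device = torch.device("cuda", local_rank)
print(f"worker local_rank={local_rank}, using device {device}")
```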


Overwrite cfg.model.RecognitionNetwork.keypoint_s3d.in_channel -> 79
Overwrite cfg.model.RecognitionNetwork.keypoint_s3d.in_channel -> 79
2024-08-21 19:06:00,065 Train S3D backbone from scratch
2024-08-21 19:06:00,161 Train S3D backbone from scratch
2024-08-21 19:06:01,287 Load visual_backbone_twostream.rgb_stream and visual_head for rgb from results/phoenix-2014_video/ckpts/best.ckpt
2024-08-21 19:06:01,442 Load visual_backbone_twostream.pose_stream and visual_head for pose from results/phoenix-2014_keypoint/ckpts/best.ckpt
2024-08-21 19:06:01,609 # Total parameters = 105264709
2024-08-21 19:06:01,610 # Total trainable parameters = 105226373
2024-08-21 19:07:55,099 Total #=79
2024-08-21 19:07:56,965 Total #=79
2024-08-21 19:07:58,433 learning rate recognition_network=0.001
2024-08-21 19:07:58,438 Epoch 0, Training examples 5672
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[Training] 629/2836 [=====>........................] - ETA: 2:35:50
Traceback (most recent call last):
  File "training.py", line 173, in <module>
    output = model(is_train=True, **batch)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/model.py", line 83, in forward
    model_outputs = self.recognition_network(is_train=is_train, **recognition_inputs)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/recognition.py", line 354, in forward
    s3d_outputs = self.visual_backbone_twostream(x_rgb=sgn_videos, x_pose=sgn_heatmaps, sgn_lengths=sgn_lengths)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/two_stream.py", line 97, in forward
    pose_fea_lst[i-1] = pose_fea_lst[i-1] + self.pose_stream.pyramid.upsample_layers[num_levels-i-1](...)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/pyramid.py", line 126, in forward
    x = self.conv_trans_s(x)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 1072, in forward
    output_padding, self.groups, self.dilation)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1352078 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1352077) of binary: /home/muhiddin/miniconda3/envs/slt4/bin/python
Traceback (most recent call last):
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

training.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-21_19:53:40
  host      : xvoice
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1352077)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
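The traceback ends in the transposed convolution self.conv_trans_s inside pyramid.py. "Unable to find a valid cuDNN algorithm to run convolution" is commonly reported when cuDNN cannot allocate the workspace it needs (effectively running out of GPU memory once the full two-stream model is loaded) or when the installed CUDA/cuDNN build does not match the GPU driver. A quick, self-contained check along those lines is sketched below; the layer sizes and tensor shape are invented for the test and are not the model's real configuration:

```python
import torch

# Environment check: a CUDA/cuDNN mismatch or exhausted GPU memory are the usual
# suspects behind "Unable to find a valid cuDNN algorithm to run convolution".
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda,
      "| cuDNN:", torch.backends.cudnn.version(),
      "| cuDNN enabled:", torch.backends.cudnn.enabled)

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB total, "
          f"{torch.cuda.memory_allocated(i) / 1024**3:.1f} GiB currently allocated")

# Tiny stand-alone transposed 3D convolution, loosely mirroring the conv_trans_s call
# in pyramid.py. The channel counts and input shape below are made up for the test.
torch.backends.cudnn.benchmark = False  # avoid the algorithm-search path while debugging
layer = torch.nn.ConvTranspose3d(64, 64, kernel_size=(3, 1, 1),
                                 stride=(2, 1, 1), padding=(1, 0, 0),
                                 output_padding=(1, 0, 0)).cuda()
x = torch.randn(1, 64, 8, 7, 7, device="cuda")
y = layer(x)
print("ConvTranspose3d ran fine, output shape:", tuple(y.shape))
```

If even this tiny test fails, the cuDNN build in the environment probably does not match the installed CUDA/driver; if it succeeds but full training still crashes mid-epoch, the GPUs are likely running out of memory for the cuDNN workspace, and reducing the per-GPU batch size or freeing the cards of other jobs would be the next thing to try.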