Hi! I am trying to run TwoStreamNetwork for TwoStream-SLR and need some help.
I have completed the data download and all related preprocessing steps, and successfully ran the
SingleStream Pretraining:
dataset=phoenix-2014t # phoenix-2014t / phoenix-2014 / csl-daily
python -m torch.distributed.launch --nproc_per_node 8 --use_env training.py --config experiments/configs/TwoStream/${dataset}_video.yaml #for videos
python -m torch.distributed.launch --nproc_per_node 8 --use_env training.py --config experiments/configs/TwoStream/${dataset}_keypoint.yaml #for keypoints
However, when I started to run
TwoStream Training:
python -m torch.distributed.launch --nproc_per_node 8 --use_env training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml
I ran into an issue. I have tried creating several different virtual environments, but I always hit the same problem. I am running on a Linux server with two RTX 3090 GPUs (24 GB each), so in the run below I use --nproc_per_node 2 instead of 8. I would appreciate any help getting this to run.
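In case it helps, here is a small snippet I can run to report the exact PyTorch/CUDA/cuDNN versions and the GPUs visible in the slt4 environment; it uses only standard PyTorch calls and nothing specific to the repo, and I can post its output if useful.

```python
import torch

# Print the library versions and the GPUs PyTorch can see in this environment.
print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  [{i}] {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```

Here is the full error output: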
(slt4) muhiddin@xvoice:~/SLRT/TwoStreamNetwork$ python -m torch.distributed.launch --nproc_per_node 2 --use_env training.py --config experiments/configs/TwoStream/${dataset}_s2g.yaml
/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Overwrite cfg.model.RecognitionNetwork.keypoint_s3d.in_channel -> 79
Overwrite cfg.model.RecognitionNetwork.keypoint_s3d.in_channel -> 79
2024-08-21 19:06:00,065 Train S3D backbone from scratch
2024-08-21 19:06:00,161 Train S3D backbone from scratch
2024-08-21 19:06:01,287 Load visual_backbone_twostream.rgb_stream and visual_head for rgb from results/phoenix-2014_video/ckpts/best.ckpt
2024-08-21 19:06:01,442 Load visual_backbone_twostream.pose_stream and visual_head for pose from results/phoenix-2014_keypoint/ckpts/best.ckpt
2024-08-21 19:06:01,609 # Total parameters = 105264709
2024-08-21 19:06:01,610 # Total trainable parameters = 105226373
2024-08-21 19:07:55,099 Total #=79
2024-08-21 19:07:56,965 Total #=79
2024-08-21 19:07:58,433 learning rate recognition_network=0.001
2024-08-21 19:07:58,438 Epoch 0, Training examples 5672
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[Training] 629/2836 [=====>........................] - ETA: 2:35:50Traceback (most recent call last):
File "training.py", line 173, in
output = model(is_train=True, batch)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/model.py", line 83, in forward
model_outputs = self.recognition_network(is_train=is_train, **recognition_inputs)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/recognition.py", line 354, in forward
s3d_outputs = self.visual_backbone_twostream(x_rgb=sgn_videos, x_pose=sgn_heatmaps, sgn_lengths=sgn_lengths)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/two_stream.py", line 97, in forward
pose_fea_lst[i-1] = pose_fea_lst[i-1] + self.pose_stream.pyramid.upsample_layers[num_levels-i-1](...)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/SLRT/TwoStreamNetwork/modelling/pyramid.py", line 126, in forward
x = self.conv_trans_s(x)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 1072, in forward
output_padding, self.groups, self.dilation)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1352078 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1352077) of binary: /home/muhiddin/miniconda3/envs/slt4/bin/python
Traceback (most recent call last):
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/muhiddin/miniconda3/envs/slt4/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
training.py FAILED
Failures:
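From searching around, this cuDNN error ("Unable to find a valid cuDNN algorithm to run convolution") often seems to be a symptom of the GPU running out of memory, and the crash happens mid-epoch (step 629/2836) with only two 24 GB cards instead of the eight GPUs the published commands assume. Below is a rough sketch of what I plan to try next to narrow this down; everything in it is standard PyTorch, but whether it applies here is only my guess.

```python
import torch

# 1) Log how much GPU memory is actually in use shortly before the crash.
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")

# 2) Disable the cuDNN autotuner, which can select algorithms that need a large
#    extra workspace and then fail with "Unable to find a valid cuDNN algorithm".
torch.backends.cudnn.benchmark = False

# 3) As a last resort, disable cuDNN entirely for one run (slower, but it would
#    tell me whether cuDNN itself is the problem).
# torch.backends.cudnn.enabled = False
```

Beyond that, I would probably try lowering the batch size in the ${dataset}_s2g.yaml config, but I am not sure which setting is the right one to change, so any pointers would be very welcome.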