Thanks for your attention. It looks like this issue is related to an in-place operation on tensors, possibly caused by nn.Dropout2d(). However, I haven't encountered this problem in my experiments and couldn't reproduce it in my environment, so could you please share more information?
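For context, a minimal, standalone sketch of the class of error being described (this is not CorrMatch code, just the general pattern behind the RuntimeError reported below):

```python
import torch

# Minimal sketch of the generic failure mode, not CorrMatch code:
# sigmoid saves its output for the backward pass, so editing that
# output in place bumps its version counter and backward() raises
# "one of the variables needed for gradient computation has been
# modified by an inplace operation".
w = torch.randn(3, requires_grad=True)
x = torch.randn(3)
y = (w * x).sigmoid()
y.add_(1.0)            # in-place edit of a tensor autograd still needs
y.sum().backward()     # RuntimeError is raised here
```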
I just replaced some code in the main function of corrmatch.py to set up my GPU device.
Previous code:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
Changed code:
os.environ["LOCAL_RANK"] = '{}'.format(args.local_rank)
torch.cuda.set_device(args.local_rank)
My torch and CUDA versions: torch==1.13.1+cu116 torchaudio==0.13.1+cu116 torchvision==0.14.1+cu116
I also changed train.sh like this:
#!/bin/bash
now=$(date +"%Y%m%d_%H%M%S")

dataset='pascal'
method='corrmatch'
exp='r101'
split='732'

config=configs/${dataset}.yaml
labeled_id_path=partitions/$dataset/$split/labeled.txt
unlabeled_id_path=partitions/$dataset/$split/unlabeled.txt
save_path=exp/$dataset/$method/$exp/$split

mkdir -p $save_path

python -m torch.distributed.launch \
    --nproc_per_node=$1 \
    --master_addr=localhost \
    --master_port=$2 \
    $method.py \
    --config=$config --labeled-id-path $labeled_id_path --unlabeled-id-path $unlabeled_id_path \
    --local_rank $3 --save-path $save_path --port $2 2>&1 | tee $save_path/$now.log
Additionally, when I set up the environment following CorrMatch's instructions, this error still occurs.
Hi there. I have noticed that in your train.sh you manually specified the local rank instead of using the default setting in DDP. How did you launch train.sh? If you use sh tools/train.sh <gpu num> <port> <local rank>, like sh tools/train.sh 2 23555 0, all the GPU devices would be regarded as local rank 0, which would lead to inconsistency issues.
In our default train.sh, we did not specify the local rank manually and let the DDP launcher set it automatically.
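For reference, a hedged sketch of what the automatic per-process setup usually looks like, assuming a launcher that exports LOCAL_RANK (torchrun, or torch.distributed.launch with --use_env); this is illustrative and not the repo's actual code:

```python
import os
import torch
import torch.distributed as dist

# Illustrative sketch, not CorrMatch's code: the launcher starts one
# process per GPU and exports a distinct LOCAL_RANK for each, so no
# rank is hard-coded on the command line. (The older
# torch.distributed.launch passes --local_rank as an argument instead,
# unless --use_env is given.)
def setup_ddp():
    local_rank = int(os.environ["LOCAL_RANK"])  # different in every process
    torch.cuda.set_device(local_rank)           # bind this process to its own GPU
    dist.init_process_group(backend="nccl")     # rank/world size come from the env
    return local_rank
```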
I run the experiment on a single GPU, so I added that code to set a specific GPU device.
For example: sh tools/train.sh 1 23555 0
Oh, I have used the original code with distributed training (2 GPUs) and it works. Thanks for your support.
I have a problem with a backward error. Please help me fix it. Error:
"""
/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
warnings.warn(
[2024-07-04 02:48:40,721][ INFO] {'backbone': 'resnet101', 'batch_size': 2, 'criterion': {'kwargs': {'ignore_index': 255}, 'name': 'CELoss'}, 'crop_size': 321, 'data_root': './data/VOC2012', 'dataset': 'pascal', 'dilations': [12, 24, 36], 'epochs': 80, 'lr': 0.001, 'lr_multi': 10.0, 'multi_grid': False, 'nclass': 21, 'pretrain': True, 'replace_stride_with_dilation': [False, True, True], 'thresh_init': 0.85}
[2024-07-04 02:48:42,020][ INFO] Total params: 64.2M
[2024-07-04 02:48:44,331][ INFO] ===========> Epoch: 0, LR: 0.0010, Previous best: 0.00
0%| | 0/4925 [00:00<?, ?it/s]/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/autograd/__init__.py:197: UserWarning: Error detected in CudnnBatchNormBackward0. Traceback of forward call that caused the error:
File "/root/thn/semi/CorrMatch/corrmatch.py", line 313, in <module>
main()
File "/root/thn/semi/CorrMatch/corrmatch.py", line 167, in main
res_w = model(torch.cat((img_x, img_u_w)), need_fp=True, use_corr=True)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/thn/semi/CorrMatch/model/semseg/deeplabv3plus.py", line 60, in forward
feats_decode = self._decode(torch.cat((c1, nn.Dropout2d(0.5)(c1))), torch.cat((c4, nn.Dropout2d(0.5)(c4))))
File "/root/thn/semi/CorrMatch/model/semseg/deeplabv3plus.py", line 96, in _decode
feature = self.fuse(feature)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 740, in forward
return F.batch_norm(
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/functional.py", line 2450, in batch_norm
return torch.batch_norm(
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/fx/traceback.py", line 57, in format_stack
return traceback.format_stack()
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "/root/thn/semi/CorrMatch/corrmatch.py", line 313, in
main()
File "/root/thn/semi/CorrMatch/corrmatch.py", line 261, in main
loss.backward()
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
0%| | 0/4925 [00:16<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 722706) of binary: /root/thn/miniconda3/envs/semi/bin/python
Traceback (most recent call last):
File "/root/thn/miniconda3/envs/semi/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/thn/miniconda3/envs/semi/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in
main()
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
"""