Thanks for your attention. It looks like this issue is related to an in-place operation on tensors, possibly caused by nn.Dropout2d(). However, I haven't encountered this problem in my experiments and couldn't reproduce it in my environment, so could you please share more information?
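For context, a minimal, standalone sketch of the class of error being described (this is not CorrMatch code, just the general pattern behind the RuntimeError reported below):

```python
import torch

# Minimal sketch of the generic failure mode, not CorrMatch code:
# sigmoid saves its output for the backward pass, so editing that
# output in place bumps its version counter and backward() raises
# "one of the variables needed for gradient computation has been
# modified by an inplace operation".
w = torch.randn(3, requires_grad=True)
x = torch.randn(3)
y = (w * x).sigmoid()
y.add_(1.0)            # in-place edit of a tensor autograd still needs
y.sum().backward()     # RuntimeError is raised here
```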
I just replaced some code in the main function of corrmatch.py to set up my GPU device.
Previous code:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
Changed code:
os.environ["LOCAL_RANK"] = '{}'.format(args.local_rank)
torch.cuda.set_device(args.local_rank)
My torch and CUDA versions: torch==1.13.1+cu116 torchaudio==0.13.1+cu116 torchvision==0.14.1+cu116
I also changed train.sh like this:
#!/bin/bash
now=$(date +"%Y%m%d_%H%M%S")

dataset='pascal'
method='corrmatch'
exp='r101'
split='732'

config=configs/${dataset}.yaml
labeled_id_path=partitions/$dataset/$split/labeled.txt
unlabeled_id_path=partitions/$dataset/$split/unlabeled.txt
save_path=exp/$dataset/$method/$exp/$split

mkdir -p $save_path

python -m torch.distributed.launch \
    --nproc_per_node=$1 \
    --master_addr=localhost \
    --master_port=$2 \
    $method.py \
    --config=$config --labeled-id-path $labeled_id_path --unlabeled-id-path $unlabeled_id_path \
    --local_rank $3 --save-path $save_path --port $2 2>&1 | tee $save_path/$now.log
Additionally, when I set up the environment following CorrMatch's instructions, this error still occurs.
Hi there. I have noticed that in your train.sh you manually specified the local rank instead of using the default setting in DDP. How did you launch train.sh? If you use sh tools/train.sh <gpu num> <port> <local rank>, like sh tools/train.sh 2 23555 0, all the GPU devices would be regarded as local rank 0, which would lead to inconsistency issues.
In our default train.sh, we did not specify the local rank manually and let the DDP launcher set it automatically.
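For reference, a hedged sketch of what the automatic per-process setup usually looks like, assuming a launcher that exports LOCAL_RANK (torchrun, or torch.distributed.launch with --use_env); this is illustrative and not the repo's actual code:

```python
import os
import torch
import torch.distributed as dist

# Illustrative sketch, not CorrMatch's code: the launcher starts one
# process per GPU and exports a distinct LOCAL_RANK for each, so no
# rank is hard-coded on the command line. (The older
# torch.distributed.launch passes --local_rank as an argument instead,
# unless --use_env is given.)
def setup_ddp():
    local_rank = int(os.environ["LOCAL_RANK"])  # different in every process
    torch.cuda.set_device(local_rank)           # bind this process to its own GPU
    dist.init_process_group(backend="nccl")     # rank/world size come from the env
    return local_rank
```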
I run the experiment on a single GPU, so I added that code to set a specific GPU device.
For example: sh tools/train.sh 1 23555 0
Oh, I have used the original code with distributed training (2 GPUs) and it works. Thanks for your support.
I have a problem with a backward error. Please help me fix it. Error:
"""
/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
warnings.warn(
[2024-07-04 02:48:40,721][ INFO] {'backbone': 'resnet101', 'batch_size': 2, 'criterion': {'kwargs': {'ignore_index': 255}, 'name': 'CELoss'}, 'crop_size': 321, 'data_root': './data/VOC2012', 'dataset': 'pascal', 'dilations': [12, 24, 36], 'epochs': 80, 'lr': 0.001, 'lr_multi': 10.0, 'multi_grid': False, 'nclass': 21, 'pretrain': True, 'replace_stride_with_dilation': [False, True, True], 'thresh_init': 0.85}
[2024-07-04 02:48:42,020][ INFO] Total params: 64.2M
[2024-07-04 02:48:44,331][ INFO] ===========> Epoch: 0, LR: 0.0010, Previous best: 0.00
0%| | 0/4925 [00:00<?, ?it/s]/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/autograd/__init__.py:197: UserWarning: Error detected in CudnnBatchNormBackward0. Traceback of forward call that caused the error:
File "/root/thn/semi/CorrMatch/corrmatch.py", line 313, in <module>
main()
File "/root/thn/semi/CorrMatch/corrmatch.py", line 167, in main
res_w = model(torch.cat((img_x, img_u_w)), need_fp=True, use_corr=True)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/thn/semi/CorrMatch/model/semseg/deeplabv3plus.py", line 60, in forward
feats_decode = self._decode(torch.cat((c1, nn.Dropout2d(0.5)(c1))), torch.cat((c4, nn.Dropout2d(0.5)(c4))))
File "/root/thn/semi/CorrMatch/model/semseg/deeplabv3plus.py", line 96, in _decode
feature = self.fuse(feature)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 740, in forward
return F.batch_norm(
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/nn/functional.py", line 2450, in batch_norm
return torch.batch_norm(
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/fx/traceback.py", line 57, in format_stack
return traceback.format_stack()
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "/root/thn/semi/CorrMatch/corrmatch.py", line 313, in
main()
File "/root/thn/semi/CorrMatch/corrmatch.py", line 261, in main
loss.backward()
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
0%| | 0/4925 [00:16<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 722706) of binary: /root/thn/miniconda3/envs/semi/bin/python
Traceback (most recent call last):
File "/root/thn/miniconda3/envs/semi/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/thn/miniconda3/envs/semi/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in
main()
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
"""