Official code for "CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation"
Backward issue #16

Closed: hoagthien closed this issue 2 months ago

hoagthien commented 2 months ago

I have a problem with the backward pass. Please help me fix it. Error: """ /root/thn/miniconda3/envs/semi/lib/python3.9/site-packages/torch/distributed/ FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See for further instructions

warnings.warn( [2024-07-04 02:48:40,721][ INFO] {'backbone': 'resnet101', 'batch_size': 2, 'criterion': {'kwargs': {'ignore_index': 255}, 'name': 'CELoss'}, 'crop_size': 321, 'data_root': './data/VOC2012', 'dataset': 'pascal', 'dilations': [12, 24, 36], 'epochs': 80, 'lr': 0.001, 'lr_multi': 10.0, 'multi_grid': False, 'nclass': 21, 'pretrain': True, 'replace_stride_with_dilation': [False, True, True], 'thresh_init': 0.85}

[2024-07-04 02:48:42,020][ INFO] Total params: 64.2M

[2024-07-04 02:48:44,331][ INFO] ===========> Epoch: 0, LR: 0.0010, Previous best: 0.00
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
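
For readers unfamiliar with this class of error, here is a minimal sketch that triggers the same RuntimeError on a toy tensor. It is not taken from the CorrMatch code and is not a diagnosis of this issue; the tensor `w` and the operations are illustrative assumptions chosen only because `exp()` saves its output for the backward pass.

```python
import torch

# Toy leaf tensor; the name and shape are illustrative only.
w = torch.ones(3, requires_grad=True)

# exp() saves its output for the backward pass, so modifying that
# output in place invalidates what autograd recorded.
y = w.exp()
y.add_(1)  # in-place op bumps the version counter of y

try:
    y.sum().backward()
    failed = False
except RuntimeError as err:
    failed = True
    print(f"backward failed: {err}")

# Out-of-place variant: y + 1 creates a new tensor and leaves the
# saved output of exp() untouched, so backward succeeds.
w.grad = None
y = w.exp()
y = y + 1
y.sum().backward()
print(w.grad)
```

Calling `torch.autograd.set_detect_anomaly(True)` before the forward pass makes PyTorch print a traceback of the forward operation whose gradient could not be computed, which can help locate the offending line in a larger model.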

BBBBchan commented 2 months ago

Thanks for your attention. It looks like this issue is related to an in-place operation on tensors. Maybe it is caused by nn.Dropout2d(). However, I haven't encountered this problem during my experiments and couldn't reproduce it in my environment. Could you please share more information?

  1. Your change to the code if you have any.
  2. Your environment information, especially pytorch version.
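
One way to gather the environment details requested above is a small script like the following. This is a sketch using only the standard library; the function name `collect_env_info` and the default package list are assumptions for illustration, not part of the CorrMatch repository.

```python
import importlib.metadata
import platform
import sys


def collect_env_info(packages=("torch", "torchvision", "torchaudio")):
    """Return interpreter, OS, and package versions for a bug report."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            # Reads the installed distribution's version string.
            info[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

PyTorch also ships its own report tool, which can be run with `python -m torch.utils.collect_env` and includes CUDA and driver details.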

hoagthien commented 2 months ago

I only replaced some code in the main function to set up my GPU device. Previous code: 'os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"'

Changed code: 'os.environ["LOCAL_RANK"] = '{}'.format(args.local_rank)
torch.cuda.set_device(args.local_rank)'

My torch and cuda version: 'torch==1.13.1+cu116 torchaudio==0.13.1+cu116 torchvision==0.14.1+cu116'

I also changed the bash script like this:
'#!/bin/bash
now=$(date +"%Y%m%d_%H%M%S")

dataset='pascal'
method='corrmatch'
exp='r101'
split='732'

config=configs/${dataset}.yaml
labeled_id_path=partitions/$dataset/$split/labeled.txt
unlabeled_id_path=partitions/$dataset/$split/unlabeled.txt
save_path=exp/$dataset/$method/$exp/$split

mkdir -p $save_path

python -m torch.distributed.launch \
    --nproc_per_node=$1 \
    --master_addr=localhost \
    --master_port=$2 \
    $ \
    --config=$config --labeled-id-path $labeled_id_path --unlabeled-id-path $unlabeled_id_path \
    --local_rank $3 --save-path $save_path --port $2 2>&1 | tee $save_path/$now.log'

hoagthien commented 2 months ago

Additionally, when I set up the environment following CorrMatch's instructions, I still get this error.

BBBBchan commented 2 months ago

Hi there. I noticed that in your setup you manually specified the local rank instead of using the default DDP setting. How did you launch the training? If you use sh tools/ <gpu num> <port> <local rank>, like sh tools/ 2 23555 0, all the GPU devices would be regarded as local rank 0. This would lead to inconsistency issues.

In our default setup, we do not specify the local rank manually; we let the DDP launcher set it automatically.
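
As a rough illustration of letting the launcher decide, a script can read the rank from the `LOCAL_RANK` environment variable (which torchrun sets for each process) and fall back to a `--local_rank` argument only when the variable is absent. The helper name `resolve_local_rank` is hypothetical and this is a sketch, not the code used in CorrMatch.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Prefer the launcher-provided LOCAL_RANK; fall back to the CLI flag."""
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser()
    # Accept both spellings of the flag that launchers may pass.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    args, _ = parser.parse_known_args(argv)

    if "LOCAL_RANK" in environ:
        # Set per process by the launcher, so each process gets its own rank.
        return int(environ["LOCAL_RANK"])
    if args.local_rank is not None:
        return args.local_rank
    return 0  # single-process default


if __name__ == "__main__":
    print(f"local rank: {resolve_local_rank()}")
```

The returned value can then be passed to `torch.cuda.set_device(...)`. Because each process reads its own rank, two processes do not end up sharing the same hard-coded rank.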

hoagthien commented 2 months ago

I run the experiment on a single GPU, so I added this code to select a specific GPU device. For example: sh tools/ 1 23555 0

hoagthien commented 2 months ago

Oh, I used the original code with distributed training (2 GPUs) and it works. Thanks for your support.