Solacex / CCM

[ECCV2020] Content-Consistent Matching for Domain Adaptive Semantic Segmentation
MIT License

Training error: RuntimeError: For non-complex input tensors, argument alpha must not be a complex number. #18

Open hosea7456 opened 2 years ago

hosea7456 commented 2 years ago

Hi, thanks for your great work! When I try to train a model, I get the following error:


Traceback (most recent call last):
  File "so_run.py", line 51, in <module>
    main()
  File "so_run.py", line 43, in main
    trainer.train()
  File "/home/CCM/trainer/source_only_trainer.py", line 58, in train
    self.optim.step()
  File "/home/anaconda3/envs/torch1.9/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/anaconda3/envs/torch1.9/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/anaconda3/envs/torch1.9/lib/python3.8/site-packages/torch/optim/sgd.py", line 110, in step
    F.sgd(params_with_grad,
  File "/home/anaconda3/envs/torch1.9/lib/python3.8/site-packages/torch/optim/_functional.py", line 180, in sgd
    param.add_(d_p, alpha=-lr)
RuntimeError: For non-complex input tensors, argument alpha must not be a complex number.


How can I fix it? Thank you. The config I used for training is:


note: 'train'

# configs of data
model: 'deeplab'
train: True
multigpu: False
fixbn: True
fix_seed: True

# Optimizers
learning_rate: 7.5e-5
num_steps: 5000
epochs: 2
weight_decay: 0.0005
momentum: 0.9
power: 0.9
round: 6

# Logging
print_freq: 1
save_freq: 2000
tensorboard: False
neptune: False
screen: True
val: False
val_freq: 300

# Dataset
source: 'gta5'
target: 'cityscapes'
worker: 0
batch_size: 2

# Transforms
input_src: 720
input_tgt: 720
crop_src: 600
crop_tgt: 600
mirror: True
scale_min: 0.5
scale_max: 1.5
rec: False

# Model hypers
init_weight: './pretrained/DeepLab_resnet_pretrained_init-f81d91e8.pth'
restore_from: None

snapshot: './Data/snapshot/'
result: './miou_result/'
log: './log/'
plabel: './plabel'
gta5: {
    data_dir: '/home/data/datasets/GTA5/',
    data_list: './dataset/list/gta5_list.txt',
    input_size: [1280, 720]
}
synthia: {
    data_dir: '/home/guangrui/data/synthia/',
    data_list: './dataset/list/synthia_list.txt',
    input_size: [1280, 760]
}
cityscapes: {
    data_dir: '/home/data/datasets/Cityscapes',
    data_list: './dataset/list/cityscapes_train.txt',
    input_size: [1024, 512]
}

Solacex commented 2 years ago

Hello,

Thanks for your interest in our work! I tried to locate the problem you posted but failed. I suspect the error is caused by the newer version of PyTorch, so using pytorch==1.7.0 may help.

Hope it helps.

hosea7456 commented 2 years ago


Hi, thanks for your advice. I tried pytorch==1.7.0; the earlier error disappeared, but another one appears:

Traceback (most recent call last):
  File "so_run.py", line 51, in <module>
    main()
  File "so_run.py", line 43, in main
    trainer.train()
  File "/home/CCM/trainer/source_only_trainer.py", line 58, in train
    self.optim.step()
  File "/home/anaconda3/envs/torch1.7/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/anaconda3/envs/torch1.7/lib/python3.8/site-packages/torch/optim/sgd.py", line 112, in step
    p.add_(d_p, alpha=-group['lr'])
RuntimeError: value cannot be converted to type float without overflow: (2.10957e-06,-6.85442e-07)

I have no idea what is causing this.

Solacex commented 2 years ago

Hello, as far as I can tell, it may be because the number of training steps exceeds the max steps of the optimizer's learning-rate schedule. You can check that.
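
If that is the cause, it is easy to reproduce outside the trainer. Below is a minimal sketch, assuming lr_poly follows the standard poly-decay form used in DeepLab-style code (the repo's own definition may differ slightly):

def lr_poly(base_lr, i_iter, max_iter, power):
    # Once i_iter exceeds max_iter, the base (1 - i_iter / max_iter) turns
    # negative, and in Python 3 a negative float raised to a fractional
    # power (0.9 here) silently returns a complex number instead of raising.
    return base_lr * ((1 - i_iter / max_iter) ** power)

print(lr_poly(7.5e-5, 4000, 5000, 0.9))  # a normal float lr
print(lr_poly(7.5e-5, 6000, 5000, 0.9))  # a complex lr

A complex lr reaching optimizer.step() is consistent with both tracebacks above: torch 1.9 rejects the complex alpha outright, while torch 1.7 fails when converting the complex value (2.10957e-06,-6.85442e-07) to a float.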

Jo-wang commented 2 years ago

Same error here, and I've tried increasing num_steps in so_config.yaml, but it didn't work. Could you provide the parameters you used to train the source-only model? Thank you!

Jo-wang commented 2 years ago

Hi, I just solved this a few days ago. The error is caused by the fixed max number of steps used when adjusting the learning rate. You can check whether that fixes it for you. Cheers, zx

Hyx098130 commented 1 year ago

I also encountered this problem recently, can you elaborate on how to solve it? Thank you very much

Jo-wang commented 1 year ago


Hi there, sorry for the late reply. The issue comes from an incorrect max step when adjusting the learning rate during optimization. Here is my version:

def adjust_learning_rate(optimizer, i_iter, len_loader, args):
    # Use the true number of updates (epochs * batches per epoch) as the
    # decay horizon, so (1 - i_iter / max_iter) never goes negative.
    lr = lr_poly(args.learning_rate, i_iter, args.epochs * len_loader, args.power)
    optimizer.param_groups[0]['lr'] = lr
    if len(optimizer.param_groups) > 1:
        # A second param group (if present) runs at 10x the base lr.
        optimizer.param_groups[1]['lr'] = lr * 10
    return lr
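
For context, here is a minimal sketch of where this would be called. The loop below is illustrative rather than the repo's exact trainer code; optimizer, loader, model_forward_backward, and args are assumed names:

for epoch in range(args.epochs):
    for batch_idx, batch in enumerate(loader):
        # Global step; always strictly less than args.epochs * len(loader),
        # so the poly-decay base stays non-negative.
        i_iter = epoch * len(loader) + batch_idx
        adjust_learning_rate(optimizer, i_iter, len(loader), args)
        loss = model_forward_backward(batch)  # hypothetical forward/backward helper
        optimizer.step()
        optimizer.zero_grad()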

Hope this helps.

Zx