question about --continue training

kinfeparty commented 4 years ago

Hello, thanks for your nice work. I hit a bug when resuming training with --continue_training.

python main.py --model deeplabv3plus_mobilenet --dataset cityscapes --gpu_id 6 --lr 0.1 --crop_size 768 --batch_size 12 --output_stride 16 --data_root ./datasets/data/cityscapes --ckpt checkpoints/best_deeplabv3plus_mobilenet_cityscapes_os16.pth --continue_training

[Image attachment; content not recoverable]

Can you fix it?

VainF commented 4 years ago

Hi @kinfeparty, I added the missing map_location in the latest commit. Please try again.

kinfeparty commented 4 years ago

Hi @VainF, I updated the code but still hit the same bug.

PytaichukBohdan commented 4 years ago

Hi @VainF, I got the same issue. Do you know what it might be related to?

PytaichukBohdan commented 4 years ago

@kinfeparty @VainF Found the issue.

According to the PyTorch optimizer documentation:

"If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call."

The fix is to move the model to the GPU before loading the state dict into the optimizer:

```python
# Fragment from main.py: opts, model, optimizer, scheduler and device are defined earlier.
if opts.ckpt is not None and os.path.isfile(opts.ckpt):
    # Load on CPU first so the checkpoint does not depend on the GPU it was saved from.
    checkpoint = torch.load(opts.ckpt, map_location=torch.device('cpu'))
    model.load_state_dict(checkpoint["model_state"])

    # Move the model to the target device BEFORE restoring the optimizer state.
    model = nn.DataParallel(model)
    model.to(device)

    if opts.continue_training:
        optimizer.load_state_dict(checkpoint["optimizer_state"])
        scheduler.load_state_dict(checkpoint["scheduler_state"])
        cur_itrs = checkpoint["cur_itrs"]
        best_score = checkpoint['best_score']
        print("Training state restored from %s" % opts.ckpt)
    print("Model restored from %s" % opts.ckpt)
    del checkpoint  # free memory
else:
    print("[!] Retrain")
    model = nn.DataParallel(model)
    model.to(device)
```
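A likely reason the order matters: `Optimizer.load_state_dict` casts the restored state tensors (for example SGD momentum buffers) to the device each parameter is on at the time of the call. If the model is still on the CPU at that point and is moved to the GPU afterwards, the buffers stay behind. The sketch below round-trips a checkpoint through memory to illustrate this. The tiny `nn.Linear` model and the helper names `restore_training_state` and `optimizer_state_devices` are assumptions made for illustration and are not part of the repository; the checkpoint keys follow the snippet above.

```python
import io

import torch
import torch.nn as nn


def restore_training_state(model, optimizer, checkpoint, device):
    """Load weights, move the model to `device`, then load the optimizer state."""
    model.load_state_dict(checkpoint["model_state"])
    model.to(device)  # move first, so the restored optimizer state lands on `device`
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint["cur_itrs"], checkpoint["best_score"]


def optimizer_state_devices(optimizer):
    """Collect the device types of every tensor held in the optimizer state."""
    return {
        value.device.type
        for state in optimizer.state.values()
        for value in state.values()
        if torch.is_tensor(value)
    }


# Train a toy model for one step so that momentum buffers exist.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

# Save to an in-memory buffer instead of a .pth file.
buffer = io.BytesIO()
torch.save(
    {
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "cur_itrs": 100,
        "best_score": 0.5,
    },
    buffer,
)
buffer.seek(0)
checkpoint = torch.load(buffer, map_location=torch.device("cpu"))

# Resume into a fresh model/optimizer pair on whatever device is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
new_model = nn.Linear(4, 2)
new_optimizer = torch.optim.SGD(new_model.parameters(), lr=0.1, momentum=0.9)
cur_itrs, best_score = restore_training_state(new_model, new_optimizer, checkpoint, device)
```

After the call, `optimizer_state_devices(new_optimizer)` should contain only the device type of the model, which is a quick check to run before the first training step.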

VainF commented 4 years ago

@PytaichukBohdan thanks!

YLiu-creator commented 4 years ago

When continuing training, ASPPPooling raises this error:

Original Traceback (most recent call last):
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/utils.py", line 16, in forward
    x = self.classifier(features)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/_deeplab.py", line 84, in forward
    low_output_feature= self.aspp(low_level_beforeFPM)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/_deeplab.py", line 265, in forward
    res.append(conv(x))
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/_deeplab.py", line 233, in forward
    x = super(ASPPPooling, self).forward(x)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/functional.py", line 1652, in batch_norm
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 512, 1, 1])

ASPPPooling works when retraining (training from scratch).
I don't know how to debug this; please help.

YLiu-creator commented 4 years ago

I know the "1" is caused by AdaptiveAvgPool2d, but why does the error occur only when continuing training?
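The error itself can be reproduced in isolation, which may help narrow things down. The sketch below assumes a pooling branch built from `AdaptiveAvgPool2d(1)`, a 1x1 convolution, `BatchNorm2d` and `ReLU`; that layer stack is a guess based on the traceback, not a copy of the poster's `_deeplab.py`, and `try_forward` is a hypothetical helper. In training mode, batch norm needs more than one value per channel, so a batch of a single sample pooled to 1x1 fails, while a batch of two samples or eval mode works. Note that with `nn.DataParallel` the batch is split across GPUs, so the size that matters is the per-replica chunk.

```python
import torch
import torch.nn as nn

# Assumed stand-in for the ASPP pooling branch (not the poster's exact code).
pooling = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),            # every sample is reduced to C x 1 x 1
    nn.Conv2d(512, 512, 1, bias=False),
    nn.BatchNorm2d(512),
    nn.ReLU(inplace=True),
)


def try_forward(module, batch_size):
    """Return the output shape, or the error message if the forward pass fails."""
    x = torch.randn(batch_size, 512, 8, 8)
    try:
        return tuple(module(x).shape)
    except ValueError as exc:
        return str(exc)


pooling.train()
single_train = try_forward(pooling, 1)  # one sample: batch norm has 1 value per channel
pair_train = try_forward(pooling, 2)    # two samples: enough values for batch statistics

pooling.eval()
single_eval = try_forward(pooling, 1)   # eval mode uses running statistics instead

print(single_train)
print(pair_train)
print(single_eval)
```

If a run ends up feeding a single-sample chunk to a replica (for example a leftover final batch), checking the batch size and the number of GPUs used in the resumed run against the original run is a reasonable first step; this is a general debugging suggestion, not a confirmed diagnosis for this thread.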

longphamkhac commented 3 years ago

How can I get my output segmentation image to look like the second image? Thank you very much.

VainF / DeepLabV3Plus-Pytorch

question about --continue training #8