kinfeparty opened this issue 4 years ago
Hi @kinfeparty, I added the missing map_location in the latest commit. Please try again.
Hi @VainF, I modified the code but hit the same bug.
Hi @VainF, I got the same issue. Do you know what it could be related to?
@kinfeparty @VainF Found the issue.
According to the PyTorch optimizer documentation:

> If you need to move a model to GPU via `.cuda()`, please do so before constructing optimizers for it. Parameters of a model after `.cuda()` will be different objects with those before the call.
It is fixed by moving the model to CUDA before loading the state dict into the optimizer:
```python
if opts.ckpt is not None and os.path.isfile(opts.ckpt):
    # Load the checkpoint onto the CPU first.
    checkpoint = torch.load(opts.ckpt, map_location=torch.device('cpu'))
    # checkpoint = torch.load(opts.ckpt)  # old call, without map_location
    model.load_state_dict(checkpoint["model_state"])
    # Move the model to the GPU *before* restoring the optimizer state,
    # as required by the documentation quoted above.
    model = nn.DataParallel(model)
    model.to(device)
    if opts.continue_training:
        optimizer.load_state_dict(checkpoint["optimizer_state"])
        scheduler.load_state_dict(checkpoint["scheduler_state"])
        cur_itrs = checkpoint["cur_itrs"]
        best_score = checkpoint['best_score']
        print("Training state restored from %s" % opts.ckpt)
    print("Model restored from %s" % opts.ckpt)
    del checkpoint  # free memory
else:
    print("[!] Retrain")
    model = nn.DataParallel(model)
    model.to(device)
```
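As a side note, if the optimizer state has for some reason already been loaded while the model was still on the CPU, a common workaround is to move the optimizer's state tensors to the device afterwards. This is a sketch of a generic pattern, not code from this repo:

```python
# Move all optimizer state tensors (e.g. SGD momentum buffers) to the
# same device as the model parameters. `device` is assumed to be
# defined as in the snippet above.
for state in optimizer.state.values():
    for k, v in state.items():
        if torch.is_tensor(v):
            state[k] = v.to(device)
```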
@PytaichukBohdan thanks!
When continuing training, ASPPPooling hits this error:

```
Original Traceback (most recent call last):
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/utils.py", line 16, in forward
    x = self.classifier(features)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/_deeplab.py", line 84, in forward
    low_output_feature = self.aspp(low_level_beforeFPM)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/_deeplab.py", line 265, in forward
    res.append(conv(x))
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/Projects/CloudDetection/cloudNet_4channel/network/_deeplab.py", line 233, in forward
    x = super(ASPPPooling, self).forward(x)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/home/GFXX/anaconda3/envs/gfx/lib/python3.7/site-packages/torch/nn/functional.py", line 1652, in batch_norm
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 512, 1, 1])
```
ASPPPooling works fine when retraining from scratch. I don't know how to debug this, please give some help.
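For reference, the failure is easy to reproduce in isolation (a minimal sketch, independent of this repo): a `BatchNorm2d` layer in training mode rejects an input with only one value per channel, which is exactly the `[1, 512, 1, 1]` tensor that ASPPPooling's `AdaptiveAvgPool2d` emits for a batch of one.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(512)
bn.train()  # in eval mode this would succeed

# Shape produced by ASPPPooling's AdaptiveAvgPool2d for batch size 1
x = torch.randn(1, 512, 1, 1)

bn(x)  # ValueError: Expected more than 1 value per channel when training
```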
I know the size of 1 is caused by AdaptiveAvgPool2d, but why is the error only raised when continuing training?
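One likely explanation (my guess, not confirmed in this thread): when training resumes, a single sample can reach one of the replicas, e.g. from a trailing incomplete batch or an uneven `DataParallel` split, and BatchNorm cannot normalize one value per channel in training mode. A common mitigation is to drop the last incomplete batch; a sketch with hypothetical names (`train_dst`, `opts`):

```python
from torch.utils.data import DataLoader

# drop_last=True discards a final batch of size < batch_size, so a
# lone sample never reaches BatchNorm layers in training mode.
train_loader = DataLoader(train_dst,
                          batch_size=opts.batch_size,
                          shuffle=True,
                          drop_last=True)
```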
How can I make my output segmentation image look like the second image? Thank you very much.
Hello, thanks for your nice work. I met a bug with `--continue_training`.
```bash
python main.py --model deeplabv3plus_mobilenet --dataset cityscapes --gpu_id 6 --lr 0.1 --crop_size 768 --batch_size 12 --output_stride 16 --data_root ./datasets/data/cityscapes --ckpt checkpoints/best_deeplabv3plus_mobilenet_cityscapes_os16.pth --continue_training
```
Can you fix it?