canktech / xview2unet

Modified Unet (paired image input and more stages) for natural disaster building segmentation and damage classification for the DIU xView2 2019 competition

'loss=nan' problem #1

Open rsbbb95 opened 4 years ago

rsbbb95 commented 4 years ago

Thanks for your code! I'm new to deep learning. I've finished the localization part, but when I trained traindamageunet.py the loss became nan after 50 epochs. I set num_workers=8 and batch_size=8, set use_amp=False (because I haven't installed Apex), and set distributed_backend='dp', running on two RTX 2080s. Did you run into this? I used the same trainer settings as for trainloc, but the loss=nan problem only appeared when training damage segmentation.
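For reference, this is roughly the setup described above (a sketch only, not my exact script; model stands for the LightningModule from traindamageunet.py):

import pytorch_lightning as pl

# two RTX 2080s, DataParallel, no apex mixed precision
# batch_size=8 and num_workers=8 are set in the DataLoader (presumably inside the module)
trainer = pl.Trainer(
    gpus=2,
    distributed_backend='dp',
    use_amp=False,
)
trainer.fit(model)  # model: the LightningModule built in traindamageunet.py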

canktech commented 4 years ago

Thank you for your great feedback! I have only tested with apex. I installed the Python-only version of apex from https://github.com/NVIDIA/apex . Yes, occasionally the loss may turn to nan. Were you able to start the next epoch, or did it crash?

A side benefit of using NVIDIA apex is that the dynamic loss scaling feature for fp16 automatically skips backprop for a batch if the gradient or loss overflows. To use apex and multiple GPUs with pytorch-lightning, you may try ddp + amp.

Alternatively:

Loss going to nan is usually preceded by gradients becoming large. You can set the gradient_clip_val parameter in the trainer. See https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#gradient-clipping
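A minimal sketch of both options, using the trainer arguments mentioned above (the gpu count and clip value here are just examples):

import pytorch_lightning as pl

# Option 1: apex fp16 + ddp (dynamic loss scaling skips overflowing batches)
trainer = pl.Trainer(gpus=2, distributed_backend='ddp', use_amp=True)

# Option 2: clip gradients so a single bad batch cannot push the loss to nan
trainer = pl.Trainer(gpus=2, distributed_backend='dp', gradient_clip_val=0.5)

trainer.fit(model)  # model = the LightningModule from traindamageunet.py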

rsbbb95 commented 4 years ago

I checked tensorboard; here it is: (screenshot from 2020-01-05 20:32:51). Actually the validation loss became nan after epoch 34 and the checkpoint was saved at epoch 35, but by that time I had trained the model for about 50 epochs.

I will install the Python-only apex and set gradient_clip_val to try to fix it.

Thanks for your reply!

rsbbb95 commented 4 years ago

There is another problem: after installing apex I ran the code with use_amp=True and distributed_backend='ddp', and got this error:

Traceback (most recent call last):
  File "traindamageunet.py", line 206, in <module>
    trainer.fit(model)
  File "/home/.conda/envs/2unet/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/.conda/envs/2unet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/.conda/envs/2unet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/.conda/envs/2unet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/.conda/envs/2unet/lib/python3.7/site-packages/pytorch_lightning/trainer/ddp_mixin.py", line 150, in ddp_train
    self.optimizers, self.lr_schedulers = self.init_optimizers(model.configure_optimizers())
  File "/home/xview2unet/traindamageunet.py", line 174, in configure_optimizers
    return torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
NameError: name 'model' is not defined

Looks like a weird error; I didn't modify any other code. Could you fix it? I can't figure it out.

canktech commented 4 years ago

OK, try replacing line 174 in traindamageunet.py with torch.optim.SGD(self.parameters(), lr=0.01, momentum=0.9). If that doesn't work, try self.xviewmodel.parameters().
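In other words, configure_optimizers should reference the module itself rather than a global model variable, roughly like this (a sketch; the class name is just illustrative, not the repo's actual class name):

import torch
import pytorch_lightning as pl

class DamageUnetModule(pl.LightningModule):  # illustrative name
    ...
    def configure_optimizers(self):
        # use the module's own parameters instead of the undefined global `model`
        return torch.optim.SGD(self.parameters(), lr=0.01, momentum=0.9)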

Also, the validation loss going to nan is due to encountering a validation batch of images without any buildings in any of the images. Training should still continue, and you should not hit that too often if you increase the validation batch size. I forgot to include the script I used to filter out empty images with no buildings.

Thanks again for your feedback.

canktech commented 4 years ago

The code has been updated to filter out empty images from the training set (empty images are around 40% of the original training set) when running python preprocess.py. As a result, training should be quicker and more stable.
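Conceptually the filter is simple: drop any tile whose mask contains no building pixels, roughly like this (a sketch, not the exact code in preprocess.py; mask_paths is a placeholder for the list of mask files):

import numpy as np
from PIL import Image

def has_buildings(mask_path):
    # keep a tile only if its mask has at least one non-background pixel
    mask = np.array(Image.open(mask_path))
    return (mask > 0).any()

kept = [p for p in mask_paths if has_buildings(p)]  # mask_paths: placeholder list of mask files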

rsbbb95 commented 4 years ago

Thanks for updating the code! I've run it successfully. I have another question: as far as I know, the last layer of a segmentation network outputs n channels (one per class), but your model outputs 4 channels for localization (binary classification) and 8 channels for damage classification (background, un-classified, no-damage, minor-damage, major-damage, destroyed). If we subtract 3 from 8 and 4 respectively (maybe for RGB?), we get the number of classes we want without the background. Would you mind explaining why you did this? Is there some trick to it?

canktech commented 4 years ago

Great! The competition damage task is to predict one of 5 classes per pixel, and the localisation task is to predict one of 2 classes per pixel. The extra channels in the last layer are there so that you can predict your own auxiliary task (e.g. predicting the edges of buildings, the distance to the center of the building instance, or maybe even the type of disaster damage: flooding, fire, tornado, hurricane, wildfire). The auxiliary task can be used for postprocessing the original task, or for weighting the loss to learn instances better. The current code does not handle very large buildings well, because it sometimes predicts different classes for different sections of the same building (see results/vizpredictions). I did not have time to reliably train auxiliary tasks, but you can try to improve the network by doing this.
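As a rough illustration of how the spare channels could be used (a sketch only; the 5-plus-3 channel split, the edge target, and the loss weighting are assumptions, not the code in this repo):

import torch
import torch.nn.functional as F

def damage_loss_with_aux(logits, damage_target, edge_target):
    # logits: (N, 8, H, W) from the last layer of the damage model
    damage_logits = logits[:, :5]   # assumed: first 5 channels = competition damage classes
    aux_logits = logits[:, 5:]      # assumed: remaining channels = auxiliary tasks
    main_loss = F.cross_entropy(damage_logits, damage_target)  # damage_target: (N, H, W) long
    # e.g. one auxiliary channel predicts building edges
    edge_loss = F.binary_cross_entropy_with_logits(aux_logits[:, 0], edge_target)  # edge_target: (N, H, W) float
    return main_loss + 0.1 * edge_loss  # auxiliary weight chosen arbitrarily here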