Open rsbbb95 opened 4 years ago
Thank you for your great feedback! I have only tested using apex. I installed the python-only version of apex from here https://github.com/NVIDIA/apex . Yes, occasionally the loss may turn to NaN. Were you able to start the next epoch, or did it crash?
A side benefit of using NVIDIA apex is that its dynamic loss scaling feature for fp16 automatically skips backprop for a batch if the gradient or loss overflows. To use apex and multiple GPUs with pytorch-lightning, you may try ddp + amp.
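A minimal sketch of those trainer flags, using the 0.5.x-era pytorch-lightning API that this thread is based on (argument names changed in later releases, e.g. `use_amp` became `precision`):

```python
# Sketch only: assumes apex is installed and 2 GPUs are available.
from pytorch_lightning import Trainer

trainer = Trainer(
    gpus=2,                     # e.g. the two RTX 2080s mentioned below
    distributed_backend='ddp',  # one process per GPU
    use_amp=True,               # apex fp16 with dynamic loss scaling
)
# trainer.fit(model)
```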
Alternatively:
Loss going to nan is usually preceded by gradients becoming large. You can set the gradient_clip_val parameter in the trainer. See https://williamfalcon.github.io/pytorch-lightning/Trainer/Training%20Loop/#gradient-clipping
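What `gradient_clip_val` does under the hood is rescale gradients whose global norm exceeds the threshold, via `torch.nn.utils.clip_grad_norm_`. A small illustration (the function name `clip_like_lightning` is just for this sketch):

```python
import torch

def clip_like_lightning(parameters, clip_val=0.5):
    # pytorch-lightning applies the trainer's gradient_clip_val
    # through torch.nn.utils.clip_grad_norm_
    torch.nn.utils.clip_grad_norm_(parameters, clip_val)

w = torch.nn.Parameter(torch.ones(4))
w.grad = torch.full((4,), 100.0)      # deliberately huge gradient (norm 200)
clip_like_lightning([w], clip_val=0.5)
print(w.grad.norm().item())           # rescaled down to the clip value
```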
I checked TensorBoard, here it is: the validation loss actually became NaN after epoch 34, and the checkpoint was saved as epoch 35, but by that time I had trained the model for about 50 epochs.
I will install python-only apex and set gradient_clip_val to try to fix it.
Thanks for your reply!
There is another problem: when I installed apex and ran the code with use_amp = True and distributed_backend='ddp', I got this error:
Traceback (most recent call last):
File "traindamageunet.py", line 206, in <module>
trainer.fit(model)
File "/home/.conda/envs/2unet/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
File "/home/.conda/envs/2unet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/.conda/envs/2unet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/.conda/envs/2unet/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/.conda/envs/2unet/lib/python3.7/site-packages/pytorch_lightning/trainer/ddp_mixin.py", line 150, in ddp_train
self.optimizers, self.lr_schedulers = self.init_optimizers(model.configure_optimizers())
File "/home/xview2unet/traindamageunet.py", line 174, in configure_optimizers
return torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
NameError: name 'model' is not defined
Looks like a weird error; I didn't modify any other code. Could you fix it? I can't handle it.
Ok, try replacing line 174 in traindamageunet.py with torch.optim.SGD(self.parameters(), lr=0.01, momentum=0.9). If that doesn't work, try self.xviewmodel.parameters().
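The underlying issue is that with ddp, mp.spawn runs configure_optimizers in child processes where the global name `model` does not exist, so the method must reference `self`. A hedged sketch of the fix (the class here is a stand-in for the repo's LightningModule, not its actual code):

```python
import torch
import torch.nn as nn

class DamageUnet(nn.Module):  # stand-in for the actual LightningModule
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 2)

    def configure_optimizers(self):
        # was: torch.optim.SGD(model.parameters(), ...)  -> NameError in
        # spawned ddp workers; use self instead of the global `model`
        return torch.optim.SGD(self.parameters(), lr=0.01, momentum=0.9)

opt = DamageUnet().configure_optimizers()
print(opt.defaults['lr'], opt.defaults['momentum'])
```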
Also, the validation loss error is due to encountering a batch of images without any buildings in any of the images. Training should still continue, though, and you should not encounter it too often if you increase the validation batch size. I forgot to include the script I used to filter out empty images with no buildings.
Thanks again for your feedback.
The code has been updated to filter out empty images from the training set (empty images are around 40% of the original training set) when running python preprocess.py. As a result, training should be quicker and more stable.
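The idea behind that filter can be sketched as follows; this is a reconstruction of the concept, not the repo's actual preprocess.py, and the mask encoding (0 = background, nonzero = building) is an assumption:

```python
import numpy as np

def has_buildings(mask: np.ndarray) -> bool:
    # assumption: 0 = background, any nonzero label = building pixel
    return bool((mask > 0).any())

# toy stand-ins for loaded ground-truth masks
masks = {
    'img_a.png': np.zeros((4, 4), dtype=np.uint8),  # empty -> dropped
    'img_b.png': np.eye(4, dtype=np.uint8),         # has buildings -> kept
}
kept = [name for name, m in masks.items() if has_buildings(m)]
print(kept)
```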
Thanks for updating the code! I've run it successfully. I have another question: as far as I know, the last layer of a segmentation network outputs n channels (the same as n classes), but your model outputs 4 channels for localization (binary classification) and 8 channels for damage classification (background, un-classified, no-damage, minor-damage, major-damage, destroyed). If we subtract 3 (maybe RGB?) from 8 and 4 respectively, we get the number of classes we want without background. Would you mind explaining why you do this? Is there some trick to it?
Great! The competition damage task is to predict one out of 5 classes per pixel, and the localisation competition task is to predict one out of 2 classes per pixel. The extra channels in the last layer are there so that you can predict your own auxiliary task (e.g. predicting edges of buildings, distance to the center of the building instance, maybe even type of disaster damage: flooding, fire, tornado, hurricane, wildfire). The auxiliary task can be used for postprocessing the original task or for weighting the loss to learn instances better. The current code does not handle very large buildings well because it sometimes predicts different classes for different sections of the same building (if you look in results/vizpredictions). I did not have time to reliably train auxiliary tasks, but you can try and improve the network by doing this.
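An illustrative split of the damage head's output along those lines: the first 5 channels cover the competition classes and the remaining 3 are free for auxiliary tasks. The channel counts come from the discussion above, but the slicing itself is a sketch, not the repo's exact code:

```python
import torch

logits = torch.randn(2, 8, 16, 16)  # (batch, channels, H, W) from the damage head
damage_logits = logits[:, :5]       # background + 4 damage classes
aux_logits = logits[:, 5:]          # e.g. building edges, distance-to-center
print(damage_logits.shape, aux_logits.shape)
```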
Thanks for your code! I'm a newbie in deep learning. I've finished the localization part, but when I trained traindamageunet.py, after 50 epochs the loss became NaN. I set num_workers=8, batch_size=8, use_amp=False (because I haven't installed apex), and distributed_backend='dp'. I used 2 RTX 2080s. Did you have this situation? I used the same trainer settings as for trainloc, but the 'loss=nan' problem only came up when training damage segmentation.