CSAILVision / semantic-segmentation-pytorch

Pytorch implementation for Semantic Segmentation/Scene Parsing on MIT ADE20K dataset
http://sceneparsing.csail.mit.edu/
BSD 3-Clause "New" or "Revised" License

Loss becomes NaN - HRNetV2+C1 #212

Closed: AndrejHafner closed this issue 4 years ago

AndrejHafner commented 4 years ago

Hello!

I have a dataset of images with segmentation masks for two classes. I have verified that the masks contain the correct label values (1 and 2). When training with HRNetV2 as the encoder and C1 as the decoder, the loss becomes NaN at the end of the first epoch. It never recovers afterwards, and the predictions are unusable. I have tried reducing the learning rate to 1e-7, but I still get the same problem. I had this problem with other encoder-decoder combinations as well, but there it usually started much later, and lowering the learning rate essentially solved it (with resnet101 + upernet I get an mIoU of 0.92).
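A minimal sketch of such a label check plus a NaN guard for the training loop (the mask folder and function names here are placeholders, not the project's actual code):

# Sketch: verify that masks only contain the expected class values, and
# fail fast when the loss turns non-finite. Paths and names are placeholders.
import glob
import numpy as np
import torch
from PIL import Image

for path in glob.glob("./data/annotations/*.png"):   # hypothetical mask folder
    values = set(np.unique(np.array(Image.open(path))).tolist())
    # 0 is typically the unlabeled/ignore value in this codebase's annotations
    assert values <= {0, 1, 2}, f"unexpected label values {values} in {path}"

def check_finite(loss: torch.Tensor, iteration: int) -> None:
    # Call once per iteration inside the training loop.
    if not torch.isfinite(loss):
        raise RuntimeError(f"loss became non-finite at iteration {iteration}")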

Here is my config:

DATASET:
  root_dataset: ""
  list_train: "./data/sclera_train.odgt"
  list_val: "./data/sclera_validation.odgt"
  num_class: 2
  imgSizes: (300, 375, 450)
  imgMaxSize: 512
  padding_constant: 32
  segm_downsampling_rate: 4
  random_flip: False

MODEL:
  arch_encoder: "hrnetv2"
  arch_decoder: "c1"
  fc_dim: 720

TRAIN:
  batch_size_per_gpu: 4
  num_epoch: 20
  start_epoch: 0
  epoch_iters: 5000
  optim: "SGD"
  lr_encoder: 0.000001
  lr_decoder: 0.000001
  lr_pow: 0.9
  beta1: 0.9
  weight_decay: 1e-4
  deep_sup_scale: 0.4
  fix_bn: False
  workers: 16
  disp_iter: 20
  seed: 304

VAL:
  visualize: False
  checkpoint: "epoch_10.pth"

TEST:
  checkpoint: "epoch_10.pth"
  result: "./result-hrnetv2/"

DIR: "ckpt/sclera-hrnetv2-c1"
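(Side note: lr_encoder and lr_decoder above are only base rates; the training script decays them every iteration with a poly schedule controlled by lr_pow. A rough sketch of that schedule, for illustration rather than the repository's exact code:)

# Rough sketch of a poly learning-rate schedule of the kind lr_pow controls
# (illustration only; the repository's implementation may differ in detail).
def poly_lr(base_lr: float, cur_iter: int, max_iters: int, lr_pow: float = 0.9) -> float:
    return base_lr * (1.0 - cur_iter / max_iters) ** lr_pow

# With num_epoch=20 and epoch_iters=5000, max_iters would be 100000:
# poly_lr(1e-6, 0, 100000)     -> 1.0e-6
# poly_lr(1e-6, 50000, 100000) -> ~5.4e-7  (roughly halved by mid-training)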

Did anyone else face this problem with any of the encoder-decoder combinations?

Thank you!

EDIT: I'm having the same problem with the resnet101dilated + ppm_deepsup combination, only that it starts later.

JulianJuaner commented 4 years ago

I have the same problem! I tried HRNet-W18 + C1 according to the settings in https://github.com/HRNet/HRNet-Semantic-Segmentation/blob/master/experiments/cityscapes/seg_hrnet_w18_small_v2_512x1024_sgd_lr1e-2_wd5e-4_bs_12_epoch484.yaml. The problem went away when I applied this lighter model setting, but it still does not work with the original setting.

JulianJuaner commented 4 years ago

I think I solved this problem by uncommenting line 106 in ./models/models.py, which enables net_encoder.apply(ModelBuilder.weights_init). Since I am training the model from scratch, I didn't enable the pretrained setting.
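For context, a weights_init hook like this typically applies Kaiming initialization to convolution weights and resets batch-norm parameters; a sketch of the usual pattern (it may differ from the exact code in models.py):

# Sketch of a typical weights_init hook (may differ from the repository's
# exact implementation in ./models/models.py).
import torch.nn as nn

def weights_init(m: nn.Module) -> None:
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        nn.init.kaiming_normal_(m.weight.data)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.fill_(1.0)
        m.bias.data.fill_(1e-4)

# nn.Module.apply() walks every submodule, so this initializes the whole encoder:
# net_encoder.apply(weights_init)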

AndrejHafner commented 4 years ago

I later solved the problem by lowering the learning rate to 0.00002 and training on a GPU with more VRAM, which let me increase the batch size. A larger batch size makes training more stable. I had these problems on an Nvidia GTX 980 Ti with 6 GB of VRAM, but they didn't appear on an Nvidia P100 with 16 GB of VRAM.
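For completeness: if a bigger GPU is not available, gradient accumulation is a common way to raise the effective batch size on limited VRAM; it was not used in this thread. A minimal sketch with tiny stand-in modules so it runs on its own (in practice these would be the segmentation module, its loss, and the real data loader):

# Sketch of gradient accumulation (not used in this thread): a larger effective
# batch size without more VRAM. The model, loss and loader below are stand-ins.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Conv2d(3, 2, kernel_size=1)
criterion = nn.CrossEntropyLoss(ignore_index=-1)
optimizer = torch.optim.SGD(model.parameters(), lr=2e-5)
loader = DataLoader(TensorDataset(torch.randn(16, 3, 32, 32),
                                  torch.randint(0, 2, (16, 32, 32))),
                    batch_size=4)

accum_steps = 4  # effective batch = per-step batch size * accum_steps
optimizer.zero_grad()
for i, (images, labels) in enumerate(loader):
    loss = criterion(model(images), labels) / accum_steps
    loss.backward()                       # gradients accumulate across iterations
    if (i + 1) % accum_steps == 0:
        optimizer.step()                  # one update per accumulated batch
        optimizer.zero_grad()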