Haochen-Wang409 / U2PL

[CVPR'22] Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels
Apache License 2.0
425 stars 59 forks

Hi, I met some problems when reproducing the results using the PASCAL VOC dataset #15

Closed YanFangCS closed 2 years ago

YanFangCS commented 2 years ago

Hi, I tried to reproduce the results reported in your paper but could not reach them. Because of limited computational resources, I used a batch size of 8 and a learning rate of 0.0005, which are half of the values in your paper.

(Screenshot of the results table reported in the paper, 2022-03-24 10:48 AM)

When trying to reproduce the "full" method with the hyperparameters mentioned above, I only reach 77.12 (averaged over 3 runs). Could you give me some advice on reproducing your method? Thanks. The config file I used in the reproduction experiments is as follows. Besides, the annotation files were generated as described in your repo.

dataset: # Required.
  type: pascal_semi
  train:
    data_root: xxx/VOCdevkit/VOC2012
    data_list: xxx/U2PL/data/splits/pascal/1464/labeled.txt
    flip: True
    GaussianBlur: False
    rand_resize: [0.5, 2.0]
    #rand_rotation: [-10.0, 10.0]
    crop:
      type: rand
      size: [513, 513] # crop image with HxW size
  val:
    data_root: xxx/VOCdevkit/VOC2012
    data_list: xxx/U2PL/data/splits/pascal/val.txt
    crop:
      type: center
      size: [513, 513] # crop image with HxW size
  batch_size: 2  # per GPU; 4 GPUs -> a total batch size of 8
  n_sup: 1464
  noise_std: 0.1
  workers: 2
  mean: [123.675, 116.28, 103.53]
  std: [58.395, 57.12, 57.375]
  ignore_label: 255

trainer: # Required.
  epochs: 80
  eval_on: True
  optimizer:
    type: SGD
    kwargs:
      lr: 0.0005  # 4GPUs
      momentum: 0.9
      weight_decay: 0.0001
  lr_scheduler:
    mode: poly
    kwargs:
      power: 0.9
  unsupervised:
    TTA: False
    drop_percent: 80
    apply_aug: cutmix
  contrastive:
    negative_high_entropy: True
    low_rank: 3
    high_rank: 20
    current_class_threshold: 0.3
    current_class_negative_threshold: 1
    unsupervised_entropy_ignore: 80
    low_entropy_threshold: 20
    num_negatives: 50
    num_queries: 256
    temperature: 0.5

saver:
  snapshot_dir: checkpoints
  pretrain: ''

criterion:
  type: CELoss
  kwargs:
    use_weight: False

net: # Required.
  num_classes: 21
  sync_bn: True
  ema_decay: 0.99
  encoder:
    type: u2pl.models.resnet.resnet101
    kwargs:
      multi_grid: True
      zero_init_residual: True
      fpn: True
      replace_stride_with_dilation: [False, True, True]  #layer0...1 is fixed, layer2...4
  decoder:
    type: u2pl.models.decoder.dec_deeplabv3_plus
    kwargs:
      inner_planes: 256
      dilations: [12, 24, 36]
Haochen-Wang409 commented 2 years ago

Hi, this really confuses me... We have trained several times on classic PASCAL VOC 2012 and the results were very stable.

Could you provide your training log? Or, could you use 8 GPUs with batch_size=2 and change the lr to 0.001?

I'm not sure why you got unexpected results :(
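For reference, the suggestion above follows the usual linear scaling between total batch size and learning rate: the paper setting is a total batch size of 16 with lr=0.001, and halving the batch to 8 gives the lr=0.0005 used in the config above. A minimal sketch of that relation, assuming those base values; scaled_lr is just an illustrative helper, not part of the U2PL codebase:

# Illustrative only: linear scaling of the learning rate with total batch size.
# The base values are the paper setting quoted in this thread (16 / 0.001).
def scaled_lr(base_lr: float, base_batch: int, total_batch: int) -> float:
    """Scale the learning rate linearly with the total (cross-GPU) batch size."""
    return base_lr * total_batch / base_batch

print(scaled_lr(0.001, 16, 8))   # 0.0005 for 4 GPUs x batch_size=2
print(scaled_lr(0.001, 16, 16))  # 0.001 for 8 GPUs x batch_size=2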

YanFangCS commented 2 years ago

The training log I got is as follows; in this experiment I used the config file mentioned above. seg_20220318_163031.txt Moreover, because of limited computational resources I can't get 8 GPUs for these reproduction experiments; for now I can only use 4x 3090 cards. Thanks for your help.

Haochen-Wang409 commented 2 years ago

The best performance is at epoch 10, which is quite weird... Did you change the random seed for the 3 different runs?

YanFangCS commented 2 years ago

Emm, I haven't changed the random seed; all experiments use the default seed 2. The only parameters I changed are the batch size, learning rate, training port, and dataset directory; nothing else was changed.

Haochen-Wang409 commented 2 years ago

Hi @YanFangCS, we have retrained our model on 4 V100s with batch_size=16 and lr=0.001; here is our training log. I am not sure why you cannot reproduce the results... :(

Maybe batch_size=16 and lr=0.001 are two important parameters?

YanFangCS commented 2 years ago

Thanks for your help, I will try a few more times to reproduce it. And thanks for your work introducing this new perspective on unreliable pseudo-labels.

YanFangCS commented 2 years ago

By the way, I am wondering why the total number of training iterations is 45600 when epochs is set to 80. That means there are 570 iterations per epoch, which I can't match to a supervised dataset of 1464 images, an unsupervised dataset of 9118 images, and a batch size of 4 per GPU (16 in total). It's quite weird.

YanFangCS commented 2 years ago

Emm, I see: you count epochs according to the unsupervised dataset size. This is the same calculation that AEL uses.

Haochen-Wang409 commented 2 years ago

Yes, an epoch is defined as the number of iterations it takes for the model to be trained on all unsupervised images.
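Under that definition the numbers in the log line up. A minimal sketch of the arithmetic, assuming 9118 unlabeled images (10582 - 1464) and a total batch size of 16:

import math

n_unsup = 9118        # unlabeled images in the 1464 split (10582 - 1464)
total_batch = 16      # e.g. 4 GPUs x 4 per GPU, or 8 GPUs x 2 per GPU
epochs = 80

iters_per_epoch = math.ceil(n_unsup / total_batch)  # 570
total_iters = iters_per_epoch * epochs              # 45600
print(iters_per_epoch, total_iters)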

YanFangCS commented 2 years ago

I have reproduced the result the paper reports. I solved the problem by using batch size 16 and lr 0.001 with torch.cuda.amp, which is similar to apex. With amp it consumes about 15 GB of CUDA memory per RTX 3090, which is affordable. So I think the batch size and lr are essential for reproducing this paper; why the model can't be trained successfully with half the batch size and lr still confuses me a lot. Thanks for your help.
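For anyone hitting the same memory limit, mixed precision with torch.cuda.amp can be wrapped around an ordinary training step roughly as below. This is only a sketch, not the actual U2PL training loop; model, loader, optimizer, loss_fn, and scaler are placeholders for the corresponding objects in the training script:

import torch

def train_one_epoch(model, loader, optimizer, loss_fn, scaler):
    # scaler should be a torch.cuda.amp.GradScaler created once before training
    model.train()
    for images, labels in loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():      # forward pass in mixed precision
            preds = model(images)
            loss = loss_fn(preds, labels)
        scaler.scale(loss).backward()        # scale the loss to avoid fp16 underflow
        scaler.step(optimizer)               # unscales grads, then optimizer step
        scaler.update()                      # adjust the loss scale for the next step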