PingoLH / FCHarDNet

Fully Convolutional HarDNet for Segmentation in Pytorch
MIT License

Training protocol (specification for training settings) #15

Open yswang1717 opened 4 years ago

yswang1717 commented 4 years ago

Hi, thanks for the great contribution to the segmentation problem. I'm trying to reproduce the results of the uploaded 'Cityscapes pretrained weights' (77.7% mIoU on the validation set) in my own environment, but the same settings as in the 'hardnet.yml' file reach only 76.8%, not 77.7%.

Could you share more details about the training?

For example:

  1. Could you share the number of GPUs, initial learning rate, weight decay, iterations, batch size, and crop size you used to reach 77.7%?

  2. Unfortunately, I only have 4 GPUs; if you have results for a 4-GPU setting, could you share them?

  3. Did you use the GPU parallelism you described in https://github.com/PingoLH/FCHarDNet/issues/8? If so, could you tell me how to reproduce the 77.7% result (with the same code as in that issue)?
PingoLH commented 4 years ago

Hi, here is our environment setting:

  1. GPU: 1x V100 32GB (a similar result can also be obtained on a two-GPU setup with 2x V100 16GB)
  2. PyTorch 1.0.1 with CUDA 9.2

Hyperparameters (all the same as the settings in hardnet.yml):

  - lr: 0.02
  - weight_decay: 0.0005
  - train_iters: 90000
  - batch_size: 16
  - crop size: 1024x1024
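For reference, here is a minimal sketch (not the repository's actual training code) of how these settings map onto a standard PyTorch SGD setup; the placeholder model and the momentum value are assumptions, and only the lr, weight decay, iteration, batch size, and crop size values come from the settings above.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for FC-HarDNet; only the hyperparameter
# values below are taken from this thread / hardnet.yml.
model = nn.Conv2d(3, 19, kernel_size=3, padding=1)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.02,              # initial learning rate
    momentum=0.9,         # assumed; not stated in this thread
    weight_decay=0.0005,  # weight decay
)

train_iters = 90000       # total training iterations
batch_size = 16           # total batch size (across all GPUs)
crop_size = (1024, 1024)  # random crop size used during training
```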
yswang1717 commented 4 years ago

Thank you for your quick response!

  1. With the above settings, what mIoU does FC-HarDNet achieve? If you have an ablation study table, could you share it?

  2. Did you use any fine-tuning options during training? If so, could you tell me the details (iterations and the changed initial lr)?

  3. Did you parallelize the GPUs during training as described in the reply in issue #8?

PingoLH commented 4 years ago

Hi,

  1. Val mIoU: 77.7, Test mIoU: 75.9.
  2. No, one pass only.
  3. Yes (V100 x2).
DonghweeYoon commented 4 years ago

Hi. Did you use synchronized batch normalization for training?

PingoLH commented 4 years ago

Hi, no, the model was trained with the native nn.BatchNorm2d.

DonghweeYoon commented 4 years ago

Thank you for answering.

I think your setting, v100 GPU 16GB x 2, is quite good.

However, when I trained the model using your code, I ran into the GPU memory imbalance reported in #8 (Training unbalance on different GPUs). In my setting (8x GTX 1080), the master GPU, which computes the loss, is heavily overloaded when training with many GPUs: for example, the master GPU needs 7507 MiB while the others need only about 1600 MiB. In this case, the maximum mini-batch size per GPU is 1, so I can't expect batch normalization to be effective.
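For context, the sketch below shows the standard torch.nn.DataParallel pattern that produces this imbalance: the replicas' outputs are gathered onto cuda:0 and the loss is computed there, so the master GPU holds the full-batch logits on top of its own replica. The toy model, batch size, and ignore_index are placeholders, not FCHarDNet's actual code.

```python
import torch
import torch.nn as nn

# Toy stand-in for the segmentation model; the point is only where the
# memory ends up, not the architecture.
model = nn.DataParallel(nn.Conv2d(3, 19, kernel_size=3, padding=1).cuda())
criterion = nn.CrossEntropyLoss(ignore_index=250)  # ignore_index is a placeholder

images = torch.randn(8, 3, 1024, 1024, device="cuda:0")
labels = torch.randint(0, 19, (8, 1024, 1024), device="cuda:0")

outputs = model(images)            # replicas run on every GPU, outputs gathered on cuda:0
loss = criterion(outputs, labels)  # loss over the full gathered output, computed on cuda:0
loss.backward()
```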

Many other users may have run into this problem. I think synchronized batch normalization could be one solution.
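As a rough illustration of that suggestion (this is the standard PyTorch route, not something FCHarDNet itself provides), the BatchNorm layers can be converted to nn.SyncBatchNorm and the model wrapped in DistributedDataParallel with one process per GPU:

```python
import torch
import torch.nn as nn

# Assumes torch.distributed has already been initialized with one process
# per GPU, e.g. via torch.distributed.init_process_group("nccl", ...).
model = nn.Sequential(
    nn.Conv2d(3, 19, kernel_size=3, padding=1),
    nn.BatchNorm2d(19),
)  # placeholder for FC-HarDNet

# Replace every nn.BatchNorm2d with nn.SyncBatchNorm so batch statistics
# are synchronized across processes instead of computed per GPU.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model).cuda()
model = nn.parallel.DistributedDataParallel(
    model, device_ids=[torch.cuda.current_device()]
)
```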

PingoLH commented 4 years ago

Regarding that issue, I forgot to follow up: I found a custom DataParallel version from CornerNet that has a chunk_sizes argument for an imbalanced batch distribution, so you can better utilize the GPU memory.
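A hypothetical usage sketch, assuming the chunk_sizes signature from the CornerNet repository (the wrapper itself would need to be copied into the project, e.g. as data_parallel.py; it is not part of FCHarDNet): the master GPU, which also computes the loss, is handed a smaller slice of the batch.

```python
import torch.nn as nn

# Hypothetical import: a local copy of CornerNet's custom DataParallel
# (not torch.nn.DataParallel); its chunk_sizes argument is assumed from that repo.
from data_parallel import DataParallel

model = nn.Conv2d(3, 19, kernel_size=3, padding=1)  # placeholder for FC-HarDNet

total_batch = 16
chunk_sizes = [1, 5, 5, 5]  # give the loss-bearing master GPU (cuda:0) fewer samples
assert sum(chunk_sizes) == total_batch

model = DataParallel(model, device_ids=[0, 1, 2, 3], chunk_sizes=chunk_sizes).cuda()
```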

DonghweeYoon commented 4 years ago

I will try it. Thank you for your fast response.