Re-train learning rate and epoch schedule

XuelianCheng / LEAStereo

Hierarchical Neural Architecture Search for Deep Stereo Matching (NeurIPS 2020)

MIT License


Re-train learning rate and epoch schedule #18

Open alekseynp opened 3 years ago

alekseynp commented 3 years ago

I'm trying to reproduce your results. I see that none of the train_*.sh scripts set a learning rate, so obtain_train_args() will give 0.001 by default. However, in all the released checkpoints I see:

checkpoint['optimizer']['param_groups'][0]['initial_lr']
0.0001
checkpoint['optimizer']['param_groups'][0]['lr']
1.25e-05

Is this just a small oversight? Might there be any other differences?
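For anyone repeating this check, here is a minimal sketch of how the two values can be read and compared. The helper name `summarize_lr` is hypothetical, and the dictionary stands in for what `torch.load` on a released checkpoint would return, using only the values quoted above. The halving count is just arithmetic on those two numbers; it assumes, without confirmation from the authors, that the decay was applied in steps of 0.5.

```python
import math


def summarize_lr(checkpoint):
    """Return (initial_lr, current_lr, decay_factor) for the first param group."""
    group = checkpoint["optimizer"]["param_groups"][0]
    initial_lr = group["initial_lr"]
    current_lr = group["lr"]
    return initial_lr, current_lr, current_lr / initial_lr


# Stand-in for: checkpoint = torch.load("sceneflow_best.pth", map_location="cpu")
# Only the two values reported in this thread are filled in.
checkpoint = {
    "optimizer": {"param_groups": [{"initial_lr": 0.0001, "lr": 1.25e-05}]}
}

initial_lr, current_lr, factor = summarize_lr(checkpoint)

# Assumption: if the schedule halves the rate at each step, this is the
# number of steps that have already been applied.
halvings = math.log2(initial_lr / current_lr)

print(f"initial_lr={initial_lr}, lr={current_lr}, factor={factor:.4f}, halvings={halvings:.2f}")
```

The stored rate is one eighth of the stored initial rate, which would be consistent with three halvings from 0.0001, not with the 0.001 default.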

alekseynp commented 3 years ago

Upon further investigation of the checkpoints I find the following:

# MiddEval3_best.pth
checkpoint['state_dict']['module.feature.stem0.bn.num_batches_tracked']
tensor(252038, device='cuda:0')
checkpoint['epoch']
444

# sceneflow_best.pth
checkpoint['state_dict']['module.feature.stem0.bn.num_batches_tracked']
tensor(106358, device='cuda:0')
checkpoint['epoch']
2

With 35,454 entries in sceneflow_train.list and --batch_size=4, I would expect 35454//4 = 8863 batches per epoch and therefore 8863*2 = 17726 batches tracked after 2 epochs. Note that 106358 / 17726 = 6.000112829, i.e. almost exactly 6 times the expected count.

With 15 entries in middeval3_train.list and --batch_size=2, I would expect 15//2 = 7 batches per epoch and therefore 7 * 444 = 3108 batches tracked, possibly on top of the sceneflow checkpoint? Even then, 106358 + 3108 = 109466 is far from the recorded 252038.

So I would guess there are also some oversights in the epochs/LR schedule of the re-training code.
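The arithmetic above can be re-run with a short sketch. The helper `expected_batches` is hypothetical, and it assumes the incomplete last batch is dropped (floor division); if the loader keeps it, the per-epoch counts would be one higher.

```python
def expected_batches(num_samples, batch_size, epochs):
    """Batches seen after `epochs` epochs, assuming incomplete batches are dropped."""
    return (num_samples // batch_size) * epochs


# Values reported in this thread.
sceneflow_tracked = 106358   # num_batches_tracked in sceneflow_best.pth
midd_tracked = 252038        # num_batches_tracked in MiddEval3_best.pth

sceneflow_expected = expected_batches(35454, 4, 2)
sceneflow_ratio = sceneflow_tracked / sceneflow_expected

midd_expected = expected_batches(15, 2, 444)
# Batches that would have been added during Middlebury fine-tuning IF it
# started from the SceneFlow checkpoint (an assumption, not confirmed).
midd_extra = midd_tracked - sceneflow_tracked

print(f"SceneFlow: expected {sceneflow_expected}, tracked {sceneflow_tracked}, ratio {sceneflow_ratio:.6f}")
print(f"Middlebury: expected {midd_expected}, tracked beyond SceneFlow {midd_extra}")
```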

ShichenLiu commented 3 years ago

Hi,

Have you found out the learning rate schedule? I re-trained the model for 20 epochs with a constant learning rate of 0.001 but couldn't reproduce the results: I only got 1.03 EPE and a 10.1% error rate, far from the reported numbers. This makes me suspect that the model used in the paper was actually trained for more than 20 epochs.

Also, the code uses DataParallel modules, so it is possible that the model was trained on multiple GPUs. My guesses are: 1) 2 V100 GPUs, which leads to 24 epochs in total (maybe with a learning rate decay at epoch 20); 2) 4 V100 GPUs, which leads to 48 epochs in total.

Update: I was talking about the sceneflow dataset.

alekseynp commented 3 years ago

My training setup is a bit different. I have dramatically refactored the code and made other changes that are not worth discussing. So far I have been able to mostly reproduce their Middlebury results with this schedule:

batch size 4 (2 per GPU on 2 GPUs)
lr 0.001
gamma 0.5
milestones 800, 1600, 2400, 3200, 4000
max epochs 4800
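As a rough illustration of what this schedule implies, the sketch below computes the learning rate at a given epoch under a multi-step decay. It assumes the milestones are epoch indices and that the rate is multiplied by gamma at each one, which is the same closed form that PyTorch's `torch.optim.lr_scheduler.MultiStepLR` uses. The function name `multistep_lr` is hypothetical, and the comment above does not say which scheduler was actually used.

```python
from bisect import bisect_right

# Schedule values as listed in the comment above.
BASE_LR = 0.001
GAMMA = 0.5
MILESTONES = (800, 1600, 2400, 3200, 4000)
MAX_EPOCHS = 4800


def multistep_lr(epoch, base_lr=BASE_LR, gamma=GAMMA, milestones=MILESTONES):
    """Learning rate at `epoch`: base_lr times gamma once per milestone passed."""
    return base_lr * gamma ** bisect_right(milestones, epoch)


# Print the rate at the start, at each milestone, and at the final epoch.
for epoch in (0, *MILESTONES, MAX_EPOCHS - 1):
    print(f"epoch {epoch:4d}: lr = {multistep_lr(epoch):.6g}")
```

Under these assumptions the rate is halved five times over the run, ending at 0.001 * 0.5**5.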

I would be very interested in a schedule that is known to reproduce the SceneFlow results!