Open alekseynp opened 3 years ago
Upon further investigation of the checkpoints I find the following:
# MiddEval3_best.pth
checkpoint['state_dict']['module.feature.stem0.bn.num_batches_tracked']
tensor(252038, device='cuda:0')
checkpoint['epoch']
444
# sceneflow_best.pth
checkpoint['state_dict']['module.feature.stem0.bn.num_batches_tracked']
tensor(106358, device='cuda:0')
checkpoint['epoch']
2
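For anyone who wants to repeat this check, here is a minimal sketch of reading the two fields from an already loaded checkpoint dict. The helper name `summarize_checkpoint` is hypothetical, and the stand-in dict simply mirrors the sceneflow_best.pth values quoted above rather than loading a real file.

```python
BN_KEY = "module.feature.stem0.bn.num_batches_tracked"


def summarize_checkpoint(checkpoint, bn_key=BN_KEY):
    """Return (epoch, batches_tracked) from a loaded checkpoint dict.

    Hypothetical helper: it only assumes the two keys shown in this thread.
    """
    epoch = int(checkpoint["epoch"])
    # int() accepts a plain int as well as a zero-dimensional torch tensor.
    batches_tracked = int(checkpoint["state_dict"][bn_key])
    return epoch, batches_tracked


# Stand-in dict mirroring the sceneflow_best.pth values quoted above.
example = {"epoch": 2, "state_dict": {BN_KEY: 106358}}
print(summarize_checkpoint(example))  # (2, 106358)
```

With PyTorch installed, the dict would presumably come from something like `torch.load("sceneflow_best.pth", map_location="cpu")`.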
With 35,454 entries in sceneflow_train.list and --batch_size=4, I would expect 35454//4 = 8863 batches per epoch and therefore 8863*2 = 17726 batches tracked after 2 epochs. Note that 106358 / 17726 = 6.000112829, so the tracked count is almost exactly six times what the recorded epoch number implies.
With 15 entries in middeval3_train.list and --batch_size=2, I would expect 15//2 = 7 batches per epoch and therefore 7 * 444 = 3108 batches tracked, possibly on top of the count carried over from the sceneflow checkpoint?
So I would guess there are also some oversights in the epochs/LR schedule of the re-training code.
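The arithmetic above can be re-checked with a short sketch. It assumes, as the estimates above do, that a trailing partial batch is dropped (floor division); the function name `expected_batches` is made up for illustration.

```python
def expected_batches(num_samples, batch_size, epochs):
    """Batches seen after `epochs` passes, assuming partial batches are dropped."""
    return (num_samples // batch_size) * epochs


# SceneFlow: 35,454 samples, batch size 4, checkpoint reports epoch 2.
sceneflow_tracked = 106358
sceneflow_expected = expected_batches(35454, 4, 2)
sceneflow_ratio = sceneflow_tracked / sceneflow_expected
print(sceneflow_expected)  # 17726
print(round(sceneflow_ratio, 6))  # 6.000113

# Middlebury: 15 samples, batch size 2, checkpoint reports epoch 444.
middlebury_tracked = 252038
middlebury_expected = expected_batches(15, 2, 444)
print(middlebury_expected)  # 3108

# Batches tracked beyond the sceneflow checkpoint, if it was the starting point.
extra_batches = middlebury_tracked - sceneflow_tracked
print(extra_batches)  # 145680
```

Neither tracked count matches the expected count, which is what points to the recorded epochs not reflecting the actual amount of training.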
Hi,
Have you figured out the learning rate schedule? I re-trained the model for 20 epochs with a constant learning rate of 0.001 but couldn't reproduce the results: I only got 1.03 EPE and a 10.1% error rate, far from the reported numbers. This makes me suspect that the model used in the paper was actually trained for more than 20 epochs.
Also, the code uses DataParallel modules, so it is possible that the model was trained on multiple GPUs. My guess is either: 1) 2 V100 GPUs, which leads to 24 epochs in total (maybe with a learning rate decay at epoch 20); or 2) 4 V100 GPUs, which leads to 48 epochs in total.
Update: I was talking about the sceneflow dataset.
My training setup is a bit different. I have dramatically refactored the code and made other changes that are not worth discussing. So far I have been able to mostly reproduce their Middlebury results with this schedule:
I would be very interested in a schedule that is known to reproduce the SceneFlow results!
I'm looking at trying to reproduce your results. I see that none of the train_*.sh scripts include a learning rate, and therefore obtain_train_args() will give 0.001 by default. However, I see in all the released checkpoints:
Is this just a small oversight? Might there be any other differences?
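To illustrate why a missing flag matters, here is a minimal sketch of how an argparse default applies when a script does not pass a learning rate. The parser below is a hypothetical stand-in, not the repository's obtain_train_args(), and the flag name `--lr` is an assumption; the thread only states that the default is 0.001.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for a training argument parser."""
    parser = argparse.ArgumentParser(description="hypothetical training args")
    # Assumed flag name; only the 0.001 default comes from the thread.
    parser.add_argument("--lr", type=float, default=0.001)
    return parser


# A script that omits the flag silently gets the default.
default_args = build_parser().parse_args([])
print(default_args.lr)  # 0.001

# Passing the flag explicitly overrides it.
custom_args = build_parser().parse_args(["--lr", "0.0001"])
print(custom_args.lr)  # 0.0001
```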