**Open** · wkbian opened this issue 4 months ago
Hi @wkbian, it is normal that the training loss is a bit noisy. Can you run the evaluation on CVO to properly evaluate the performance of the final model? For example:
```
python test_cvo.py --split final --refiner_path checkpoints/YOUR_RUN/last.pth
```
I found a bug in the distributed training mode: all the GPUs were sampling the same elements of the dataset simultaneously. The issue is fixed in https://github.com/16lemoing/dot/commit/cdee971fb0615fe3bf7b6fd19d856ea572327ec1 .
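For anyone hitting the same symptom, the fix boils down to giving each process a disjoint, rank-dependent shard of the data instead of letting every GPU draw the same samples. Below is a minimal pure-Python sketch of that idea, not the actual DOT code; the function name and arguments are hypothetical:

```python
import random


def shard_indices(num_samples, rank, world_size, seed=0, epoch=0):
    """Shuffle once with a seed shared by all ranks, then give each
    rank a disjoint strided slice of the shuffled order.

    Every process must use the same (seed + epoch) so the shuffled
    order agrees across ranks, but a different `rank` so that no two
    GPUs ever sample the same elements in the same step.
    """
    order = list(range(num_samples))
    random.Random(seed + epoch).shuffle(order)  # identical order on all ranks
    return order[rank::world_size]              # disjoint shard per rank


# With 4 GPUs, the four shards partition the dataset with no overlap:
shards = [shard_indices(8, r, world_size=4) for r in range(4)]
assert sorted(i for s in shards for i in s) == list(range(8))
```

In PyTorch this is what `torch.utils.data.distributed.DistributedSampler` does (together with `set_epoch` to reshuffle between epochs); the bug amounts to every rank effectively sampling from the full, identically-ordered dataset instead.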
Also, setting the flag `--lambda_motion_loss 1000` during training improves motion prediction quality a bit, at the cost of slightly degraded visibility prediction. This is what we use in our final method.
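As a sketch of what such a flag typically does, it rescales the motion term relative to the visibility term in the combined objective. The names below are hypothetical, not taken from the DOT training code:

```python
def combined_loss(motion_loss, visibility_loss, lambda_motion=1000.0):
    """Weighted sum of the two objectives.

    A large lambda_motion (e.g. 1000) makes even small motion errors
    dominate the total, prioritizing motion accuracy over visibility.
    """
    return lambda_motion * motion_loss + visibility_loss


# With lambda_motion=1000, a small motion error outweighs visibility:
assert combined_loss(0.01, 0.5) == 10.5
```

The trade-off mentioned above follows directly: the optimizer spends most of its gradient budget on the heavily weighted motion term.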
Hi, @16lemoing,
Congratulations on your paper acceptance! :tada:
I encountered some problems while reproducing your training results. I followed the instructions in the training section, but the motion loss did not seem to converge when I set `world_size = 4`, which aligns with the setting in the paper: "DOT is trained on frames at resolution 512×512 for 500k steps with the ADAM optimizer [32] and a learning rate of 10⁻⁴ using 4 NVIDIA V100 GPUs." Could you please provide some suggestions? Thanks!