ngoductuanlhp opened this issue 5 months ago
Hi @nikitakaraevv,

Thank you for your excellent work.

I have a question regarding the training pipeline. I'm currently trying to reproduce the results in Table 3 of your paper. When I train the model from scratch on the Kubric dataset, the best evaluation result on TAP-Vid DAVIS is as follows:

"occlusion_accuracy": 0.8503666396802487
"average_jaccard": 0.5575681919643163
"average_pts_within_thresh": 0.7087581437592014

These results are significantly lower than those obtained with your provided checkpoint. I'm using Torch 2.1.0 with CUDA 12.3, and I trained the model on 8 A100 GPUs for 200,000 iterations with gradient accumulation of 4 to mimic your 32-GPU setting, roughly as sketched below.
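To make my setup concrete, this is approximately what my accumulation loop does. It's a minimal self-contained sketch with dummy stand-ins (`model`, `optimizer`, and `train_loader` here are placeholders, not CoTracker's actual training code):

```python
import torch
from torch import nn

# Dummy stand-ins for illustration only -- not CoTracker's real model or data.
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
train_loader = [(torch.randn(16, 8), torch.randn(16, 1)) for _ in range(8)]

ACCUM_STEPS = 4  # 8 GPUs x 4 accumulation steps ~= the 32-GPU effective batch

optimizer.zero_grad()
for step, (x, y) in enumerate(train_loader):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / ACCUM_STEPS).backward()  # scale so the accumulated gradients average out
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()             # one optimizer update per ACCUM_STEPS batches
        optimizer.zero_grad()
```

One detail I'm unsure about: with accumulation of 4, 200k dataloader iterations correspond to only 50k optimizer updates, so it matters which of the two my "iterations" should count against your schedule.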
Do you think the issue could be due to mismatched library versions, or might I be missing something else? I'd appreciate any guidance you can provide.

Thank you.
nikitakaraevv commented:

Hi @ngoductuanlhp, I don't think such a big gap could be due to mismatched library versions.

We train either on 32 GPUs for 50k iterations or on 8 GPUs for 200k. I obtained similar performance with both settings, though 32 GPUs is slightly better. So, have you tried training the model on 8 GPUs for 200k iterations without gradient accumulation?

Also, how do you evaluate the model?
"occlusion_accuracy": 0.8503666396802487 "average_jaccard": 0.5575681919643163 "average_pts_within_thresh": 0.7087581437592014 These results are significantly lower than those obtained with your provided checkpoint. I'm using Torch 2.1.0 with CUDA 12.3, and trained the model on 8 A100 GPUs with 200000 iterations, and accumulate gradient of 4 to mimic your setting.
Do you think the issue could be due to mismatched library versions, or might I be missing something else? I appreciate any guidance you can provide.
Thank you.
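For reference, this is my understanding of how "average_pts_within_thresh" is computed. It's a rough sketch based on the TAP-Vid metric definition, not the repo's actual evaluation code; the array names and shapes are my own, and I'm assuming the standard pixel thresholds (1, 2, 4, 8, 16) at 256x256 resolution:

```python
import numpy as np

def average_pts_within_thresh(pred_xy, gt_xy, gt_visible,
                              thresholds=(1, 2, 4, 8, 16)):
    """pred_xy, gt_xy: (num_points, num_frames, 2) pixel coordinates;
    gt_visible: (num_points, num_frames) boolean visibility mask."""
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1)  # per-point pixel error
    # For each threshold, the fraction of *visible* points within that distance.
    fracs = [(dist[gt_visible] < t).mean() for t in thresholds]
    return float(np.mean(fracs))

# Sanity check: perfect predictions should score 1.0.
pts = np.random.rand(5, 10, 2) * 256
print(average_pts_within_thresh(pts, pts, np.ones((5, 10), dtype=bool)))
```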