DLUT-yyc / Isomer

[ICCV2023] Isomer: Isomerous Transformer for Zero-Shot Video Object Segmentation

Training Results Question #5

Open · YangSheng0511 opened this issue 1 month ago

YangSheng0511 commented 1 month ago

Following the instructions in the Isomer paper, Section 3.3 (Implementation Details), I pretrained Isomer on a subsampled YouTube-VOS dataset (1 frame from every 30 frames) for 500 epochs (as set in the code) and fine-tuned it on DAVIS-16 + FBMS for more than 500 epochs, yet only achieved a best of 67.65 J&F on the DAVIS-16 test set and a far lower score (58.6) on FBMS. Suspecting the small pretraining dataset was the cause, I repeated the training using all of the YouTube-VOS data for pretraining, but still only got a best of 66.9 J&F on DAVIS-16 and 76.8 J on FBMS.
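For concreteness, the frame subsampling I applied looks roughly like the minimal sketch below; `subsample_sequence` is my own illustrative helper (it assumes each YouTube-VOS sequence is a directory of ordered frame images), not code from this repo:

```python
import os

# Minimal sketch of the "1 frame from every 30 frames" subsampling;
# assumes each YouTube-VOS sequence is a directory of frame images
# whose filenames sort in temporal order.
def subsample_sequence(seq_dir, stride=30):
    frames = sorted(os.listdir(seq_dir))
    return [os.path.join(seq_dir, f) for f in frames[::stride]]
```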

Since these results differ significantly from the reported ones, I wonder if I missed some details, or if the actual training process differs from the paper and the code, which could explain my failure to reproduce the results. I sincerely hope to get a reply, and would be very grateful for it.

DLUT-yyc commented 1 month ago

I think the final model you trained might be overfitting. During our actual training, we validate the model on the DAVIS validation set every 10 epochs. After pretraining for 500 epochs on the YouTube-VOS dataset, we select the best-performing model among the 50 validation points (on our server, usually the model at epoch 120) for fine-tuning. By the 500th fine-tuning epoch, the model fits reasonably well on our validation curve. Although a model trained for a different number of epochs might yield better performance, we did not select the single best-performing checkpoint; instead, we chose the epoch-500 model as the final model, since it lies within an appropriate fitting range. Training beyond 500 epochs, however, could lead to overfitting.

Since results may vary across different computing devices, I recommend visualizing the validation curve to track the model's generalization and overfitting in real time. Good luck!
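In code, the procedure looks roughly like the sketch below. This is a minimal illustration, not our actual training script: `val_fn`, the loaders, and the assumption that the model returns its loss are all placeholders for the repo's own utilities.

```python
import torch

def train_with_checkpoint_selection(model, optimizer, train_loader, val_fn,
                                    num_epochs=500, val_every=10,
                                    ckpt_path="best_checkpoint.pth"):
    """Validate every `val_every` epochs and keep the best-J&F checkpoint.

    `val_fn(model) -> float` is a caller-supplied function returning the
    J&F score on the DAVIS-16 validation set (a placeholder for the
    repo's evaluation code).
    """
    history = []                          # (epoch, J&F) pairs: the validation curve
    best_jf, best_epoch = float("-inf"), -1
    for epoch in range(1, num_epochs + 1):
        model.train()
        for frames, masks in train_loader:
            optimizer.zero_grad()
            loss = model(frames, masks)   # assumed: the model returns its loss
            loss.backward()
            optimizer.step()
        if epoch % val_every == 0:        # 50 validation points over 500 epochs
            model.eval()
            with torch.no_grad():
                jf = val_fn(model)
            history.append((epoch, jf))
            if jf > best_jf:              # keep the best checkpoint so far
                best_jf, best_epoch = jf, epoch
                torch.save(model.state_dict(), ckpt_path)
    return history, best_epoch, best_jf
```

Plotting `history` (e.g., with matplotlib) makes it easy to see where the validation curve flattens or starts to drop, which is where overfitting sets in.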

YangSheng0511 commented 1 month ago

Thanks so much for your reply! I did choose the best-performing pretraining checkpoint for the subsequent fine-tuning; it occurs at around epoch 200 and achieved 76.8 on the DAVIS-16 test set. I then took the best result among 500+ fine-tuning epochs as the final result mentioned above. However, I validated the model only every 25 epochs during fine-tuning, which could lead to imprecise observations. May I ask what performance your chosen pretraining checkpoint reaches on the DAVIS-16 test set? That would help me verify whether I made mistakes in the code or the training process.

Sincere gratitude for your reply again!

DLUT-yyc commented 1 month ago

Hi, the pre-trained model can achieve approximately 78 J&F accuracy on the DAVIS-16 validation set, which is roughly consistent with the results of your pre-trained model. Therefore, the issue might have occurred during the fine-tuning stage. I hope this information helps you.