Computer-Vision-in-the-Wild / Elevater_Toolkit_IC

Toolkit for Elevater Benchmark
MIT License

Does the repo pick the weights that perform best on the val dataset to evaluate on the test dataset? #14

Open · pierowu opened this issue 1 year ago

pierowu commented 1 year ago

Thank you for your solid work. Does the repo implement the logic that picks the model weights that perform best on the val dataset and then evaluates them on the test dataset? From the code below, it seems that the repo directly takes the best results on the test dataset as the final results: https://github.com/Computer-Vision-in-the-Wild/Elevater_Toolkit_IC/blob/00d0af78559d5f3d800ae4668210e6bd1f2f84b9/vision_benchmark/evaluation/full_model_finetune.py#L267-L277
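
For context, the pattern being asked about looks roughly like the sketch below (hypothetical variable names and dummy numbers, not the toolkit's actual code): the best test-set accuracy across epochs is tracked and reported directly.

```python
# Hypothetical sketch of the questioned pattern: the best test-set
# accuracy over all epochs is reported directly as the final result.
test_acc_per_epoch = [0.71, 0.74, 0.73, 0.75]  # dummy per-epoch test accuracies

best_test_acc = max(test_acc_per_epoch)  # "best epoch" is chosen via the test set itself
print(f"reported accuracy: {best_test_acc:.2f}")
```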

ChunyuanLI commented 1 year ago

Thanks for the careful study.

When selecting the hyper-parameters, we sweep and validate on the validation dataset:

https://github.com/Computer-Vision-in-the-Wild/Elevater_Toolkit_IC/blob/00d0af78559d5f3d800ae4668210e6bd1f2f84b9/vision_benchmark/evaluation/full_model_finetune.py#L173

Once the best hyper-parameters are selected, we report the best results on the test set; the line of code you point to means that the best weights/epoch are chosen by looking at the best numbers on the test set.

For real applications, where we are not allowed to use test-set labels to determine the number of training epochs, we recommend fixing the number of epochs in the hyper-parameter search stage.
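
As an illustration of that recommendation, here is a minimal hedged sketch (hypothetical variable names and dummy numbers, not the toolkit's code): the checkpoint is selected by validation accuracy, and only the corresponding test accuracy is reported.

```python
# Hedged sketch: choose the epoch on the validation set, then report the
# test accuracy of that same epoch (test labels never drive the choice).
val_acc_per_epoch  = [0.68, 0.72, 0.70, 0.71]   # dummy numbers for illustration
test_acc_per_epoch = [0.71, 0.74, 0.73, 0.75]

best_epoch = max(range(len(val_acc_per_epoch)), key=lambda e: val_acc_per_epoch[e])
print(f"selected epoch {best_epoch}, test accuracy {test_acc_per_epoch[best_epoch]:.2f}")
```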

pierowu commented 1 year ago

Thank you for your reply. Wouldn't this approach cause overfitting to the test set? For example, one could design a high-variance model that performs well on the test set for several epochs but deteriorates in the others. When I want to compare with other methods on the ELEVATER benchmark, how can I make the comparison fair?