OpenGVLab / VideoMAEv2

[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

https://arxiv.org/abs/2303.16727

MIT License

There seem to be some strange things in the evaluation of the model. #23

Closed by leexinhao 1 year ago

leexinhao commented 1 year ago

I ran a simple fine-tuning experiment, and the results during training look normal: [screenshot]

But when I re-evaluated the saved .pt checkpoint with the '--eval' parameter, I got slightly different results. In particular, the single-view test results differ noticeably (65.xx vs. 67.xx).

Is this normal or a bug? Is there something wrong with my understanding?

congee524 commented 1 year ago

65.748 is not the single-view result; it is the average accuracy over all clips (at test time, each video is evaluated with 5x3 views). During validation we always take the middle clip of the video, so the accuracy (67.0) is higher.
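To make the distinction concrete, here is a minimal sketch of how the three numbers can differ on the same predictions. The scores, video names, and helper functions are all made up for illustration and are not taken from the VideoMAEv2 code; the sketch also assumes that the views of each video are ordered so that the middle index is the central clip.

```python
# Toy illustration: per-clip accuracy vs. middle-clip accuracy vs. video-level accuracy.
# All scores below are made-up numbers, not results from this thread.

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

def per_clip_accuracy(clip_scores, labels):
    """Fraction of individual clips (views) that are classified correctly."""
    correct = total = 0
    for video_id, views in clip_scores.items():
        for scores in views:
            correct += argmax(scores) == labels[video_id]
            total += 1
    return correct / total

def middle_clip_accuracy(clip_scores, labels):
    """Accuracy when only the middle view of each video is evaluated."""
    correct = 0
    for video_id, views in clip_scores.items():
        correct += argmax(views[len(views) // 2]) == labels[video_id]
    return correct / len(clip_scores)

def video_level_accuracy(clip_scores, labels):
    """Average the class scores over all views of a video, then predict once."""
    correct = 0
    for video_id, views in clip_scores.items():
        n_classes = len(views[0])
        mean = [sum(v[c] for v in views) / len(views) for c in range(n_classes)]
        correct += argmax(mean) == labels[video_id]
    return correct / len(clip_scores)

# Two hypothetical videos, three views each, two classes.
clip_scores = {
    "vid_a": [[0.4, 0.6], [0.9, 0.1], [0.8, 0.2]],  # label 0; first view is wrong
    "vid_b": [[0.3, 0.7], [0.2, 0.8], [0.6, 0.4]],  # label 1; last view is wrong
}
labels = {"vid_a": 0, "vid_b": 1}

clip_acc = per_clip_accuracy(clip_scores, labels)
middle_acc = middle_clip_accuracy(clip_scores, labels)
video_acc = video_level_accuracy(clip_scores, labels)
print(f"per-clip accuracy:     {clip_acc:.3f}")
print(f"middle-clip accuracy:  {middle_acc:.3f}")
print(f"video-level accuracy:  {video_acc:.3f}")
```

In this toy data the off-center views are the ones that go wrong, so averaging over every clip gives a lower number than looking only at the middle clip, which is the same direction as the 65.748 vs. 67.0 gap described above.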

congee524 commented 1 year ago

The inconsistent results when reloading the checkpoint are due to run-time parameters, such as running_mean and running_var in normalization layers.

During training, each model on every GPU has its own run-time parameters. However, we only save the checkpoint on GPU0. When reloading the model, the run-time parameters of the model on other GPUs are loaded from the model on GPU0, leading to slight differences in the results.
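As a rough illustration of that mechanism, the toy simulation below lets each rank keep its own running mean over its own data shard and then saves only rank 0's value. It is pure Python with made-up data and a hypothetical update_running_mean helper; it does not use the repository's code, and whether it applies depends on the model actually having such unsynchronized buffers.

```python
# Toy simulation: each "GPU" (rank) updates its own running mean from its own
# data shard with no synchronization, and only rank 0's value is checkpointed.

def update_running_mean(running, batch, momentum=0.1):
    """Exponential moving average update, in the style of normalization-layer statistics."""
    batch_mean = sum(batch) / len(batch)
    return (1 - momentum) * running + momentum * batch_mean

# Hypothetical per-rank data shards (each inner list is one batch).
shards = {
    0: [[1.0, 2.0], [3.0, 4.0]],
    1: [[5.0, 6.0], [7.0, 8.0]],
}

running_mean = {rank: 0.0 for rank in shards}
for rank, batches in shards.items():
    for batch in batches:
        running_mean[rank] = update_running_mean(running_mean[rank], batch)

# Only rank 0 writes the checkpoint, so rank 1's statistic is never saved.
checkpoint = {"running_mean": running_mean[0]}

for rank in shards:
    diff = abs(checkpoint["running_mean"] - running_mean[rank])
    print(f"rank {rank}: in-training value {running_mean[rank]:.3f}, "
          f"reloaded value {checkpoint['running_mean']:.3f}, difference {diff:.3f}")
```

After reloading, every rank would evaluate with rank 0's statistic, so any rank whose own value had drifted away from it would see slightly different outputs than it did during training.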

leexinhao commented 1 year ago

As far as I know, VideoMAE has no BatchNorm, so there is no running_mean or running_var to synchronize. Are there any other parameters that need to be synchronized? And can we avoid this problem?

congee524 commented 1 year ago

Another possibility is inconsistent batch sizes. When the test data cannot be evenly divided by the batch size, the last batch is filled with randomly selected videos.

leexinhao commented 1 year ago

I can't find that in your code. Can you show me the location of the corresponding implementation?

congee524 commented 1 year ago

https://github.com/OpenGVLab/VideoMAEv2/blob/9492db0047a9e30446a4093543a1a39dfe62b459/run_class_finetuning.py#L439-L442

This code is adapted from DeiT. I haven't looked closely at how it is handled, but it should be related to the sampler.
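For reference, a minimal sketch of how a distributed sampler can end up evaluating some videos twice. It assumes the evaluation sampler behaves like PyTorch's torch.utils.data.DistributedSampler with shuffle=False and drop_last=False, which pads the index list by wrapping around to the start so that every process receives the same number of samples. The function name and the sizes below are hypothetical, and the sketch is pure Python rather than the repository's actual code.

```python
import math

def padded_indices_for_rank(dataset_len, num_replicas, rank):
    """Indices one process would evaluate when the index list is padded by
    wrap-around so that every process gets the same number of samples
    (no shuffling, nothing dropped). Assumes the padding is no longer than
    the dataset itself."""
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    padding = total_size - len(indices)
    indices += indices[:padding]  # duplicate the first few samples
    return indices[rank:total_size:num_replicas]

dataset_len, num_replicas = 10, 4  # hypothetical sizes
per_rank = [padded_indices_for_rank(dataset_len, num_replicas, r)
            for r in range(num_replicas)]
all_indices = [i for idxs in per_rank for i in idxs]
duplicates = sorted(i for i in set(all_indices) if all_indices.count(i) > 1)

print(per_rank)
print("evaluated:", len(all_indices),
      "unique:", len(set(all_indices)),
      "duplicated:", duplicates)
```

With 10 samples on 4 processes, 12 samples are evaluated and two of them are counted twice. Under this assumption the padding depends on the number of processes rather than on the batch size, so evaluating the same checkpoint with a different number of GPUs could shift the averaged metric slightly.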

leexinhao commented 1 year ago

I see, thank you!