MCG-NJU / VideoMAE

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
https://arxiv.org/abs/2203.12602

VideoMAE might not be a useful model for video reconstruction? Or perhaps it only learns the most generic distribution within patches? #120

Open apptcom1123 opened 7 months ago

apptcom1123 commented 7 months ago

I attempted to use VideoMAE for video reconstruction tasks, and while the reconstructed videos looked correct at a coarse scale, they appeared blurry within each patch.

Initially, I thought this was a normal phenomenon caused by the MSE loss, since it is well known that this loss tends to blur fine details. However, when I set the patch size to 1, the model predicted extremely good results, with almost perfect details. At first, I took this as evidence of the model's strength, or that my training videos were sufficient, but I was mistaken: even when I reduced the training set to 20 videos, the model still achieved very good results.

Then I trained the model on a set of black-and-white, static videos. It also produced similarly good results on the test set. This seemed unreasonable, especially since the trained model was very small (around 500 KB). So I started printing out each layer's output to find what was producing these unreasonably good results, and eventually traced it to this line:

```python
rec_img = rec_img * (img_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6) + img_squeeze.mean(dim=-2, keepdim=True)
```

img_squeeze is a rearrangement of the original frames into per-patch pixel groups, and multiplying rec_img by the ground truth's per-patch standard deviation and then adding its per-patch mean makes the result, good or bad, very close to the original image. This is especially noticeable when patch_size = 1, and it also explains why the boundaries between patches in the original paper's reconstructions are particularly mismatched.

However, this de-normalization means the model itself only learns a normalized distribution within each patch, i.e., it only has to match the values of videos_norm at every position as closely as possible.

So the good results are not evidence of a strong model architecture: the de-normalization step alone pushes the output close to the original video, no matter how small the training dataset is.
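To make the effect concrete, here is a minimal, self-contained sketch of that round trip. The rearrange pattern and shapes follow the repo's visualization script as far as I can tell, but the static test video, the patch-size loop, and the function name denormalize_prediction are just my own illustration. Even a prediction that is pure noise comes back close to the original once the ground-truth per-patch mean and std are added back in, and with patch_size = 1 the effect is extreme:

```python
import torch
from einops import rearrange

def denormalize_prediction(ori_video, pred, patch_size=16, tubelet=2):
    # Group pixels by patch: (B, C, T, H, W) -> (B, num_patches, pixels_per_patch, C)
    img_squeeze = rearrange(
        ori_video, 'b c (t p0) (h p1) (w p2) -> b (t h w) (p0 p1 p2) c',
        p0=tubelet, p1=patch_size, p2=patch_size)
    mean = img_squeeze.mean(dim=-2, keepdim=True)
    std = img_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6
    # The step in question: ground-truth per-patch statistics are injected into the
    # prediction, regardless of what the model actually predicted.
    return pred * std + mean, img_squeeze

B, C, T, H, W = 1, 3, 16, 224, 224
# A "static" video: the same random frame repeated T times.
video = torch.rand(B, C, 1, H, W).repeat(1, 1, T, 1, 1)

for p in (16, 1):
    num_patches = (T // 2) * (H // p) * (W // p)
    pred = torch.randn(B, num_patches, 2 * p * p, C)  # pure noise, zero information
    rec, gt = denormalize_prediction(video, pred, patch_size=p)
    mse = ((rec - gt) ** 2).mean().item()
    print(f'patch_size={p}: MSE of noise prediction vs. original = {mse:.2e}')
```

The only thing bounding the error is the intra-patch variance of the original video, which shrinks toward zero as the patch gets smaller.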

Am I misunderstanding something?

If not, VideoMAE might not be a useful model for video reconstruction.


wanglimin commented 7 months ago

Your understanding is right. The normalized pixel loss comes from the original image MAE paper. If you want to use VideoMAE for reconstruction, you could refer to the original image MAE repo, which also tries other losses such as a GAN loss; those give more reasonable reconstruction results.
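In case it helps anyone reading later, here is a rough sketch (my own, not code from either repo) of how one might combine a pixel MSE on the raw, unnormalized patches with a simple adversarial term; PatchDiscriminator, the layer widths, and adv_weight are all hypothetical choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    """Tiny MLP that scores flattened patches as real (original) or fake (reconstructed)."""
    def __init__(self, patch_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1))

    def forward(self, patches):  # (B, N, patch_dim) -> (B, N, 1) real/fake logits
        return self.net(patches)

def generator_loss(disc, rec_patches, gt_patches, adv_weight=0.1):
    # Pixel MSE on raw (unnormalized) patches plus a non-saturating adversarial term.
    mse = F.mse_loss(rec_patches, gt_patches)
    logits = disc(rec_patches)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return mse + adv_weight * adv

def discriminator_loss(disc, rec_patches, gt_patches):
    real = disc(gt_patches)
    fake = disc(rec_patches.detach())  # do not backprop into the reconstruction model
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
            F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
```

During training one would alternate between updating the discriminator with discriminator_loss and updating the reconstruction model with generator_loss, as in a standard GAN setup.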

pooyafayyaz commented 4 months ago

Hi @apptcom1123, in the case of patch size 1, are you training everything from scratch? Also, if you get better reconstructions, do you also get better accuracy during the fine-tuning phase?