Open apptcom1123 opened 7 months ago
Your understanding is right. The normalized pixel loss comes from the original image MAE paper. If you want to use VideoMAE for reconstruction, you could refer to the original image MAE repo, which tries other losses such as a GAN loss; those give more reasonable reconstruction results.
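For context, the normalized pixel loss mentioned here standardizes each ground-truth patch with its own statistics before the MSE is computed. A minimal sketch with assumed tensor shapes (not the repo's actual code):

```python
# Minimal sketch (assumed shapes, not the repo code) of the normalized pixel target
# used by MAE/VideoMAE: each ground-truth patch is standardized with its own mean and
# (unbiased) std, and the MSE is computed against these normalized values.
import torch

def normalized_pixel_target(patches: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # patches: (B, N, P, C) = batch, patches, pixels per patch, channels
    mean = patches.mean(dim=-2, keepdim=True)
    std = patches.var(dim=-2, unbiased=True, keepdim=True).sqrt()
    return (patches - mean) / (std + eps)

# loss on the masked patches only (mask: (B, N) boolean), schematically:
# loss = ((pred - normalized_pixel_target(patches))[mask] ** 2).mean()
```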
Hi @apptcom1123, in the patch size 1 case, are you training everything from scratch? Also, if you get a better reconstruction, do you also get better accuracy during the fine-tuning phase?
I attempted to use VideoMAE for video reconstruction, and while the reconstructed videos looked correct at a coarse scale, they appeared blurry within each patch.
Initially, I thought this was a normal consequence of the MSE loss, which is well known to blur fine detail. However, when I set the patch size to 1, the model produced extremely good results, with almost perfect detail. At first I took this as a sign of the model's strength, or that my training videos were sufficient, but I was mistaken: even when I reduced the training set to 20 videos, the model still achieved very good results.
Then I trained the model on a set of black-and-white, static videos. It also produced similarly good results on the test set. This seemed unreasonable, especially since the trained model was very small (around 500 KB). So I started printing the output of each layer to see where these unreasonably good results were coming from, and eventually traced the issue to this line:
```python
rec_img = rec_img * (img_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6) + img_squeeze.mean(dim=-2, keepdim=True)
```
img_squeeze is a rearrangement of the original frames into per-patch pixels, and multiplying rec_img by each ground-truth patch's standard deviation and then adding its mean makes the results, good or bad, very close to the original image, which is especially noticeable when patch_size = 1. This also explains why the boundaries between patches in the original paper's reconstructions look so mismatched.
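For readers following along, here is a rough sketch of what img_squeeze looks like; the einops pattern and default sizes are assumed from the description rather than copied from the repo. The point is that dim=-2 in the quoted line is the within-patch pixel axis, so rec_img is rescaled by each ground-truth patch's own standard deviation and shifted by its own mean:

```python
# Rough sketch (pattern and sizes assumed, not copied from the repo) of how the
# original frames are regrouped into img_squeeze before the quoted line runs.
import torch
from einops import rearrange

patch_size, tubelet = 16, 2                 # typical defaults; patch_size = 1 in the experiment above
video = torch.rand(1, 3, 16, 224, 224)      # (B, C, T, H, W)

img_squeeze = rearrange(
    video,
    'b c (t p0) (h p1) (w p2) -> b (t h w) (p0 p1 p2) c',
    p0=tubelet, p1=patch_size, p2=patch_size,
)
# img_squeeze: (B, num_patches, pixels_per_patch, C). dim=-2 is the within-patch axis,
# so mean/var in the quoted line are the ground-truth statistics of each patch.
```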
However, this setup only makes the model learn the normalized patch values, i.e., each predicted patch is pushed to match the corresponding patch of videos_norm at every position as closely as possible.
So the good results are not evidence of a good model architecture: because the ground-truth per-patch mean and standard deviation are pasted back into the output, the reconstruction ends up close to the original video no matter how small the training dataset is.
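A tiny numerical check of this claim, using toy tensors rather than the repo code: paste the ground-truth mean and std back onto an all-zero "prediction" and it reproduces a static video exactly, and any video nearly exactly once the patches are tiny.

```python
# Toy check (not the repo code): with the ground-truth per-patch statistics added back,
# even a prediction carrying zero information reconstructs a static video perfectly.
import torch

B, N, P, C = 1, 1024, 2, 3                              # patch_size = 1 -> each patch holds only a tubelet of 2 pixels
patches = torch.rand(B, N, 1, C).repeat(1, 1, P, 1)     # temporally static content, as in the experiment above

mean = patches.mean(dim=-2, keepdim=True)
std = patches.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6

pred = torch.zeros(B, N, P, C)                          # "model output" with no information about the video
rec = pred * std + mean                                 # the un-normalization step from the quoted line

print((rec - patches).abs().max().item())               # ~0: near-perfect reconstruction without any learning
```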
Am I misunderstanding something?
If not, VideoMAE might not be a useful model for video reconstruction.