facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

MAE Spatiotemporal Learners (Garbage Results) #668

Closed innat closed 6 months ago

innat commented 1 year ago

Background

I was following the interesting work done by facebookresearch on Masked Autoencoders As Spatiotemporal Learners, which was later added to SlowFast/mae_st. With the provided pretrained weights, I tried to run inference on a series of video frames (a sample from Kinetics-400); however, the pretrained model gives totally nonsensical predictions.


Reproducible Code

Please find the gist here. COLAB

Info

innat commented 1 year ago

cc. @haooooooqi @feichtenhofer

innat commented 1 year ago

Hi @r-barnes, I've just noticed some updates to the mae_st repo. By any chance, could you please confirm whether the provided weights are in good health?

alpargun commented 7 months ago

Looking at the Colab, you have a long list of missing keys, i.e., layers that were not restored, when you load the checkpoint into the state dictionary:


So, you are running inference with random weights.
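A minimal, framework-free sketch of the check being described here: compare the model's parameter names against the checkpoint's keys before trusting any inference output. This mirrors what `torch.nn.Module.load_state_dict(..., strict=False)` reports via its `missing_keys`/`unexpected_keys` result; the layer names below are illustrative, not the actual mae_st keys.

```python
# Sketch: verify that a checkpoint actually covers the model's parameters.
# Any key in `missing` is left at its random initialization after loading.

def check_state_dict_coverage(model_keys, ckpt_keys):
    """Return (missing, unexpected) key lists, mirroring what
    load_state_dict(..., strict=False) reports in PyTorch."""
    missing = sorted(set(model_keys) - set(ckpt_keys))
    unexpected = sorted(set(ckpt_keys) - set(model_keys))
    return missing, unexpected

# Illustrative key names (not the real mae_st state dict):
model_keys = ["patch_embed.proj.weight", "blocks.0.attn.qkv.weight", "head.weight"]
ckpt_keys = ["patch_embed.proj.weight", "blocks.0.attn.qkv.weight", "decoder.head.weight"]

missing, unexpected = check_state_dict_coverage(model_keys, ckpt_keys)
print(missing)     # parameters left randomly initialized -> ['head.weight']
print(unexpected)  # checkpoint entries that matched nothing -> ['decoder.head.weight']
```

If `missing` is non-empty for anything other than a head you intend to fine-tune, inference results are meaningless, which is exactly the symptom in the Colab.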

The vision models in SlowFast, timm, MAE_ST, torchhub, etc., rarely have perfectly matching implementations. Usually the layer names, i.e., the state-dictionary keys, differ. So I would suggest running inference directly through the SlowFast framework, using the corresponding config file and class implementations.

If you have trouble restoring the model, I can provide more information or code snippets.
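When you do need to load a checkpoint across mismatched implementations, a common workaround is to rename the checkpoint keys before loading. A hedged sketch of that idea, with purely hypothetical prefix rules (the real mapping depends on the two implementations involved):

```python
# Sketch: rename state-dict keys via prefix substitutions so that a
# checkpoint from one implementation matches another's layer names.
# The rules below are hypothetical examples, not the actual mae_st mapping.

def remap_keys(ckpt, rules):
    """Return a new state dict with each key's first matching prefix
    rewritten according to (old_prefix, new_prefix) rules."""
    out = {}
    for key, value in ckpt.items():
        for old, new in rules:
            if key.startswith(old):
                key = new + key[len(old):]
                break  # apply at most one rule per key
        out[key] = value
    return out

ckpt = {"encoder.blocks.0.mlp.fc1.weight": "tensor"}
rules = [("encoder.", "")]  # e.g. strip an "encoder." wrapper prefix
print(remap_keys(ckpt, rules))  # {'blocks.0.mlp.fc1.weight': 'tensor'}
```

After remapping, re-run the missing/unexpected-key check; only when both lists are empty (or contain only intentionally re-initialized layers) should inference results be trusted.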

innat commented 7 months ago

Hello @alpargun, I stopped working on this long ago and forgot to close the ticket. I emailed one of the authors but have received no response so far. However, if you are able to fix the Colab issue with minimal effort, please go ahead. I highly appreciate your help. Thank you.