Sense-X / UniFormer

[ICLR2022] official implementation of UniFormer
Apache License 2.0

spatiotemporal behavior detection #107

Closed yan-ctrl closed 1 year ago

yan-ctrl commented 1 year ago

Hello, thank you for your work. I would like to ask how to apply this work to the AVA dataset to do spatiotemporal behavior detection.

Andy1621 commented 1 year ago

Sorry, I have not run AVA. However, I think you can follow VideoMAE to run it. They forked AlphAction to run AVA. Just copy the model and reuse their repo!

yan-ctrl commented 1 year ago

Good, thank you for your recommendation, but I'm afraid I don't have enough GPUs to run VideoMAE.

Andy1621 commented 1 year ago

Yes. My suggestion is to copy the UniFormer model into that repo and run it, just like how the model is used with MMDetection/MMSegmentation...
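
For example, a rough sketch of what "copying the model" could look like (the `uniformer_small` constructor and the timm-style `forward_features` helper are assumptions based on this repo's model code; adjust to the file you actually copy):

```python
# Hedged sketch: wrap the copied UniFormer video model as a feature
# backbone for a detection codebase. Nothing here is AlphAction's
# actual API; it only illustrates the idea.
import torch
import torch.nn as nn


class UniFormerBackbone(nn.Module):
    """Exposes spatiotemporal features instead of classification logits."""

    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model
        # Assumption: the copied model has a `head` classifier attribute;
        # replace it so no logits are produced.
        self.model.head = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) video clip.
        # Assumption: a timm-style `forward_features` that returns the
        # feature map before pooling; otherwise call the patch embeddings
        # and stages of the copied model manually.
        return self.model.forward_features(x)
```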

yan-ctrl commented 1 year ago

Oh, you mean I should take a model pre-trained in your work or VideoMAE, and then fine-tune my own model in the AlphAction library?

Andy1621 commented 1 year ago

Yes. The above repo is based on AlphAction, and you can reuse their hyperparameters for transformer-based models. If you want to use UniFormer or another efficient backbone, you can port your model code into that repo like here (you may need to add ROIPooling).
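
As a very rough illustration of the ROI pooling part (using torchvision's `roi_align`; the class name, the temporal mean-pool, and the `spatial_scale` value are all illustrative choices, not AlphAction's real head):

```python
# Hedged sketch of an AVA-style ROI head on top of backbone features.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class SimpleROIHead(nn.Module):
    def __init__(self, in_channels: int, num_classes: int,
                 roi_size: int = 7, spatial_scale: float = 1 / 16):
        super().__init__()
        self.roi_size = roi_size
        self.spatial_scale = spatial_scale  # feature stride of the backbone
        self.classifier = nn.Linear(in_channels * roi_size * roi_size,
                                    num_classes)

    def forward(self, feats: torch.Tensor, boxes: list) -> torch.Tensor:
        # feats: (B, C, T, H, W). AVA labels are attached to keyframes, so a
        # common simplification is to average over the temporal axis first.
        feats2d = feats.mean(dim=2)
        # boxes: list of (N_i, 4) person boxes in image coordinates, one
        # tensor per clip in the batch.
        rois = roi_align(feats2d, boxes, output_size=self.roi_size,
                         spatial_scale=self.spatial_scale, aligned=True)
        # Per-box multi-label action logits (AVA uses sigmoid + BCE).
        return self.classifier(rois.flatten(1))
```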

yan-ctrl commented 1 year ago

Thank you for your patience, but I still have questions about https://github.com/MCG-NJU/VideoMAE-Action-Detection. Although it uses AlphAction, the provided pre-trained models are all based on ViT, and SlowFast is not used as the backbone network, so:

1) AlphAction just acts as the detection head, right? Do all models need to be based on ViT?

2) Regarding VideoMAE-Action-Detection/modeling_finetune.py: if I use another backbone, how can I conduct MAE training?

Andy1621 commented 1 year ago

1. The repo is used for training an action detection model on top of Kinetics-pretrained weights (see the loading sketch below).
2. Your original problem is how to apply UniFormer to the AVA dataset. In my opinion, you can reuse their repo and add the UniFormer model. Why do you want to conduct MAE training?
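
For point 1, initializing from a Kinetics-pretrained checkpoint could look roughly like this (the checkpoint key layout is an assumption; released checkpoints sometimes nest weights differently):

```python
# Hedged sketch: load Kinetics-pretrained weights into the backbone
# before AVA fine-tuning, skipping the classification head.
import torch
import torch.nn as nn


def load_kinetics_weights(model: nn.Module, ckpt_path: str) -> None:
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Assumption: some releases nest the weights under a "model" key.
    state_dict = ckpt.get("model", ckpt)
    # strict=False tolerates the classifier head that fine-tuning replaces.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print("missing keys:", missing)
    print("unexpected keys:", unexpected)
```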

yan-ctrl commented 1 year ago

Well, because I want to apply it to my own task, I need to build a custom dataset in an AVA-like format, and labeling is troublesome: I can't label a dataset as large as AVA, so I want to see whether self-supervised learning can help. As you said, I can train the backbone network's parameters on the Kinetics dataset and transfer them to the downstream task. But MAE uses ViT as its backbone, while your work is also a good backbone network. So I asked whether AlphAction only plays the role of evaluating the MAE model through action detection, like the classifier of an image segmentation network, or whether, as you said, I should use the UniFormer model and reuse the AlphAction repo.

Andy1621 commented 1 year ago

Q: "So I asked you if AlphAction only plays the role of using motion detection to evaluate the MAE model." A: AlphAction is a general codebase for training action detection models. It's not only used for the MAE model. You can use other models as backbones.

yan-ctrl commented 1 year ago

OK, thank you for your patience. I see.