PIC4SeR / AcT

Official code for "Action Transformer: A Self-attention Model for Short-time Pose-based Human Action Recognition", Pattern Recognition (2022).
https://www.sciencedirect.com/science/article/pii/S0031320321006634

POS_EMB #15

Closed polarbear55688 closed 1 month ago

polarbear55688 commented 1 month ago

Hello, when I was looking at the config.yaml file, I saw that POS_EMB on line 54 was commented out. I would like to know how the pos_emb.npy file was generated.

polarbear55688 commented 1 month ago

I have another question. I plan to use my own dataset (RGB video) as input to this model. Besides converting the data to 30 fps first, do any other corrections need to be made?

polarbear55688 commented 1 month ago

One last question. The paper mentions using a ViT-like method that cuts the image into multiple patches and feeds them into the model, but the patch size on line 52 of config.yaml is commented out. I would like to know how you divide the patches: is each frame of a video treated as one patch, or is each frame cut into fixed-size patches? If it is the latter, why is the patch value not defined? Sorry to bother you with so many questions.

simoneangarano commented 1 month ago

Hi @polarbear55688 Let me try to answer your questions:

Hello, when I was looking at the config.yaml file, I saw that POS_EMB on line 54 was commented out. I would like to know how the pos_emb.npy file was generated.

That file was part of a development effort on a smarter positional embedding, but we ultimately decided to discard that idea. Just ignore it.

I have another question. I plan to use my own dataset (RGB video) as input to this model. Besides converting the data to 30 fps first, do any other corrections need to be made?

There are no other corrections needed. As a remark, the AcT model takes human poses (skeletal data) as input, so you need to process your videos to extract human poses before using AcT.
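
If it helps, here is a minimal sketch of that preprocessing step, assuming MediaPipe Pose as the pose estimator (an assumption for illustration only: the MPOSE2021 data used in the paper is based on OpenPose/PoseNet keypoints, so the joint count, ordering, and normalization must match whatever configuration you actually train or evaluate with):

```python
# Sketch: turn an RGB video into a fixed-length sequence of 2D pose keypoints
# suitable for a pose-based model like AcT. MediaPipe Pose is an assumption here.
import cv2
import numpy as np
import mediapipe as mp

def video_to_keypoints(video_path, target_frames=30):
    """Return an array of shape (target_frames, K, 3) with (x, y, visibility) per joint."""
    pose = mp.solutions.pose.Pose(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks is None:
            continue  # no person detected; alternatively append zeros to keep the length
        frames.append([(lm.x, lm.y, lm.visibility)
                       for lm in result.pose_landmarks.landmark])
    cap.release()
    pose.close()
    keypoints = np.asarray(frames, dtype=np.float32)  # (T, 33, 3) with MediaPipe joints
    # AcT works on short fixed-length clips (30 frames at 30 fps in the paper),
    # so subsample (or pad) the sequence to the expected length.
    idx = np.linspace(0, len(keypoints) - 1, target_frames).astype(int)
    return keypoints[idx]
```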

One last question. The paper mentions using a ViT-like method that cuts the image into multiple patches and feeds them into the model, but the patch size on line 52 of config.yaml is commented out. I would like to know how you divide the patches: is each frame of a video treated as one patch, or is each frame cut into fixed-size patches? If it is the latter, why is the patch value not defined? Sorry to bother you with so many questions.

You're right, that shouldn't be commented out. However, we found that a patch dimension of 1 works best, and since 1 is the default value, nothing changes in practice.
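
To make that concrete, here is a minimal sketch of what a patch dimension of 1 means in this setting: each frame's pose keypoints become exactly one token, so no spatial patching of the image is performed at all. Shapes and names below are illustrative, not the exact repository code.

```python
# Sketch: with patch size 1, one flattened pose vector per frame is linearly
# projected to a token, so the Transformer sequence length equals the frame count.
import tensorflow as tf

B, T, K, C, d_model = 2, 30, 13, 4, 64      # batch, frames, joints, channels, embed dim

pose_clip = tf.random.normal([B, T, K * C])               # one flattened pose per frame
project = tf.keras.layers.Dense(d_model)
tokens = project(pose_clip)                                # (B, T, d_model): one token per frame

pos_emb = tf.Variable(tf.random.normal([1, T, d_model]))   # learnable positional embedding
tokens = tokens + pos_emb                                   # ready for the Transformer encoder
```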

Hope this helps!