aimagelab / STAGE_action_detection

Code of the STAGE module for video action detection

Actor feature #5

Closed tsminh closed 4 years ago

tsminh commented 4 years ago

Can you tell me more about how you get the actor features?

"features" -> a torch tensor with shape (num_actors, feature_size, t, h, w) containing actors features

So that means we first have to detect each actor, right? Then I will turn each actor into a numpy array of shape (1, 16, 224, 224) (16 frames and image size 224x224, according to the I3D repo) and feed it into I3D to get the feature. As far as I know, I can take the feature at the Mixed_5c layer with size (1, x, y, z, 1024). From some issues I have read, the longer the video, the bigger x, and the bigger the frame size, the bigger y and z. Then I follow this to turn (1, x, y, z, 1024) into (1024,). So I wonder what t, h, w are (I guess time, height, width? 😆). Please correct me if I am wrong.
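
In case it helps to show what I mean by that last step, something like this (just a sketch, the names and the example sizes are mine):

```python
import torch

# Mixed_5c output from I3D for one cropped actor clip: (1, x, y, z, 1024)
mixed_5c = torch.randn(1, 2, 7, 7, 1024)  # example values for x, y, z

# Average over x, y, z to get a single (1024,) vector for the actor
actor_feature = mixed_5c.mean(dim=(1, 2, 3)).squeeze(0)
print(actor_feature.shape)  # torch.Size([1024])
```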

Thank you for taking your time ^^

matteot11 commented 4 years ago

Hi! Thanks for pointing this out, maybe I haven't been clear enough.

STAGE's input consists of features extracted from a pre-trained backbone, which means that you need a backbone able to:

1) Take a raw video clip as input
2) Detect people
3) Output a feature vector for each actor found in the clip

I3D is a video-classification backbone, so to get the actors' features you need a separate detector and a RoIPooling (or RoIAlign, or similar) operation at an intermediate layer (you can also crop directly on the RGB frames, if you prefer, but I have not tried it). What STAGE expects as input is, for each clip, a (num_actors, feature_size, t, h, w) tensor, where t, h and w will be 1 if you have already averaged over these dimensions while extracting features; otherwise STAGE will do the averaging in space and time itself. The features I shared in this repo are (num_actors, 1024, 1, 7, 7), since I averaged in time but not in space during feature extraction. But the first operation done by STAGE is to average over t, h and w, so that is not a problem. Hoping to have clarified a bit.
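
To make the expected input concrete, a minimal sketch with random tensors (the shapes match the features I shared, the rest is just illustrative):

```python
import torch

num_actors = 5
# Features as shared in this repo: already averaged in time, not in space
features = torch.randn(num_actors, 1024, 1, 7, 7)  # (num_actors, feature_size, t, h, w)

# The first thing STAGE does is average over t, h and w,
# so each actor ends up with a single 1024-d vector
pooled = features.mean(dim=(2, 3, 4))  # (num_actors, 1024)
print(pooled.shape)
```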

Matteo

tsminh commented 4 years ago

Hi Matteo, I appreciate your quick response. It means a lot to me.

But there are some problems that I can't figure out.

Sorry, I know I may be asking some silly questions.

It would be great if you could share more details on how you extract the actor features. I also read the paper, but it is still over my head (😆).

Thank you for taking your time.

matteot11 commented 4 years ago

Don't worry! If you read section 4 of the paper (at the end of subsection "Backbones setup"), you can find:

"Features always come from the last layer of the backbone before classification, after averaging in space and time dimensions: feature size is 1024 and 2048 for I3D and R101-I3D-NL respectively"

which means that, after RoIPooling at Mixed_4f, each actor is forwarded through the remaining layers of I3D (in our case) up to the last layer before classification. Hence the 1024-dimensional features. But nothing forces you to do so: you can use features coming from any layer of the backbone (just be careful to change the Linears' in_features and out_features accordingly in the code).
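
For example (placeholder names and sizes, not the actual ones in the repo), if you extracted 832-channel features from an earlier I3D layer, the projection layers in the code would need to be adapted along these lines:

```python
import torch.nn as nn

feature_size = 832  # e.g. channel count of the earlier I3D layer you extract from
hidden_size = 512   # placeholder value, not necessarily the one used in the repo

# The in_features of the first Linear that consumes the actor features
# must match the feature_size of whatever layer you extract from
actor_projection = nn.Linear(in_features=feature_size, out_features=hidden_size)
```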

Let me know if you have any other doubts. Matteo

tsminh commented 4 years ago

Yeah, I get that part.

What I am facing right now is that I only get an output for the whole video, not for each actor. That is why I came up with the idea of cropping out each actor (RGB) and feeding it into I3D to get a feature for each one.

But I still want to try it your way, using RoIAlign. I don't know how to do that. How do you implement that step? Also, at that step the output's size is different from the original image's, so how do you use the RoIs detected before that step?

matteot11 commented 4 years ago

I know, I3D is just for video classification, you need to change it a bit. The steps are the following (a rough sketch in code is below the list):

- Forward the whole video through I3D up to Mixed_4f, to obtain a feature map (1, CH, T, H, W), if we consider a batch size of 1 clip. T, H and W depend on the input clip's duration, height and width.
- Use the actors' boxes from the detection network (scaled to the feature map's spatial shape) to RoIPool (or RoIAlign) the feature map obtained at the previous step.
- After RoIPool (RoIAlign), you will have a feature map of the same shape for each actor (the shape depends on the output size you choose for RoIPool/Align). So from a tensor (1, CH, T, H, W) you get a tensor (num_actors, CH, T, H', W'). Note that RoIPool needs a (BS, CH, H, W) tensor as input, so you need to RoIPool each frame independently and then concatenate the outputs along the time axis. Here I assume the same box for an actor through time (a straight spatio-temporal tube).
- Forward each actor through the last layers of I3D.
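
A rough sketch of those steps in PyTorch (the two halves of I3D are placeholders, not actual module names from this repo, and the boxes are assumed to be already scaled to the feature map):

```python
import torch
from torchvision.ops import roi_pool

def extract_actor_features(clip, boxes, i3d_up_to_mixed4f, i3d_after_mixed4f,
                           output_size=(7, 7)):
    # clip:  (1, 3, T_in, H_in, W_in) RGB clip (batch size 1)
    # boxes: (num_actors, 4) actor boxes, already scaled to the Mixed_4f feature map
    # i3d_up_to_mixed4f / i3d_after_mixed4f: placeholder callables for the two
    # halves of the backbone (they do not exist with these names in this repo)
    feat_map = i3d_up_to_mixed4f(clip)            # (1, CH, T, H, W)

    pooled_per_frame = []
    for t in range(feat_map.shape[2]):
        frame_feat = feat_map[:, :, t]            # (1, CH, H, W)
        # roi_pool accepts a list with one (num_boxes, 4) tensor per batch element
        pooled = roi_pool(frame_feat, [boxes], output_size=output_size)
        pooled_per_frame.append(pooled)           # (num_actors, CH, H', W')

    # Same box through time (straight spatio-temporal tube): stack on the time axis
    actor_maps = torch.stack(pooled_per_frame, dim=2)  # (num_actors, CH, T, H', W')

    # Forward each actor through the remaining I3D layers (e.g. down to 1024-d)
    return i3d_after_mixed4f(actor_maps)
```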

Note: I suggest you have a look at RoIPool's arguments here: https://github.com/pytorch/vision/blob/master/torchvision/ops/roi_pool.py. If you are using a batch size > 1, you need to tell RoIPool which element of the batch each box belongs to (in that case the input boxes have 5 coordinates, where the first one is the index of the clip in the batch).
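
For instance (purely illustrative numbers), with a batch of 2 clips the boxes passed to roi_pool would look like this:

```python
import torch
from torchvision.ops import roi_pool

feature_maps = torch.randn(2, 832, 28, 28)  # (BS, CH, H, W): one frame from each of 2 clips

# Each row is (batch_index, x1, y1, x2, y2) in feature-map coordinates
boxes = torch.tensor([
    [0.,  3.,  4., 15., 20.],   # an actor in clip 0
    [0., 10.,  2., 25., 27.],   # another actor in clip 0
    [1.,  5.,  5., 20., 22.],   # an actor in clip 1
])

pooled = roi_pool(feature_maps, boxes, output_size=(7, 7))
print(pooled.shape)  # torch.Size([3, 832, 7, 7]): one pooled map per box
```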

Let me know if this helps you. Matteo

tsminh commented 4 years ago

It's an awesome answer, @matteot11.

I just realized that:

Thank you so much for making it clear to me.

I appreciate your kindness Matteo, for spending time to help me with this.

I am spending more time on this to understand it more clearly. Thank you so much for your help and your amazing work.

Best regards

Minh

matteot11 commented 4 years ago

I'm glad to have helped you in some way. Do not hesitate to ask if you have any other doubts!

Matteo