dingfengshi / TriDet

[CVPR2023] Code for the paper, TriDet: Temporal Action Detection with Relative Boundary Modeling

How to test video #14

Closed syg1996419 closed 1 year ago

syg1996419 commented 1 year ago

How to test video

dingfengshi commented 1 year ago

Hi! If you want to use video as input, you need to use a pretrained video recognition network (e.g., SlowFast, TSP) to extract the temporal features of the video first, then feed the feature sequence into TriDet.

EmreOzkose commented 1 year ago

Hi @dingfengshi, thanks for sharing your excellent work.

Can I use this model for slowfast? If not, which model did you use?

dingfengshi commented 1 year ago

> Hi @dingfengshi, thanks for sharing your excellent work.
>
> Can I use this model for slowfast? If not, which model did you use?

Hi, it seems OK if you use this code. I guess you can remove the classification head with model.blocks[-1]=torch.nn.Identity() and extract the feature for each snippet.

EmreOzkose commented 1 year ago

Great, I will try it today. I can compare features from PyTorchVideo with the features from the ActionFormer repo to ensure they are the same or close.

EmreOzkose commented 1 year ago

When I use model.blocks[-1] = torch.nn.Identity(), the output shape is torch.Size([1, 2304, 1, 2, 2]). I think it should be model.blocks[6].proj = torch.nn.Identity(), so the output shape is torch.Size([1, 2304]). The last block includes a pooling layer:

(6): ResNetBasicHead(
  (dropout): Dropout(p=0.5, inplace=False)
  (proj): Identity()
  (output_pool): AdaptiveAvgPool3d(output_size=1)
)
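
For reference, a minimal extraction sketch along those lines, assuming the slowfast_r50 checkpoint from the PyTorchVideo torch.hub repo (the 8/32-frame, 256x256 input shapes are the standard Kinetics inference settings, not something specified in this thread):

    import torch

    # Load SlowFast-R50 (Kinetics-400 pretrained) from the PyTorchVideo hub.
    model = torch.hub.load('facebookresearch/pytorchvideo', 'slowfast_r50', pretrained=True)
    model = model.eval()

    # Drop only the 400-way classification projection; the head's dropout and
    # output pooling stay, so the forward pass returns the pooled 2304-d feature.
    model.blocks[6].proj = torch.nn.Identity()

    # SlowFast takes a [slow, fast] pathway pair: 8 frames for the slow path,
    # 32 frames for the fast path (dummy 256x256 clip here).
    slow = torch.randn(1, 3, 8, 256, 256)
    fast = torch.randn(1, 3, 32, 256, 256)
    with torch.no_grad():
        feat = model([slow, fast])
    print(feat.shape)  # torch.Size([1, 2304])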
EmreOzkose commented 1 year ago

In the EPIC-Kitchens features from the ActionFormer repo, the feature shape for P01_01.mp4 (given its duration) is (3097, 2304).

Important detail from repo:

Details: The features are extracted from the SlowFast model pretrained on the training set of EPIC Kitchens 100 (action classification) using clips of 32 frames at a frame rate of 30 fps and a stride of 16 frames. This gives one feature vector per 16/30 ~= 0.5333 seconds.

So what I understand is:

  1. Change the fps of all videos from 60 to 30.
  2. Extract a (1, 2304) feature for every 32-frame clip.
  3. Repeat step 2 with a window length of 32 and a stride of 16.

For example, if we have a 1-minute video at 30 fps, we have 60 * 30 = 1800 frames. The number of extracted features ≈ (1800 - 32) / 16 ≈ 110. Hence the final feature size ≈ (110, 2304).
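
A sketch of that windowing loop, assuming the head-stripped model from above and frames already decoded at 30 fps; extract_video_features and its frame layout are placeholders for illustration, not code from this repo:

    import torch

    WINDOW, STRIDE, ALPHA = 32, 16, 4  # clip length, stride, slow/fast frame ratio

    def extract_video_features(model, frames):
        # frames: (3, T, H, W) float tensor of a video resampled to 30 fps.
        feats = []
        for start in range(0, frames.shape[1] - WINDOW + 1, STRIDE):
            fast = frames[:, start:start + WINDOW].unsqueeze(0)  # (1, 3, 32, H, W)
            slow = fast[:, :, ::ALPHA]                           # (1, 3, 8, H, W)
            with torch.no_grad():
                feats.append(model([slow, fast]))                # each (1, 2304)
        return torch.cat(feats, dim=0)  # (num_windows, 2304), one row per 16/30 s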

EmreOzkose commented 1 year ago

I couldn't find a SlowFast model which is pretrained on EPIC-Kitchens. Which model did you use?

EDIT: You already wrote above that this model is OK. So you didn't use an EPIC-Kitchens-pretrained model like ActionFormer did; you just used the Kinetics-pretrained SlowFast.

dingfengshi commented 1 year ago

> I couldn't find a SlowFast model which is pretrained on EPIC-Kitchens. Which model did you use?
>
> EDIT: You already wrote above that this model is OK. So you didn't use an EPIC-Kitchens-pretrained model like ActionFormer did; you just used the Kinetics-pretrained SlowFast.

Thank you for your detailed instruction. We do not do end-to-end training with the SlowFast backbone, and for EPIC-Kitchens we directly use the features from the ActionFormer repo for convenience, so we do not have pretrained SlowFast weights for EPIC-Kitchens. But if you want to test on your own dataset, I think the above model is OK (finetuning on your own dataset is better).

EmreOzkose commented 1 year ago

I asked for the model in the ActionFormer repo and found it in another repo :). I tried both the Kinetics- and EPIC-pretrained models. I think EPIC is better than Kinetics (I am in the cooking domain). I just tried a few random cooking videos from YouTube, but the model usually gives very short segments (like 1-5 seconds).

I will make a PR for feature extraction just to show the code; people can use it from that PR.

EmreOzkose commented 1 year ago

I realized that when I use random cooking videos from YouTube, the labels are wrong for both verbs and nouns. The segments include actions, but the labels are not correct. I will dive into the pre-processing and make a PR later.

dingfengshi commented 1 year ago

> I realized that when I use random cooking videos from YouTube, the labels are wrong for both verbs and nouns. The segments include actions, but the labels are not correct. I will dive into the pre-processing and make a PR later.

OK, thank you for your contribution!

elmsamcht2189 commented 8 months ago

> In the EPIC-Kitchens features from the ActionFormer repo, the feature shape for P01_01.mp4 (given its duration) is (3097, 2304).
>
> Important detail from the repo:
>
> Details: The features are extracted from the SlowFast model pretrained on the training set of EPIC Kitchens 100 (action classification) using clips of 32 frames at a frame rate of 30 fps and a stride of 16 frames. This gives one feature vector per 16/30 ~= 0.5333 seconds.
>
> So what I understand is:
>
>   1. Change the fps of all videos from 60 to 30.
>   2. Extract a (1, 2304) feature for every 32-frame clip.
>   3. Repeat step 2 with a window length of 32 and a stride of 16.
>
> For example, if we have a 1-minute video at 30 fps, we have 60 * 30 = 1800 frames. The number of extracted features ≈ (1800 - 32) / 16 ≈ 110. Hence the final feature size ≈ (110, 2304).

    # convert time stamp (in second) into temporal feature grids
    # ok to have small negative values here
    if video_item['segments'] is not None:
        segments = torch.from_numpy(
            (video_item['segments'] * video_item['fps'] - 0.5 * self.num_frames) / feat_stride
        )
        labels = torch.from_numpy(video_item['labels'])
    else:
        segments, labels = None, None
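
A worked example of that conversion, plugging in the EPIC-Kitchens settings discussed in this thread (fps = 30, num_frames = 32, feat_stride = 16; the segment times below are made up):

    fps, num_frames, feat_stride = 30.0, 32, 16

    # A hypothetical segment from 10.0 s to 12.5 s:
    start_sec, end_sec = 10.0, 12.5

    # Subtracting half a clip (0.5 * 32 = 16 frames) aligns each timestamp with
    # the center of the 32-frame window that produced the feature.
    start_grid = (start_sec * fps - 0.5 * num_frames) / feat_stride  # 17.75
    end_grid = (end_sec * fps - 0.5 * num_frames) / feat_stride      # 22.4375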