Closed syg1996419 closed 1 year ago
Hi! If you want to use video as input, you need to use a pretrained video recognition network (e.g. SlowFast, TSP) to extract the temporal features of the video first, then feed the feature sequence into TriDet.
Hi @dingfengshi, thanks for sharing your excellent work.
Can I use this model for slowfast? If not, which model did you use?
Hi, it seems OK if you use this code. I guess you can remove the classification head with `model.blocks[-1] = torch.nn.Identity()` and extract the feature for each snippet.
Great, I will try it today. I can compare features from PyTorchVideo with features from the ActionFormer repo to make sure they are the same (or close).
When I use `model.blocks[-1] = torch.nn.Identity()`, the output shape is `torch.Size([1, 2304, 1, 2, 2])`. I think it should be `model.blocks[6].proj = torch.nn.Identity()`, so the output shape is `torch.Size([1, 2304])`. The last block includes a pooling layer:
```
(6): ResNetBasicHead(
  (dropout): Dropout(p=0.5, inplace=False)
  (proj): Identity()
  (output_pool): AdaptiveAvgPool3d(output_size=1)
)
```
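For reference, a minimal sketch of the corrected head removal (a sketch under assumptions: the `slowfast_r50` hub model name and the `snippet_starts` windowing helper are mine; the thread does not include extraction code):

```python
def snippet_starts(total_frames, clip_len=32, stride=16):
    """Start indices of the sliding clip windows used for feature extraction."""
    return list(range(0, total_frames - clip_len + 1, stride))


def build_feature_extractor():
    """Load a Kinetics-pretrained SlowFast model and strip its classifier.

    Requires torch and the pytorchvideo hub; defined but not called here so
    the sketch stays lightweight.
    """
    import torch  # heavy dependency, imported lazily

    model = torch.hub.load('facebookresearch/pytorchvideo', 'slowfast_r50',
                           pretrained=True).eval()
    # Replace only the final projection: the head's output_pool is kept, so
    # each clip yields a flat (1, 2304) vector rather than the
    # (1, 2304, 1, 2, 2) reported above for model.blocks[-1] = Identity().
    model.blocks[6].proj = torch.nn.Identity()
    return model


# Example: a 1-minute 30 fps video has 1800 frames -> 111 clip windows.
print(len(snippet_starts(1800)))  # 111
```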
In the EPIC-Kitchens features from the ActionFormer repo, `P01_01.mp4` has a feature shape of (3097, 2304).
Important detail from the repo:
Details: The features are extracted from the SlowFast model pretrained on the training set of EPIC Kitchens 100 (action classification) using clips of 32 frames at a frame rate of 30 fps and a stride of 16 frames. This gives one feature vector per 16/30 ~= 0.5333 seconds.
So what I understand is:

- Change all videos' fps from 60 to 30.
- Extract a (1, 2304) feature for every 32 frames.
- Repeat step 2 with a window length of 32 and a stride of 16.

For example, if we have a 1-minute video, we have 60 * 30 = 1800 frames. The number of extracted features = (1800 - 32) / 16 + 1 ~= 111. Hence the final feature size = (111, 2304).
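That arithmetic can be sanity-checked in a few lines (pure Python; `num_features` is an illustrative helper of mine, not from either repo):

```python
def num_features(duration_s, fps=30, clip_len=32, stride=16):
    """Number of sliding-window features a video yields."""
    total_frames = int(duration_s * fps)
    return (total_frames - clip_len) // stride + 1


# One feature every 16/30 seconds:
print(16 / 30)            # 0.5333...
# A 1-minute video at 30 fps has 1800 frames:
print(num_features(60))   # 111
# P01_01 has 3097 feature rows, implying roughly 3097 * 16/30 s of video:
print(3097 * 16 / 30)     # ~1651.7 s (~27.5 minutes)
```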
I couldn't find a SlowFast model that is pretrained on EPIC-Kitchens. Which model did you use?
EDIT: You already wrote above that this model is OK. So you didn't use an EPIC-Kitchens-pretrained model like ActionFormer; you just used the Kinetics-pretrained SlowFast.
Thank you for your detailed instruction. We do not do end-to-end training with the SlowFast backbone, and for EPIC-Kitchens we directly use the features from the ActionFormer repo for convenience; we do not have pretrained SlowFast weights for EPIC-Kitchens. But if you want to test on your own dataset, I think the above model is OK (finetuning on your own dataset is better).
I asked for the model in the ActionFormer repo and found it in another repo :). I tried both the Kinetics- and EPIC-pretrained models. I think EPIC is better than Kinetics (I am in the cooking domain). I just tried a few random cooking videos from YouTube, but the model usually gives very short segments (like 1-5 seconds).
I will make a PR with the feature extraction code so people can use it.
I realized that when I use random cooking videos from YouTube, the labels are wrong for both verbs and nouns. The segments include actions, but the labels are not correct. I will dive into the pre-processing and make a PR later.
OK, thank you for your contribution!
```python
# convert time stamp (in second) into temporal feature grids
# ok to have small negative values here
if video_item['segments'] is not None:
    segments = torch.from_numpy(
        (video_item['segments'] * video_item['fps'] - 0.5 * self.num_frames) / feat_stride
    )
    labels = torch.from_numpy(video_item['labels'])
else:
    segments, labels = None, None
```
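To make that conversion concrete (the helper name is mine; the constants are the EPIC-Kitchens settings quoted above: 30 fps, 32-frame clips, stride 16):

```python
def seconds_to_grid(t, fps=30.0, num_frames=32, feat_stride=16):
    """Map a timestamp in seconds to a (fractional) feature-grid coordinate.

    Each feature covers a 32-frame clip, so subtracting 0.5 * num_frames
    aligns the timestamp with clip centers before dividing by the stride.
    """
    return (t * fps - 0.5 * num_frames) / feat_stride


# An action annotated at t = 10 s lands at grid position (300 - 16) / 16:
print(seconds_to_grid(10.0))  # 17.75
# Very early timestamps give small negative values, as the comment allows:
print(seconds_to_grid(0.5))   # -0.0625
```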
How to test video