UARK-AICV / AOE-Net

[IJCV] AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation
https://arxiv.org/abs/2210.02578

.json files #6

Closed: Rui-hue closed this issue 11 months ago

Rui-hue commented 1 year ago

Can you share the code for generating the feature data .json files and the label data .json files? I don't know how to generate data in this format.

vhvkhoa commented 1 year ago

Hi Rui-hue,

I modified the SlowFast source code to extract the agent features and environment features. The model I used is SlowOnly_8x8_R50: https://github.com/facebookresearch/SlowFast

Because the code is not cleaned up, I haven't published it. I found a repo that does feature extraction with SlowFast: https://github.com/tridivb/slowfast_feature_extractor. Maybe you'd want to use it.
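In case it helps as a starting point, here is a minimal, generic sketch of per-snippet feature extraction. It is not the modified SlowFast code: `backbone` is a placeholder for a SlowOnly-style model that returns one pooled vector per clip, and the 16-frame snippet size is taken from the discussion further down in this thread.

```python
# Generic per-snippet feature extraction sketch (placeholder backbone, not the
# modified SlowFast code referenced above).
import torch

@torch.no_grad()
def extract_snippet_features(frames: torch.Tensor, backbone, snippet_size: int = 16):
    """frames: (T, C, H, W) video tensor -> (num_snippets, feat_dim) features."""
    feats = []
    for start in range(0, frames.shape[0] - snippet_size + 1, snippet_size):
        clip = frames[start:start + snippet_size]        # (16, C, H, W)
        clip = clip.permute(1, 0, 2, 3).unsqueeze(0)     # (1, C, 16, H, W), SlowFast layout
        feats.append(backbone(clip).squeeze(0))          # assumed pooled clip feature
    return torch.stack(feats)
```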

Rui-hue commented 1 year ago

May I ask whether "segment": [0, 16] in the environment feature data means that the corresponding "features" are the snippet feature extracted from the video segment covering frames 0 to 16?

vhvkhoa commented 12 months ago

yes, but it was supposed to be used only for my debugging.
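Based on this exchange, one video's entry in the environment-feature .json seems to look roughly like the sketch below. Only the "segment" and "features" keys are confirmed in this thread; the video-id key and the 2048-dimensional feature size (the SlowOnly R50 pooled dimension) are assumptions.

```python
# Rough sketch of the assumed environment-feature .json layout (illustrative only).
import json

env_feature_entry = {
    "v_example": [                                        # hypothetical video id
        {"segment": [0, 16],  "features": [0.0] * 2048},  # snippet over frames 0-16
        {"segment": [16, 32], "features": [0.0] * 2048},  # next 16-frame snippet
    ]
}

with open("env_features_example.json", "w") as f:
    json.dump(env_feature_entry, f)
```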

67587597 commented 12 months ago

> yes, but it was supposed to be used only for my debugging.

Hi, thank you for your work. I want to concatenate features of another modality with the environment features. I assumed that each video was segmented into 100 segments and the features were then extracted for each segment, but your reply here and the method mentioned in the paper,

> divide V into a sequence of δ-frame snippets.

are a bit confusing. Did you fix the sampling rate for all segments, or did you focus on producing a fixed number of segments with a varying sampling rate? Thank you.

vhvkhoa commented 11 months ago

> Did you fix the sampling rate for all segments, or did you focus on producing a fixed number of segments with a varying sampling rate?

I think the segments and snippets in your comment are the same thing. Sorry for the confusion; I should have used "snippet" in the .json files to be more precise.

For ActivityNet, I rescaled all videos to 1600 frames, then extracted features using the SlowFast source code (SlowOnly model) with a snippet size of 16. So every video is represented by a sequence of 100 features.
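For concreteness, here is a minimal sketch (not the authors' code) of how rescaling every video to 1600 frames with a snippet size of 16 yields exactly 100 snippets per video. Mapping snippet boundaries back to original frame indices via linear resampling is an assumption about how the rescaling could be implemented.

```python
# Sketch: rescale each video to 1600 frame positions and split them into 100
# non-overlapping 16-frame snippets (assumed linear resampling).
import numpy as np

def snippet_boundaries(num_frames: int, rescaled_len: int = 1600, snippet_size: int = 16):
    """Return (start, end) original-frame indices covered by each snippet."""
    rescaled_to_orig = np.linspace(0, num_frames - 1, rescaled_len).round().astype(int)
    boundaries = []
    for s in range(0, rescaled_len, snippet_size):
        snippet = rescaled_to_orig[s:s + snippet_size]
        boundaries.append((int(snippet[0]), int(snippet[-1]) + 1))
    return boundaries

print(len(snippet_boundaries(num_frames=4321)))  # -> 100 snippets per video
```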