UARK-AICV / AOE-Net

[IJCV] AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation
https://arxiv.org/abs/2210.02578

data processing #7

Open try-harder12 opened 10 months ago

try-harder12 commented 10 months ago

May I ask what network is used to extract environmental features and participant features, what is the format of the extracted data, and how is it converted into the current data input format of the aoe-net network? Could you please show me some details of data processing?

vhvkhoa commented 10 months ago

Thank you for your interest. I wrote some code on top of SlowFast to extract features for ActivityNet and THUMOS-14.

I haven't had time to clean the feature extraction code, so I can't publish it yet. However, all of the extracted features (environment, actors, and objects) for ActivityNet and THUMOS are available in this repo.

try-harder12 commented 10 months ago

Thank you for your answer. I want to apply your work to my own dataset, but it seems very difficult. I really need detailed information about the data processing.

vhvkhoa commented 10 months ago

For the videos in ActivityNet, I simply rescale them to 1600 frames and extract features with a window size of 16 frames, so that each video is represented by a sequence of 100 features. For THUMOS-14, the videos are much longer, with small ground-truth action segments. So I use a sliding window of 128×16 frames with a stride of 64×16 frames; each video is then represented by multiple splits, each containing 128 features.
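The splitting scheme described above can be sketched as follows. This is my reading of the comment, not the authors' actual preprocessing code; function names and defaults are illustrative:

```python
# Sketch of the snippet/windowing scheme described in the comment above
# (assumed interpretation, not the repo's real feature-extraction code).

def activitynet_snippets(total_frames=1600, snippet_len=16):
    """Split a video rescaled to `total_frames` into consecutive,
    non-overlapping 16-frame snippets (one feature per snippet)."""
    return [(s, s + snippet_len) for s in range(0, total_frames, snippet_len)]

def thumos_windows(num_snippets, window=128, stride=64):
    """Sliding windows over a video's snippet sequence, measured in
    snippet units: each split covers 128 snippets, starts 64 apart."""
    starts = range(0, max(num_snippets - window, 0) + 1, stride)
    return [(s, s + window) for s in starts]

print(len(activitynet_snippets()))  # 100 snippets -> 100 features per video
print(thumos_windows(256)[:3])      # [(0, 128), (64, 192), (128, 256)]
```

With these numbers, a rescaled ActivityNet video always yields exactly 100 features, while a THUMOS-14 video with 256 snippets yields three 128-snippet splits.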

The above processing for both datasets follows the successful experimental setups of BMN and G-TAD. However, I have observed that more recent temporal action detection methods can work on the original, unrescaled videos in both datasets.

try-harder12 commented 9 months ago

Thank you for your answer. Is the sliding window non-overlapping when handling THUMOS-14?

vhvkhoa commented 9 months ago

> Thank you for your answer. Is the sliding window non overlapping for handling thumos14?

Sorry, I think my last comment got misrendered. I use a sliding window of 128 snippets with a stride of 64 snippets, where each snippet is a sequence of 16 consecutive frames. The snippets themselves are non-overlapping, but consecutive windows overlap by half (64 snippets).