JacobChalk / TIM

Codebase for the paper: "TIM: A Time Interval Machine for Audio-Visual Action Recognition"

Questions about training #20

Closed ssp789 closed 3 months ago

ssp789 commented 3 months ago

Hello, is there a mismatch between the EPIC training script you provided and the dataset?

python scripts/run_net.py \
    --train \
    --output_dir /path/to/output \
    --video_data_path /path/to/epic_visual_features \
    --video_train_action_pickle /path/to/epic_100_train_annotations \
    --video_val_action_pickle /path/to/epic_100_validation_annotations \
    --video_train_context_pickle /path/to/epic_100_train_visual_feature_intervals \
    --video_val_context_pickle /path/to/epic_100_validation_visual_feature_intervals \
    --visual_input_dim \
    --audio_data_path /path/to/epic_audio_features \
    --audio_train_action_pickle /path/to/epic_sounds_train_annotations \
    --audio_val_action_pickle /path/to/epic_sounds_validation_annotations \
    --audio_train_context_pickle /path/to/epic_sounds_train_audio_feature_intervals \
    --audio_val_context_pickle /path/to/epic_sounds_validation_audio_feature_intervals \
    --audio_input_dim \
    --video_info_pickle /path/to/epic_kitchens_video_metadata \
    --lambda_audio 0.01

The arguments I am unsure about are --video_train_context_pickle, --video_val_context_pickle and --visual_input_dim on the visual side, and --audio_train_context_pickle, --audio_val_context_pickle, --audio_input_dim and --video_info_pickle on the audio side. Are these not provided? If they are, which files in the dataset should I use for them? Thank you for your reply.

JacobChalk commented 3 months ago

Hi,

The context files are provided alongside the ground truth files here. These cover your train_context and val_context pickle files for both the visual and audio modalities.

For visual_input_dim and audio_input_dim, the values depend on which features you have extracted. Our audio features were Auditory SlowFast, hence audio_input_dim=2304. If you extracted only Omnivore or only VideoMAE, then visual_input_dim=1024. If you merged them along the channel dimension as we did, then visual_input_dim=2048.
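As a rough illustration of the arithmetic above, the sketch below derives the two input dimensions from the chosen feature set and shows that concatenating two 1024-channel features along the channel axis gives 2048 channels. The helper names (`input_dim`, `concat_channels`), the dictionary keys, and the toy list-of-lists layout are hypothetical and only for illustration; how the TIM codebase stores or merges features is not shown in this thread. Only the channel sizes (1024, 1024, 2304) come from the reply above.

```python
# Per-feature channel sizes as stated in the reply above.
# The dictionary keys are illustrative labels, not identifiers from the TIM codebase.
FEATURE_DIMS = {
    "omnivore": 1024,
    "videomae": 1024,
    "auditory_slowfast": 2304,
}


def input_dim(feature_names):
    """Channel size after concatenating the named features along the channel axis."""
    return sum(FEATURE_DIMS[name] for name in feature_names)


def concat_channels(*feature_sets):
    """Concatenate per-segment feature vectors (lists of lists) along the channel axis."""
    num_segments = len(feature_sets[0])
    if any(len(fs) != num_segments for fs in feature_sets):
        raise ValueError("all feature sets must cover the same number of segments")
    return [
        sum((list(fs[i]) for fs in feature_sets), [])
        for i in range(num_segments)
    ]


# Toy data: 3 segments of Omnivore and VideoMAE features (values are placeholders).
omnivore = [[0.0] * 1024 for _ in range(3)]
videomae = [[1.0] * 1024 for _ in range(3)]
merged = concat_channels(omnivore, videomae)

visual_input_dim = input_dim(["omnivore", "videomae"])  # merged visual features
audio_input_dim = input_dim(["auditory_slowfast"])      # audio features

print(visual_input_dim, audio_input_dim, len(merged[0]))  # prints: 2048 2304 2048
```

With a single visual feature, `input_dim(["omnivore"])` gives 1024, matching the unmerged case described above.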

JaesungHuh commented 3 months ago

You can find the context files here.

The feature dimensions depend on which visual and audio features you are using. If you use Omnivore + ASlowFast, visual_input_dim should be 1024 and audio_input_dim should be 2304.

ssp789 commented 3 months ago

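Hi,

The context files are provided alongside the ground truth files here. These cover the train_context and val_context pickle files for both the visual and audio modalities.

For visual_input_dim and audio_input_dim, the values depend on which features you have extracted. Our audio features were Auditory SlowFast, hence audio_input_dim=2304. If you extracted only Omnivore or only VideoMAE, then visual_input_dim=1024. If you merged them along the channel dimension as we did, then visual_input_dim=2048.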

Thank you for your reply.

ssp789 commented 3 months ago

You can find the context files here.

The feature dimensions depend on which visual and audio features you are using. If you use Omnivore + ASlowFast, visual_input_dim should be 1024 and audio_input_dim should be 2304.

Thank you for your reply.