JacobChalk / TIM

Codebase for the paper: "TIM: A Time Interval Machine for Audio-Visual Action Recognition"

Input parameter setting #17

Closed · hongminglin08 closed this issue 1 day ago

hongminglin08 commented 2 days ago

When I want to train on the EPIC or Perception Test dataset using Omnivore + Auditory SlowFast features, how should I set parameters such as feat_stride, feat_gap, num_feats, feat_dropout, seq_dropout, apply_feature_pooling, lambda_audio, lambda_drloc, and mixup_alpha?

JacobChalk commented 2 days ago

Hi,

This depends on the size of the input window you wish to feed in. To replicate our results, you can follow the README here for the different configurations (note how we specify --feat_stride 2 and so on for Perception Test, and --lambda_audio 0.01 for EPIC).

Any values that aren't specified are left at their defaults, which you can find in the parser file at recognition/time_interval_machine/utils/parser.py.
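
If it helps, one quick way to list every default in one place is to build the parser and parse an empty argument list. The sketch below is only illustrative: the stand-in `build_parser` and the default values shown are placeholders, not the actual names or values in parser.py.

```python
# Minimal sketch for listing every default value in one place.
# The real parser lives in recognition/time_interval_machine/utils/parser.py;
# `build_parser` and the defaults below are placeholders, not the repo's
# actual names or values.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="TIM options (illustrative subset)")
    parser.add_argument("--feat_stride", type=int, default=1)       # placeholder default
    parser.add_argument("--lambda_audio", type=float, default=1.0)  # placeholder default
    parser.add_argument("--mixup_alpha", type=float, default=0.2)   # placeholder default
    return parser

if __name__ == "__main__":
    # Parsing an empty argument list leaves every option at its default,
    # so printing the namespace shows the full default configuration.
    args = build_parser().parse_args([])
    for name, value in sorted(vars(args).items()):
        print(f"{name} = {value}")
```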

Hope this helps!

hongminglin08 commented 2 days ago

Is it OK to use the following command? The results we get after training with it are quite different from the results in the paper. Are any of our parameter settings inaccurate?

python scripts/run_net.py \
  --train \
  --output_dir /output \
  --video_data_path /dataset/Perception1/video/ \
  --video_train_action_pickle /dataset/annotations/Perception_Test/Perception_Test_Action_train.pkl \
  --video_val_action_pickle /dataset/annotations/Perception_Test/Perception_Test_Action_validation.pkl \
  --video_train_context_pickle /dataset/annotations/Perception_Test/Perception_Test_1_second_train_feature_times.pkl \
  --video_val_context_pickle /dataset/annotations/Perception_Test/Perception_Test_1_second_validation_feature_times.pkl \
  --visual_input_dim 1024 \
  --audio_data_path /dataset/Perception1/audio/ \
  --audio_train_action_pickle /dataset/annotations/Perception_Test/Perception_Test_Sound_train.pkl \
  --audio_val_action_pickle /dataset/annotations/Perception_Test/Perception_Test_Sound_validation.pkl \
  --audio_train_context_pickle /dataset/annotations/Perception_Test/Perception_Test_1_second_train_feature_times.pkl \
  --audio_val_context_pickle /dataset/annotations/Perception_Test/Perception_Test_1_second_validation_feature_times.pkl \
  --audio_input_dim 2304 \
  --video_info_pickle /dataset/annotations/Perception_Test/Perception_Test_video_info.pkl \
  --dataset perception \
  --feat_stride 2 \
  --feat_dropout 0.1 \
  --seq_dropout 0.1 \
  --include_verb_noun False

JacobChalk commented 2 days ago

This looks correct to me. Is your conda environment the same as the one specified here? We found that results can change depending on the version of PyTorch used, and there may also be some variance in results between GPUs.
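
For reference, a quick way to compare environments is to print the relevant versions on both machines and diff the output. The snippet below uses only standard PyTorch calls and is not specific to TIM.

```python
# Quick environment comparison: run this on both machines and diff the output.
# These are standard PyTorch calls, nothing specific to TIM.
import torch

print("torch version:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Optional: reduces (but does not eliminate) run-to-run variance on a GPU.
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```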

hongminglin08 commented 2 days ago

The GPU I use is an A100. The Audio Acc reaches more than 50%, but the Visual Action Acc only reaches about 35%, which is quite different from the 61% reported in the paper.

JacobChalk commented 2 days ago

This could potentially be attributed to the visual features. Could you detail how you extracted the Omnivore features, i.e. which pre-trained model, how many augmented feature sets, and so on?

hongminglin08 commented 2 days ago

We extracted the visual features by following the steps you provided for Omnivore. The training set uses BATCH_SIZE=4, TEST.ENABLE=True, and NUM_FEATURES=4, while the test set changes NUM_FEATURES to 1.

JacobChalk commented 2 days ago

I've just compared all the default config values against our reported training config for the model, and they are all correct. The script you provided earlier is also exact, so the training configuration isn't the problem.

As a sanity check, if you run our pre-trained model here, does it achieve similar results to those stated in the paper? If not, I believe we can determine that the input features are the issue.
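
It may also be worth sanity-checking the extracted features against the dimensions you pass on the command line. A minimal sketch, assuming the features are stored as NumPy arrays (the file paths are illustrative; adapt the loading code to however your features are actually saved):

```python
# Hedged sanity check on the extracted features: confirm the last dimension
# matches the flags passed to run_net.py (--visual_input_dim 1024 for
# Omnivore, --audio_input_dim 2304 for Auditory SlowFast). The .npy paths
# are illustrative; adjust the loading code to your actual storage format.
import numpy as np

visual_feats = np.load("/dataset/Perception1/video/example_video.npy")  # illustrative path
audio_feats = np.load("/dataset/Perception1/audio/example_audio.npy")   # illustrative path

print("visual feature shape:", visual_feats.shape)
print("audio feature shape:", audio_feats.shape)

assert visual_feats.shape[-1] == 1024, "visual features do not match --visual_input_dim 1024"
assert audio_feats.shape[-1] == 2304, "audio features do not match --audio_input_dim 2304"
```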

hongminglin08 commented 1 day ago

You are right, there were errors in the validation dataset. I will reprocess the data. Thank you for your guidance.