Closed hongminglin08 closed 1 day ago
Hi,
This depends on the size of the input window you wish to feed in. To replicate our results, you can follow the README here for the different configurations (note how we specify --feat_stride 2 and so on for Perception Test, and --lambda_audio 0.01 for EPIC).
Any values you don't specify are left at their defaults. To see what these are, refer to the parser file at recognition/time_interval_machine/utils/parser.py.
Hope this helps!
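To illustrate how unspecified flags fall back to parser defaults, here is a minimal sketch using Python's standard argparse. The flag names --feat_stride and --lambda_audio come from this thread, but the default values and the helper names (build_parser, overridden) are placeholders for illustration only. They are not the repository's actual values, so check parser.py for those.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names are taken from the thread; the defaults below are
    # hypothetical placeholders, NOT the repository's real defaults.
    parser = argparse.ArgumentParser()
    parser.add_argument("--feat_stride", type=int, default=1)
    parser.add_argument("--lambda_audio", type=float, default=1.0)
    return parser


def overridden(parser: argparse.ArgumentParser, argv: list) -> dict:
    """Return only the arguments whose parsed value differs from the default."""
    args = parser.parse_args(argv)
    return {
        name: value
        for name, value in vars(args).items()
        if value != parser.get_default(name)
    }


if __name__ == "__main__":
    # Only --feat_stride is given, so --lambda_audio keeps its default.
    print(overridden(build_parser(), ["--feat_stride", "2"]))
```

Listing only the overridden values this way makes it easier to compare a command against a reported training configuration.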
Is it OK to use the following command? The results we get after training with it are quite different from the results in the paper. Are any of our parameter settings incorrect?
python scripts/run_net.py \
    --train \
    --output_dir /output \
    --video_data_path /dataset/Perception1/video/ \
    --video_train_action_pickle /dataset/annotations/Perception_Test/Perception_Test_Action_train.pkl \
    --video_val_action_pickle /dataset/annotations/Perception_Test/Perception_Test_Action_validation.pkl \
    --video_train_context_pickle /dataset/annotations/Perception_Test/Perception_Test_1_second_train_feature_times.pkl \
    --video_val_context_pickle /dataset/annotations/Perception_Test/Perception_Test_1_second_validation_feature_times.pkl \
    --visual_input_dim 1024 \
    --audio_data_path /dataset/Perception1/audio/ \
    --audio_train_action_pickle /dataset/annotations/Perception_Test/Perception_Test_Sound_train.pkl \
    --audio_val_action_pickle /dataset/annotations/Perception_Test/Perception_Test_Sound_validation.pkl \
    --audio_train_context_pickle /dataset/annotations/Perception_Test/Perception_Test_1_second_train_feature_times.pkl \
    --audio_val_context_pickle /dataset/annotations/Perception_Test/Perception_Test_1_second_validation_feature_times.pkl \
    --audio_input_dim 2304 \
    --video_info_pickle /dataset/annotations/Perception_Test/Perception_Test_video_info.pkl \
    --dataset perception \
    --feat_stride 2 \
    --feat_dropout 0.1 \
    --seq_dropout 0.1 \
    --include_verb_noun False
This looks correct to me. Is your conda environment the same as the one specified here? We found that the model's results change depending on the version of torch used, among other things. There may also be variance in results between GPUs.
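One quick way to compare environments is to print the installed package versions and check them against the specified environment file. The sketch below uses only the standard library. The helper name installed_versions and the package list are assumptions chosen for illustration; substitute the packages pinned in the repository's environment specification.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Hypothetical package list; replace it with the pinned packages.
    for name, version in installed_versions(["torch", "numpy"]).items():
        print(f"{name}: {version or 'not installed'}")
```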
The GPU I use is an A100. The Audio Acc can reach more than 50%, while the Visual Action Acc only reaches about 35%, which is quite different from the 61% in the paper.
This could potentially be attributed to the visual features. Could you detail how you extracted the Omnivore features, i.e. which pre-trained model, how many augmented feature sets, and so on?
We extracted the visual features by following the steps you provided for Omnivore. The training set uses the parameters BATCH_SIZE=4, test.enable=True, and NUM_FEATURES=4, while the test set uses NUM_FEATURES=1.
I've just compared all the default config values with our reported training config for the model, and they are all correct. The script you provided earlier also matches exactly, so the training configuration isn't the problem.
As a sanity check, if you run our pre-trained model here, does it achieve similar results to those stated in the paper? If not, I believe we can determine that the input features are the issue.
You are right, there are errors in the validation dataset. I will reprocess the data. Thank you for your guidance.
When I want to train on the EPIC or Perception Test dataset using Omnivore + Auditory SlowFast features, how should I set parameters such as feat_stride, feat_gap, num_feats, feat_dropout, seq_dropout, apply_feature_pooling, lambda_audio, lambda_drloc, and mixup_alpha?