Closed WannaSir closed 3 months ago
Hi,
The shared model is trained and evaluated on Perception Test Action and Sound jointly, so it is likely not loading all the weights properly when evaluating separately. Would you be able to output the results when using --model_modality audio_visual and --data_modality audio_visual? At the very least --model_modality audio_visual is essential, as the model will have learned correlations between the two modalities.
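A quick way to confirm the partial-loading suspicion is to diff the checkpoint's keys against the model's parameter names. This is a minimal sketch only: the parameter names below are made up for illustration, and in practice you would compare the real state dicts (e.g. the dict returned by torch.load and the model's state_dict()).

```python
def report_key_coverage(model_keys, ckpt_keys):
    """Compare a model's parameter names against a checkpoint's keys."""
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    missing = model_keys - ckpt_keys      # params left at their initial values
    unexpected = ckpt_keys - model_keys   # checkpoint weights silently dropped
    return missing, unexpected

# Toy example (hypothetical names): an audio-visual checkpoint evaluated
# with a visual-only model drops the audio weights.
av_ckpt = ['visual_proj.weight', 'audio_proj.weight', 'fusion.weight']
visual_only_model = ['visual_proj.weight', 'fusion.weight']
missing, unexpected = report_key_coverage(visual_only_model, av_ckpt)
print(missing)      # set()
print(unexpected)   # {'audio_proj.weight'}
```

If "unexpected" is non-empty when evaluating a single modality, part of the jointly trained checkpoint is being ignored.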
Hi,
When I add --model_modality audio_visual and --data_modality audio_visual, the output is as follows:
[07/08 16:51:03] test.py - L221: | Epoch: [1][1/531] | Time: 5.043 | Data: 2.810 | Net: 2.203 | Visual Views Seen: 444 | Visual Loss: 3.3867 | Audio Views Seen: 268 | Audio Loss: 3.5245 | RAM: 40.05/503.51GB | GPU: 0.72/47.54GB |
[07/08 16:51:07] test.py - L221: | Epoch: [1][101/531] | Time: 0.041 | Data: 0.003 | Net: 0.010 | Visual Views Seen: 40415 | Visual Loss: 4.3145 | Audio Views Seen: 38488 | Audio Loss: 3.4950 | RAM: 40.32/503.51GB | GPU: 0.77/47.54GB |
[07/08 16:51:12] test.py - L221: | Epoch: [1][201/531] | Time: 0.042 | Data: 0.003 | Net: 0.009 | Visual Views Seen: 78867 | Visual Loss: 4.3345 | Audio Views Seen: 74103 | Audio Loss: 3.4582 | RAM: 40.49/503.51GB | GPU: 0.77/47.54GB |
[07/08 16:51:16] test.py - L221: | Epoch: [1][301/531] | Time: 0.042 | Data: 0.004 | Net: 0.008 | Visual Views Seen: 118005 | Visual Loss: 4.3236 | Audio Views Seen: 109549 | Audio Loss: 3.4540 | RAM: 40.69/503.51GB | GPU: 0.77/47.54GB |
[07/08 16:51:20] test.py - L221: | Epoch: [1][401/531] | Time: 0.042 | Data: 0.004 | Net: 0.008 | Visual Views Seen: 156457 | Visual Loss: 4.3058 | Audio Views Seen: 146523 | Audio Loss: 3.4496 | RAM: 40.85/503.51GB | GPU: 0.77/47.54GB |
[07/08 16:51:25] test.py - L221: | Epoch: [1][501/531] | Time: 0.041 | Data: 0.004 | Net: 0.007 | Visual Views Seen: 192591 | Visual Loss: 4.3113 | Audio Views Seen: 181700 | Audio Loss: 3.4455 | RAM: 40.96/503.51GB | GPU: 0.77/47.54GB |
[07/08 16:51:27] test.py - L232:
Epoch 1 Results:
==========================================
Visual Views Seen: 203778
------------------------------------------
Visual Action Acc@1 11.947
Visual Action Acc@5 30.330
------------------------------------------
Visual Loss 4.31476
==========================================
Audio Views Seen: 193115
------------------------------------------
Audio Acc@1 5.381
Audio Acc@5 42.956
------------------------------------------
Audio Loss 3.44822
==========================================
Actions Seen: 71065
==========================================
When I add --model_modality audio_visual and --data_modality audio, the output is:
[07/08 17:02:36] test.py - L221: | Epoch: [1][1/522] | Time: 4.889 | Data: 2.788 | Net: 2.084 | Audio Views Seen: 268 | Audio Loss: 3.5245 | RAM: 40.75/503.51GB | GPU: 0.49/47.54GB |
[07/08 17:02:40] test.py - L221: | Epoch: [1][101/522] | Time: 0.028 | Data: 0.001 | Net: 0.008 | Audio Views Seen: 39248 | Audio Loss: 3.4942 | RAM: 41.44/503.51GB | GPU: 0.52/47.54GB |
[07/08 17:02:46] test.py - L221: | Epoch: [1][201/522] | Time: 0.045 | Data: 0.018 | Net: 0.009 | Audio Views Seen: 75150 | Audio Loss: 3.4600 | RAM: 41.58/503.51GB | GPU: 0.52/47.54GB |
[07/08 17:02:51] test.py - L221: | Epoch: [1][301/522] | Time: 0.027 | Data: 0.001 | Net: 0.009 | Audio Views Seen: 111565 | Audio Loss: 3.4529 | RAM: 41.69/503.51GB | GPU: 0.52/47.54GB |
[07/08 17:02:57] test.py - L221: | Epoch: [1][401/522] | Time: 0.043 | Data: 0.002 | Net: 0.008 | Audio Views Seen: 148615 | Audio Loss: 3.4495 | RAM: 41.69/503.51GB | GPU: 0.52/47.54GB |
[07/08 17:03:02] test.py - L221: | Epoch: [1][501/522] | Time: 0.044 | Data: 0.016 | Net: 0.009 | Audio Views Seen: 185122 | Audio Loss: 3.4464 | RAM: 41.87/503.51GB | GPU: 0.52/47.54GB |
[07/08 17:03:04] test.py - L232:
Epoch 1 Results:
==========================================
Audio Views Seen: 193115
------------------------------------------
Audio Acc@1 5.381
Audio Acc@5 42.961
------------------------------------------
Audio Loss 3.44822
==========================================
Actions Seen: 35625
==========================================
When I add --model_modality audio_visual and --data_modality visual, the output is as follows:
[07/08 17:09:56] test.py - L221: | Epoch: [1][1/530] | Time: 3.300 | Data: 1.928 | Net: 1.351 | Visual Views Seen: 444 | Visual Loss: 3.3867 | RAM: 28.20/503.51GB | GPU: 0.53/47.54GB |
[07/08 17:09:59] test.py - L221: | Epoch: [1][101/530] | Time: 0.030 | Data: 0.001 | Net: 0.019 | Visual Views Seen: 40417 | Visual Loss: 4.3145 | RAM: 28.37/503.51GB | GPU: 0.57/47.54GB |
[07/08 17:10:02] test.py - L221: | Epoch: [1][201/530] | Time: 0.033 | Data: 0.001 | Net: 0.009 | Visual Views Seen: 78875 | Visual Loss: 4.3345 | RAM: 28.43/503.51GB | GPU: 0.57/47.54GB |
[07/08 17:10:05] test.py - L221: | Epoch: [1][301/530] | Time: 0.030 | Data: 0.001 | Net: 0.009 | Visual Views Seen: 118017 | Visual Loss: 4.3236 | RAM: 28.50/503.51GB | GPU: 0.57/47.54GB |
[07/08 17:10:09] test.py - L221: | Epoch: [1][401/530] | Time: 0.030 | Data: 0.001 | Net: 0.009 | Visual Views Seen: 156491 | Visual Loss: 4.3060 | RAM: 28.59/503.51GB | GPU: 0.57/47.54GB |
[07/08 17:10:12] test.py - L221: | Epoch: [1][501/530] | Time: 0.037 | Data: 0.001 | Net: 0.009 | Visual Views Seen: 192740 | Visual Loss: 4.3110 | RAM: 28.68/503.51GB | GPU: 0.57/47.54GB |
[07/08 17:10:13] test.py - L232:
Epoch 1 Results:
==========================================
Visual Views Seen: 203778
------------------------------------------
Visual Action Acc@1 11.950
Visual Action Acc@5 30.333
------------------------------------------
Visual Loss 4.31476
==========================================
Actions Seen: 35440
==========================================
The results above are quite different from the results reported in the paper.
The arguments and weights are now identical, and based on your logs the windows are being constructed correctly.
The final place to look is the input features, as something may have gone wrong there. Are you able to provide an example output of the Omnivore and Auditory SlowFast features for video_7723.npy
in the validation set? Here is an example of ours:
Omnivore:
>>> import numpy as np
>>> test_arr = np.load('video_7723.npy')
>>> test_arr.shape
(131, 1, 1024)
>>> test_arr[:5, :, :4]
array([
[[ 0.6521435 , -0.16762784, -0.5860762 , -0.80873215]],
[[ 1.4387008 , -0.36775348, -0.5040714 , -0.5070412 ]],
[[ 1.4224304 , -0.49926353, -0.45585755, -0.47254056]],
[[ 1.430981 , -0.47914758, -0.40739724, -0.47641215]],
[[ 1.4302788 , -0.51078564, -0.46458745, -0.45350054]]],
dtype=float32)
Auditory SlowFast:
>>> import numpy as np
>>> test_arr = np.load('video_7723.npy')
>>> test_arr.shape
(131, 1, 2304)
>>> test_arr[:5, :, :4]
array([
[[0.0278179 , 0. , 0.00105882, 0.19269007]],
[[0.10143611, 0.00118782, 0.29899448, 0.2853284 ]],
[[0.35923526, 0. , 0.21170901, 0.309517 ]],
[[0.47848836, 0. , 0.04065616, 0.4271591 ]],
[[0.50873214, 0. , 0.00262807, 0.26922128]]], dtype=float32)
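As a quick first check on a feature pair, you can verify shapes and finiteness before comparing values. This is a sketch based on the shapes shown above (1024 channels for Omnivore, 2304 for Auditory SlowFast, and 131 windows for video_7723.npy); in practice you would pass the arrays returned by np.load for the two files.

```python
import numpy as np

def check_feature_pair(visual, audio):
    """Sanity-check one video's Omnivore / Auditory SlowFast feature files."""
    assert visual.ndim == 3 and visual.shape[1:] == (1, 1024), visual.shape
    assert audio.ndim == 3 and audio.shape[1:] == (1, 2304), audio.shape
    assert np.isfinite(visual).all() and np.isfinite(audio).all(), "inf/NaN found"
    # Both backbones should produce one feature per extracted window.
    assert visual.shape[0] == audio.shape[0], "window counts differ"
    return visual.shape[0]

# e.g. check_feature_pair(np.load('omnivore/video_7723.npy'),
#                         np.load('slowfast/video_7723.npy'))
n = check_feature_pair(np.zeros((131, 1, 1024), dtype=np.float32),
                       np.zeros((131, 1, 2304), dtype=np.float32))
print(n)  # 131
```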
For transparency, here is the output log of our model/features as well:
[07/08 09:36:29] test.py - L221: | Epoch: [1][1/531] | Time: 14.524 | Data: 3.814 | Net: 9.605 | Visual Views Seen: 444 | Visual Loss: 1.9800 | Audio Views Seen: 268 | Audio Loss: 1.6494 | RAM: 43.93/503.24GB | GPU: 0.72/31.74GB |
[07/08 09:36:50] test.py - L221: | Epoch: [1][101/531] | Time: 0.118 | Data: 0.001 | Net: 0.010 | Visual Views Seen: 40415 | Visual Loss: 2.4932 | Audio Views Seen: 38488 | Audio Loss: 2.1767 | RAM: 44.20/503.24GB | GPU: 0.77/31.74GB |
[07/08 09:37:02] test.py - L221: | Epoch: [1][201/531] | Time: 0.118 | Data: 0.001 | Net: 0.010 | Visual Views Seen: 78867 | Visual Loss: 2.5077 | Audio Views Seen: 74103 | Audio Loss: 2.1528 | RAM: 44.41/503.24GB | GPU: 0.77/31.74GB |
[07/08 09:37:14] test.py - L221: | Epoch: [1][301/531] | Time: 0.118 | Data: 0.001 | Net: 0.010 | Visual Views Seen: 118005 | Visual Loss: 2.5179 | Audio Views Seen: 109549 | Audio Loss: 2.1230 | RAM: 44.59/503.24GB | GPU: 0.77/31.74GB |
[07/08 09:37:26] test.py - L221: | Epoch: [1][401/531] | Time: 0.121 | Data: 0.001 | Net: 0.010 | Visual Views Seen: 156457 | Visual Loss: 2.5370 | Audio Views Seen: 146523 | Audio Loss: 2.1170 | RAM: 44.75/503.24GB | GPU: 0.77/31.74GB |
[07/08 09:37:38] test.py - L221: | Epoch: [1][501/531] | Time: 0.119 | Data: 0.001 | Net: 0.010 | Visual Views Seen: 192591 | Visual Loss: 2.5344 | Audio Views Seen: 181700 | Audio Loss: 2.1100 | RAM: 44.87/503.24GB | GPU: 0.77/31.74GB |
[07/08 09:37:42] test.py - L232:
Epoch 1 Results:
==========================================
Visual Views Seen: 203778
------------------------------------------
Visual Action Acc@1 61.081
Visual Action Acc@5 84.664
------------------------------------------
Visual Loss 2.53442
==========================================
Audio Views Seen: 193115
------------------------------------------
Audio Acc@1 56.065
Audio Acc@5 87.363
------------------------------------------
Audio Loss 2.10893
==========================================
Actions Seen: 71065
==========================================
Here are my features:
Omnivore:
>>> import numpy as np
>>> test_arr = np.load('video_7723.npy')
>>> test_arr.shape
(131, 1, 1024)
>>> test_arr[:5, :, :5]
array([[[ 0.29367825, -0.00748634, -0.2776588 , -0.3009204 ,
0.3279833 ]],
[[ 0.25731656, -0.03482111, -0.21351193, -0.1365885 ,
0.22836693]],
[[ 0.36899644, 0.05199311, -0.21512416, -0.13552672,
0.26339605]],
[[ 0.3503914 , 0.1112382 , -0.18157834, -0.15410316,
0.28834268]],
[[ 0.37425473, 0.1081804 , -0.18313263, -0.22488765,
0.27559933]]], dtype=float32)
Auditory SlowFast:
>>> import numpy as np
>>> test_arr = np.load('video_7723.npy')
>>> test_arr.shape
(131, 1, 2304)
>>> test_arr[:5, :, :4]
array([[[0. , 0. , 0. , 0.00087965]],
[[0.03268336, 0.0721959 , 0. , 0. ]],
[[0.04737182, 0.11589228, 0. , 0.01764626]],
[[0.07842807, 0.27230534, 0.00110619, 0. ]],
[[0.07751907, 0.20392142, 0.03178129, 0.00857057]]], dtype=float32)
The results of test_arr[:5, :, :5] and test_arr[:5, :, :4] are different from yours. Why?
There is inevitably some variation between the features, as some operations in the backbones are non-deterministic, even with seeding this is not 100% fixable. This test was to see if there are any surprising changes between the two (infs and nans) etc. though it seems they are fine. For the audio backbone, did you use the pretrained model here?
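A minimal sketch of that check, assuming both extractions are loaded as numpy arrays (the jitter magnitude and thresholds below are purely illustrative, not part of the pipeline):

```python
import numpy as np

def compare_features(a, b):
    """Check two extractions of the same features for infs/NaNs and scale drift."""
    assert np.isfinite(a).all() and np.isfinite(b).all(), "inf/NaN found"
    # Exact values will differ (non-deterministic backbone ops), but the
    # overall statistics should stay in the same ballpark.
    return {
        "mean_diff": float(abs(a.mean() - b.mean())),
        "std_diff": float(abs(a.std() - b.std())),
    }

rng = np.random.default_rng(0)
a = rng.standard_normal((131, 1, 1024)).astype(np.float32)
b = a + rng.normal(0, 0.01, a.shape).astype(np.float32)  # small extraction jitter
stats = compare_features(a, b)
print(stats["mean_diff"] < 0.1, stats["std_diff"] < 0.1)  # True True
```

Large statistic gaps (or an inf/NaN assertion) would point at a genuinely different extraction rather than benign non-determinism.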
Yes, I have used it:
TEST.CHECKPOINT_FILE_PATH /data/nvme2/dky/lc/TIM/feature_extractors/auditory_slowfast/pretrained_models/asf_vggsound.pyth \
May I have the command you used to get the output log you provided?
The command we use is:
python scripts/run_net.py \
--validate \
--output_dir /path/to/output \
--video_data_path /path/to/perception_test_visual_features \
--video_train_action_pickle /path/to/perception_test_action_train_annotations \
--video_val_action_pickle /path/to/perception_test_action_validation_annotations \
--video_train_context_pickle /path/to/perception_test_action_train_visual_feature_intervals \
--video_val_context_pickle /path/to/perception_test_action_validation_visual_feature_intervals \
--visual_input_dim <channel-size-of-visual-features> \
--audio_data_path /path/to/perception_test_audio_features \
--audio_train_action_pickle /path/to/perception_test_sound_train_annotations \
--audio_val_action_pickle /path/to/perception_test_sound_validation_annotations \
--audio_train_context_pickle /path/to/perception_test_sound_train_audio_feature_intervals \
--audio_val_context_pickle /path/to/perception_test_sound_validation_audio_feature_intervals \
--audio_input_dim <channel-size-of-audio-features> \
--video_info_pickle /path/to/perception_test_video_metadata \
--dataset perception \
--feat_stride 2 \
--feat_dropout 0.1 \
--seq_dropout 0.1 \
--include_verb_noun False \
--pretrained_model /path/to/pretrained_model.pth.tar
Thanks, but it is weird: our command is the same, and the pretrained model I used is the one you provided at https://www.dropbox.com/scl/fi/xzt8rbl19cumgl0v3gl2d/percetion_test_action_sound.pth.tar?rlkey=qsd7vbpddnftpk4mjq4j8dpnm&dl=0. Is it possible that the pretrained models we used are different?
The link is this: https://www.dropbox.com/scl/fi/xzt8rbl19cumgl0v3gl2d/percetion_test_action_sound.pth.tar?rlkey=qsd7vbpddnftpk4mjq4j8dpnm&dl=0
I just ran the validation again using the dropbox link and the result is the same as the paper.
This leads me to believe that somewhere there has been an issue extracting the features. I am currently running the whole pipeline again (extracting features from raw data -> validating on newly extracted features) to see if there is an issue. Unfortunately this may take a while, so apologies for the inconvenience and thank you for your patience.
Oh, thank you very much for spending your precious time helping me, you are so great!! I hope to have good news.
Oh! There is one thing I changed, in the file TIM/feature_extractors/omnivore/utils/perception_test/make_npyfiles.py, because the previous code led to an error. Could this change have made the result different?
Hi,
Thank you for your patience while we looked into this; we have now found the error.
It turns out that the pre-trained backbones were reported incorrectly. For the Perception Test, we actually used the pre-trained EPIC backbones for both the audio and visual modalities. We will update the READMEs and config files accordingly.
You can see our experiments below to validate this, where the re-extracted EPIC features clearly give comparable performance. Note that it will be incredibly difficult to replicate our exact numbers, due to variations in GPUs (the re-extracted features were computed on different GPUs to the originals) and to non-deterministic operations even at inference. However, a margin of error of a few percentage points either side is acceptable.
Original Visual + Original Audio:
==========================================
Visual Views Seen: 203778
------------------------------------------
Visual Action Acc@1 61.081
Visual Action Acc@5 84.664
------------------------------------------
Visual Loss 2.53442
==========================================
Audio Views Seen: 193115
------------------------------------------
Audio Acc@1 56.065
Audio Acc@5 87.363
------------------------------------------
Audio Loss 2.10893
==========================================
Actions Seen: 71065
==========================================
Original Visual + Re-Extracted Audio (EPIC-Sounds Pre-trained):
==========================================
Visual Views Seen: 203778
------------------------------------------
Visual Action Acc@1 60.892
Visual Action Acc@5 84.636
------------------------------------------
Visual Loss 2.53560
==========================================
Audio Views Seen: 193115
------------------------------------------
Audio Acc@1 56.067
Audio Acc@5 87.604
------------------------------------------
Audio Loss 2.10833
==========================================
Actions Seen: 71065
==========================================
Re-Extracted Visual (EPIC-100 Pre-trained) + Original Audio:
==========================================
Visual Views Seen: 203778
------------------------------------------
Visual Action Acc@1 60.971
Visual Action Acc@5 84.605
------------------------------------------
Visual Loss 2.53463
==========================================
Audio Views Seen: 193115
------------------------------------------
Audio Acc@1 56.199
Audio Acc@5 87.419
------------------------------------------
Audio Loss 2.10929
==========================================
Actions Seen: 71065
==========================================
Re-Extracted Visual (EPIC-100 Pre-trained) + Re-Extracted Audio (EPIC-Sounds Pre-trained):
==========================================
Visual Views Seen: 203778
------------------------------------------
Visual Action Acc@1 60.858
Visual Action Acc@5 84.613
------------------------------------------
Visual Loss 2.53585
==========================================
Audio Views Seen: 193115
------------------------------------------
Audio Acc@1 55.972
Audio Acc@5 87.545
------------------------------------------
Audio Loss 2.10878
==========================================
Actions Seen: 71065
==========================================
---------------------------- ERRONEOUS REPLICATION ----------------------------
Original Visual + Re-Extracted Audio (VGG-Sound Pre-trained):
==========================================
Visual Views Seen: 203778
------------------------------------------
Visual Action Acc@1 48.109
Visual Action Acc@5 78.578
------------------------------------------
Visual Loss 2.73987
==========================================
Audio Views Seen: 193115
------------------------------------------
Audio Acc@1 21.173
Audio Acc@5 82.495
------------------------------------------
Audio Loss 2.70152
==========================================
Actions Seen: 71065
==========================================
Re-Extracted Visual (Kinetics+SUN RGB-D Pre-trained) + Original Audio:
==========================================
Visual Views Seen: 203778
------------------------------------------
Visual Action Acc@1 24.602
Visual Action Acc@5 39.027
------------------------------------------
Visual Loss 4.33575
==========================================
Audio Views Seen: 193115
------------------------------------------
Audio Acc@1 43.416
Audio Acc@5 71.815
------------------------------------------
Audio Loss 2.45227
==========================================
Actions Seen: 71065
==========================================
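The margin-of-error comparison above can be sketched as follows. The 2-point tolerance is an assumption, not an official threshold, and the figures are the Acc@1 values from the result tables above.

```python
def within_margin(reported, replicated, tol=2.0):
    """True if every metric is within `tol` percentage points of the reported value."""
    return all(abs(reported[k] - replicated[k]) <= tol for k in reported)

reported    = {"visual_acc1": 61.081, "audio_acc1": 56.065}  # original features
reextracted = {"visual_acc1": 60.858, "audio_acc1": 55.972}  # EPIC backbones
vggsound    = {"visual_acc1": 48.109, "audio_acc1": 21.173}  # erroneous replication
print(within_margin(reported, reextracted))  # True
print(within_margin(reported, vggsound))     # False
```

The EPIC re-extractions sit well inside the tolerance, while the VGG-Sound / Kinetics+SUN RGB-D replications fall far outside it.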
When I use the pretrained model to recreate the accuracy reported in the paper, the result is quite different.
Perception Test Action:
Perception Test Sound:
My command:
My question: is there something wrong that leads to such a different result? Thank you!