JacobChalk / TIM

Codebase for the paper: "TIM: A Time Interval Machine for Audio-Visual Action Recognition"

recreate perception_test problems #30

Closed: WannaSir closed this issue 3 months ago

WannaSir commented 3 months ago

When I use the pretrained model to recreate the accuracy reported in the paper:

[image: accuracy table from the paper]

the results are quite different. Perception Test Action:

[07/08 15:10:24] test.py - L221: | Epoch: [1][1/530] | Time: 2.528 | Data: 1.476 | Net: 1.038 | Visual Views Seen: 444 | Visual Loss: 4.3088 | RAM: 22.22/503.51GB | GPU: 0.35/47.54GB |
[07/08 15:10:26] test.py - L221: | Epoch: [1][101/530] | Time: 0.025 | Data: 0.004 | Net: 0.005 | Visual Views Seen: 40417 | Visual Loss: 5.1348 | RAM: 22.31/503.51GB | GPU: 0.37/47.54GB |
[07/08 15:10:28] test.py - L221: | Epoch: [1][201/530] | Time: 0.018 | Data: 0.001 | Net: 0.004 | Visual Views Seen: 78875 | Visual Loss: 5.1217 | RAM: 22.40/503.51GB | GPU: 0.37/47.54GB |
[07/08 15:10:30] test.py - L221: | Epoch: [1][301/530] | Time: 0.018 | Data: 0.001 | Net: 0.005 | Visual Views Seen: 118017 | Visual Loss: 5.0983 | RAM: 22.48/503.51GB | GPU: 0.37/47.54GB |
[07/08 15:10:32] test.py - L221: | Epoch: [1][401/530] | Time: 0.019 | Data: 0.001 | Net: 0.007 | Visual Views Seen: 156491 | Visual Loss: 5.0751 | RAM: 22.57/503.51GB | GPU: 0.37/47.54GB |
[07/08 15:10:34] test.py - L221: | Epoch: [1][501/530] | Time: 0.022 | Data: 0.001 | Net: 0.004 | Visual Views Seen: 192740 | Visual Loss: 5.0790 | RAM: 22.64/503.51GB | GPU: 0.37/47.54GB |
[07/08 15:10:35] test.py - L232: 
Epoch 1 Results:
    ==========================================
    Visual Views Seen: 203778
    ------------------------------------------
    Visual Action Acc@1 4.204
    Visual Action Acc@5 13.016
    ------------------------------------------
    Visual Loss 5.08218
    ==========================================
    Actions Seen: 35440
    ==========================================

Perception Test Sound:

[07/08 14:13:46] test.py - L221: | Epoch: [1][1/522] | Time: 2.925 | Data: 1.666 | Net: 1.248 | Audio Views Seen: 268 | Audio Loss: 3.3955 | RAM: 24.72/503.51GB | GPU: 0.34/47.54GB |
[07/08 14:13:48] test.py - L221: | Epoch: [1][101/522] | Time: 0.019 | Data: 0.002 | Net: 0.006 | Audio Views Seen: 39248 | Audio Loss: 3.3847 | RAM: 24.88/503.51GB | GPU: 0.36/47.54GB |
[07/08 14:13:50] test.py - L221: | Epoch: [1][201/522] | Time: 0.020 | Data: 0.001 | Net: 0.006 | Audio Views Seen: 75150 | Audio Loss: 3.3666 | RAM: 24.94/503.51GB | GPU: 0.36/47.54GB |
[07/08 14:13:52] test.py - L221: | Epoch: [1][301/522] | Time: 0.019 | Data: 0.002 | Net: 0.006 | Audio Views Seen: 111565 | Audio Loss: 3.3688 | RAM: 25.01/503.51GB | GPU: 0.36/47.54GB |
[07/08 14:13:54] test.py - L221: | Epoch: [1][401/522] | Time: 0.016 | Data: 0.001 | Net: 0.006 | Audio Views Seen: 148615 | Audio Loss: 3.3686 | RAM: 25.05/503.51GB | GPU: 0.36/47.54GB |
[07/08 14:13:56] test.py - L221: | Epoch: [1][501/522] | Time: 0.017 | Data: 0.001 | Net: 0.006 | Audio Views Seen: 185122 | Audio Loss: 3.3729 | RAM: 25.13/503.51GB | GPU: 0.36/47.54GB |
[07/08 14:13:56] test.py - L232: 
Epoch 1 Results:
    ==========================================
    Audio Views Seen: 193115
    ------------------------------------------
    Audio Acc@1 2.964 
    Audio Acc@5 62.554
    ------------------------------------------
    Audio Loss 3.37354
    ==========================================
    Actions Seen: 35625
    ==========================================

My command:

python scripts/run_net.py \
--validate \
--output_dir /data/nvme2/dky/lc/TIM/recognition/output \
--video_data_path /data/hdd/lishenshen/timfeature_data/perception_test-feature/val_v \
--video_train_action_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_Action_train.pkl \
--video_val_action_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_Action_validation.pkl \
--video_train_context_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_1_second_train_feature_times.pkl \
--video_val_context_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_1_second_validation_feature_times.pkl \
--visual_input_dim 1024 \
--audio_data_path /data/hdd/lishenshen/timfeature_data/perception_test-feature/val_a \
--audio_train_action_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_Sound_train.pkl \
--audio_val_action_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_Sound_validation.pkl \
--audio_train_context_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_1_second_train_feature_times.pkl \
--audio_val_context_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_1_second_validation_feature_times.pkl \
--audio_input_dim 2304 \
--video_info_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_video_info.pkl \
--dataset perception \
--feat_stride 2 \
--feat_dropout 0.1 \
--seq_dropout 0.1 \
--include_verb_noun False \
--pretrained_model /data/nvme2/dky/lc/TIM/recognition/pretrained_models/percetion_test_action_sound.pth.tar \
--model_modality audio \
--data_modality audio

and

python scripts/run_net.py \
--validate \
--output_dir /data/nvme2/dky/lc/TIM/recognition/output \
--video_data_path /data/hdd/lishenshen/timfeature_data/perception_test-feature/val_v \
--video_train_action_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_Action_train.pkl \
--video_val_action_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_Action_validation.pkl \
--video_train_context_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_1_second_train_feature_times.pkl \
--video_val_context_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_1_second_validation_feature_times.pkl \
--visual_input_dim 1024 \
--audio_data_path /data/hdd/lishenshen/timfeature_data/perception_test-feature/val_a \
--audio_train_action_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_Sound_train.pkl \
--audio_val_action_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_Sound_validation.pkl \
--audio_train_context_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_1_second_train_feature_times.pkl \
--audio_val_context_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_1_second_validation_feature_times.pkl \
--audio_input_dim 2304 \
--video_info_pickle /data/nvme2/dky/lc/TIM/annotations/Perception_Test/Perception_Test_video_info.pkl \
--dataset perception \
--feat_stride 2 \
--feat_dropout 0.1 \
--seq_dropout 0.1 \
--include_verb_noun False \
--pretrained_model /data/nvme2/dky/lc/TIM/recognition/pretrained_models/percetion_test_action_sound.pth.tar \
--model_modality video \
--data_modality video

My question: is there something wrong that leads to such different results? Thank you!

JacobChalk commented 3 months ago

Hi,

The shared model is trained and evaluated on Perception Test Action and Sound jointly, so it is likely not loading all of the weights properly when evaluating each modality separately. Would you be able to output the results when using --model_modality audio_visual and --data_modality audio_visual? At the very least, --model_modality audio_visual is essential, as the model will have learned correlations between the two modalities.
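
If it helps debug the weight loading, here is a minimal sketch for inspecting which modality-specific weights the shared checkpoint contains (the top-level key layout of the checkpoint dict is an assumption here; adjust to however the repo saves it):

>>> import torch
>>> ckpt = torch.load('percetion_test_action_sound.pth.tar', map_location='cpu')
>>> state = ckpt.get('model_state_dict', ckpt)  # assumed checkpoint layout
>>> sorted({k.split('.')[0] for k in state})    # top-level module names, per modality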

WannaSir commented 3 months ago

Hi, when I add --model_modality audio_visual and --data_modality audio_visual, the output is as follows:

[07/08 16:51:03] test.py - L221: | Epoch: [1][1/531] | Time: 5.043 | Data: 2.810 | Net: 2.203 | Visual Views Seen: 444 | Visual Loss: 3.3867 | Audio Views Seen: 268 | Audio Loss: 3.5245 | RAM: 40.05/503.51GB | GPU: 0.72/47.54GB |
[07/08 16:51:07] test.py - L221: | Epoch: [1][101/531] | Time: 0.041 | Data: 0.003 | Net: 0.010 | Visual Views Seen: 40415 | Visual Loss: 4.3145 | Audio Views Seen: 38488 | Audio Loss: 3.4950 | RAM: 40.32/503.51GB | GPU: 0.77/47.54GB |
[07/08 16:51:12] test.py - L221: | Epoch: [1][201/531] | Time: 0.042 | Data: 0.003 | Net: 0.009 | Visual Views Seen: 78867 | Visual Loss: 4.3345 | Audio Views Seen: 74103 | Audio Loss: 3.4582 | RAM: 40.49/503.51GB | GPU: 0.77/47.54GB |
[07/08 16:51:16] test.py - L221: | Epoch: [1][301/531] | Time: 0.042 | Data: 0.004 | Net: 0.008 | Visual Views Seen: 118005 | Visual Loss: 4.3236 | Audio Views Seen: 109549 | Audio Loss: 3.4540 | RAM: 40.69/503.51GB | GPU: 0.77/47.54GB |
[07/08 16:51:20] test.py - L221: | Epoch: [1][401/531] | Time: 0.042 | Data: 0.004 | Net: 0.008 | Visual Views Seen: 156457 | Visual Loss: 4.3058 | Audio Views Seen: 146523 | Audio Loss: 3.4496 | RAM: 40.85/503.51GB | GPU: 0.77/47.54GB |
[07/08 16:51:25] test.py - L221: | Epoch: [1][501/531] | Time: 0.041 | Data: 0.004 | Net: 0.007 | Visual Views Seen: 192591 | Visual Loss: 4.3113 | Audio Views Seen: 181700 | Audio Loss: 3.4455 | RAM: 40.96/503.51GB | GPU: 0.77/47.54GB |
[07/08 16:51:27] test.py - L232: 
Epoch 1 Results:
        ==========================================
        Visual Views Seen: 203778
        ------------------------------------------
        Visual Action Acc@1 11.947
        Visual Action Acc@5 30.330
        ------------------------------------------
        Visual Loss 4.31476
        ==========================================
        Audio Views Seen: 193115
        ------------------------------------------
        Audio Acc@1 5.381 
        Audio Acc@5 42.956
        ------------------------------------------
        Audio Loss 3.44822
        ==========================================
        Actions Seen: 71065
        ==========================================

With --model_modality audio_visual and --data_modality audio, the output is:

[07/08 17:02:36] test.py - L221: | Epoch: [1][1/522] | Time: 4.889 | Data: 2.788 | Net: 2.084 | Audio Views Seen: 268 | Audio Loss: 3.5245 | RAM: 40.75/503.51GB | GPU: 0.49/47.54GB |
[07/08 17:02:40] test.py - L221: | Epoch: [1][101/522] | Time: 0.028 | Data: 0.001 | Net: 0.008 | Audio Views Seen: 39248 | Audio Loss: 3.4942 | RAM: 41.44/503.51GB | GPU: 0.52/47.54GB |
[07/08 17:02:46] test.py - L221: | Epoch: [1][201/522] | Time: 0.045 | Data: 0.018 | Net: 0.009 | Audio Views Seen: 75150 | Audio Loss: 3.4600 | RAM: 41.58/503.51GB | GPU: 0.52/47.54GB |
[07/08 17:02:51] test.py - L221: | Epoch: [1][301/522] | Time: 0.027 | Data: 0.001 | Net: 0.009 | Audio Views Seen: 111565 | Audio Loss: 3.4529 | RAM: 41.69/503.51GB | GPU: 0.52/47.54GB |
[07/08 17:02:57] test.py - L221: | Epoch: [1][401/522] | Time: 0.043 | Data: 0.002 | Net: 0.008 | Audio Views Seen: 148615 | Audio Loss: 3.4495 | RAM: 41.69/503.51GB | GPU: 0.52/47.54GB |
[07/08 17:03:02] test.py - L221: | Epoch: [1][501/522] | Time: 0.044 | Data: 0.016 | Net: 0.009 | Audio Views Seen: 185122 | Audio Loss: 3.4464 | RAM: 41.87/503.51GB | GPU: 0.52/47.54GB |
[07/08 17:03:04] test.py - L232: 
Epoch 1 Results:
        ==========================================
        Audio Views Seen: 193115
        ------------------------------------------
        Audio Acc@1 5.381 
        Audio Acc@5 42.961
        ------------------------------------------
        Audio Loss 3.44822
        ==========================================
        Actions Seen: 35625
        ==========================================

With --model_modality audio_visual and --data_modality visual, the output is as follows:

[07/08 17:09:56] test.py - L221: | Epoch: [1][1/530] | Time: 3.300 | Data: 1.928 | Net: 1.351 | Visual Views Seen: 444 | Visual Loss: 3.3867 | RAM: 28.20/503.51GB | GPU: 0.53/47.54GB |
[07/08 17:09:59] test.py - L221: | Epoch: [1][101/530] | Time: 0.030 | Data: 0.001 | Net: 0.019 | Visual Views Seen: 40417 | Visual Loss: 4.3145 | RAM: 28.37/503.51GB | GPU: 0.57/47.54GB |
[07/08 17:10:02] test.py - L221: | Epoch: [1][201/530] | Time: 0.033 | Data: 0.001 | Net: 0.009 | Visual Views Seen: 78875 | Visual Loss: 4.3345 | RAM: 28.43/503.51GB | GPU: 0.57/47.54GB |
[07/08 17:10:05] test.py - L221: | Epoch: [1][301/530] | Time: 0.030 | Data: 0.001 | Net: 0.009 | Visual Views Seen: 118017 | Visual Loss: 4.3236 | RAM: 28.50/503.51GB | GPU: 0.57/47.54GB |
[07/08 17:10:09] test.py - L221: | Epoch: [1][401/530] | Time: 0.030 | Data: 0.001 | Net: 0.009 | Visual Views Seen: 156491 | Visual Loss: 4.3060 | RAM: 28.59/503.51GB | GPU: 0.57/47.54GB |
[07/08 17:10:12] test.py - L221: | Epoch: [1][501/530] | Time: 0.037 | Data: 0.001 | Net: 0.009 | Visual Views Seen: 192740 | Visual Loss: 4.3110 | RAM: 28.68/503.51GB | GPU: 0.57/47.54GB |
[07/08 17:10:13] test.py - L232: 
Epoch 1 Results:
        ==========================================
        Visual Views Seen: 203778
        ------------------------------------------
        Visual Action Acc@1 11.950
        Visual Action Acc@5 30.333
        ------------------------------------------
        Visual Loss 4.31476
        ==========================================
        Actions Seen: 35440
        ==========================================

The results above are quite different from the results reported in the paper.

JacobChalk commented 3 months ago

The arguments look correct and the weights are now loading identically; also, based on your logs, the windows are being constructed correctly.

The final place to look is the input features, as something may have gone wrong there. Are you able to provide an example output of the Omnivore and Auditory SlowFast features for video_7723.npy in the validation set? Here is an example of ours:

Omnivore:

>>> import numpy as np
>>> test_arr = np.load('video_7723.npy')
>>> test_arr.shape
(131, 1, 1024)
>>> test_arr[:5, :, :4]
array([[[ 0.6521435 , -0.16762784, -0.5860762 , -0.80873215]],
       [[ 1.4387008 , -0.36775348, -0.5040714 , -0.5070412 ]],
       [[ 1.4224304 , -0.49926353, -0.45585755, -0.47254056]],
       [[ 1.430981  , -0.47914758, -0.40739724, -0.47641215]],
       [[ 1.4302788 , -0.51078564, -0.46458745, -0.45350054]]],
      dtype=float32)

Auditory SlowFast:

>>> import numpy as np
>>> test_arr = np.load('video_7723.npy')
>>> test_arr.shape
(131, 1, 2304)
>>> test_arr[:5, :, :4]
array([[[0.0278179 , 0.        , 0.00105882, 0.19269007]],
       [[0.10143611, 0.00118782, 0.29899448, 0.2853284 ]],
       [[0.35923526, 0.        , 0.21170901, 0.309517  ]],
       [[0.47848836, 0.        , 0.04065616, 0.4271591 ]],
       [[0.50873214, 0.        , 0.00262807, 0.26922128]]], dtype=float32)

For transparency, here is output log of our model/features as well:

[07/08 09:36:29] test.py - L221: | Epoch: [1][1/531] | Time: 14.524 | Data: 3.814 | Net: 9.605 | Visual Views Seen: 444 | Visual Loss: 1.9800 | Audio Views Seen: 268 | Audio Loss: 1.6494 | RAM: 43.93/503.24GB | GPU: 0.72/31.74GB |
[07/08 09:36:50] test.py - L221: | Epoch: [1][101/531] | Time: 0.118 | Data: 0.001 | Net: 0.010 | Visual Views Seen: 40415 | Visual Loss: 2.4932 | Audio Views Seen: 38488 | Audio Loss: 2.1767 | RAM: 44.20/503.24GB | GPU: 0.77/31.74GB |
[07/08 09:37:02] test.py - L221: | Epoch: [1][201/531] | Time: 0.118 | Data: 0.001 | Net: 0.010 | Visual Views Seen: 78867 | Visual Loss: 2.5077 | Audio Views Seen: 74103 | Audio Loss: 2.1528 | RAM: 44.41/503.24GB | GPU: 0.77/31.74GB |
[07/08 09:37:14] test.py - L221: | Epoch: [1][301/531] | Time: 0.118 | Data: 0.001 | Net: 0.010 | Visual Views Seen: 118005 | Visual Loss: 2.5179 | Audio Views Seen: 109549 | Audio Loss: 2.1230 | RAM: 44.59/503.24GB | GPU: 0.77/31.74GB |
[07/08 09:37:26] test.py - L221: | Epoch: [1][401/531] | Time: 0.121 | Data: 0.001 | Net: 0.010 | Visual Views Seen: 156457 | Visual Loss: 2.5370 | Audio Views Seen: 146523 | Audio Loss: 2.1170 | RAM: 44.75/503.24GB | GPU: 0.77/31.74GB |
[07/08 09:37:38] test.py - L221: | Epoch: [1][501/531] | Time: 0.119 | Data: 0.001 | Net: 0.010 | Visual Views Seen: 192591 | Visual Loss: 2.5344 | Audio Views Seen: 181700 | Audio Loss: 2.1100 | RAM: 44.87/503.24GB | GPU: 0.77/31.74GB |
[07/08 09:37:42] test.py - L232: 
Epoch 1 Results:
    ==========================================
    Visual Views Seen: 203778
    ------------------------------------------
    Visual Action Acc@1 61.081
    Visual Action Acc@5 84.664
    ------------------------------------------
    Visual Loss 2.53442
    ==========================================
    Audio Views Seen: 193115
    ------------------------------------------
    Audio Acc@1 56.065 
    Audio Acc@5 87.363
    ------------------------------------------
    Audio Loss 2.10893
    ==========================================
    Actions Seen: 71065
    ==========================================
WannaSir commented 3 months ago

Here are the outputs for my extracted features:

Omnivore:

>>> import numpy as np
>>> test_arr = np.load('video_7723.npy')
>>> test_arr.shape
(131, 1, 1024)
>>> test_arr[:5, :, :5]
array([[[ 0.29367825, -0.00748634, -0.2776588 , -0.3009204 ,
          0.3279833 ]],

       [[ 0.25731656, -0.03482111, -0.21351193, -0.1365885 ,
          0.22836693]],

       [[ 0.36899644,  0.05199311, -0.21512416, -0.13552672,
          0.26339605]],

       [[ 0.3503914 ,  0.1112382 , -0.18157834, -0.15410316,
          0.28834268]],

       [[ 0.37425473,  0.1081804 , -0.18313263, -0.22488765,
          0.27559933]]], dtype=float32)

Auditory SlowFast:

>>> import numpy as np
>>> test_arr = np.load('video_7723.npy')
>>> test_arr.shape
(131, 1, 2304)
>>> test_arr[:5, :, :4]
array([[[0.        , 0.        , 0.        , 0.00087965]],

       [[0.03268336, 0.0721959 , 0.        , 0.        ]],

       [[0.04737182, 0.11589228, 0.        , 0.01764626]],

       [[0.07842807, 0.27230534, 0.00110619, 0.        ]],

       [[0.07751907, 0.20392142, 0.03178129, 0.00857057]]], dtype=float32)
>>> 

The results of test_arr[:5, :, :5] and test_arr[:5, :, :4] are different from yours. Why?

JacobChalk commented 3 months ago

There is inevitably some variation between the features, as some operations in the backbones are non-deterministic; even with seeding, this is not 100% fixable. This test was to see if there are any surprising differences between the two (infs, NaNs, etc.), though it seems they are fine. For the audio backbone, did you use the pretrained model here?
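
For reference, this is the kind of sanity check meant above (a minimal sketch, reusing the video_7723.npy example; clean features should print (False, False)):

>>> import numpy as np
>>> test_arr = np.load('video_7723.npy')
>>> np.isnan(test_arr).any(), np.isinf(test_arr).any()
(False, False)
>>> test_arr.min(), test_arr.max()  # the value range is also worth eyeballing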

WannaSir commented 3 months ago

Yes, I have used it: TEST.CHECKPOINT_FILE_PATH /data/nvme2/dky/lc/TIM/feature_extractors/auditory_slowfast/pretrained_models/asf_vggsound.pyth \

WannaSir commented 3 months ago

May I have the command you used to get the output log you provided?

JacobChalk commented 3 months ago

The command we use is:

python scripts/run_net.py \
--validate \
--output_dir /path/to/output \
--video_data_path /path/to/perception_test_visual_features \
--video_train_action_pickle /path/to/perception_test_action_train_annotations \
--video_val_action_pickle /path/to/perception_test_action_validation_annotations \
--video_train_context_pickle /path/to/perception_test_action_train_visual_feature_intervals \
--video_val_context_pickle /path/to/perception_test_action_validation_visual_feature_intervals \
--visual_input_dim <channel-size-of-visual-features> \
--audio_data_path /path/to/perception_test_audio_features \
--audio_train_action_pickle /path/to/perception_test_sound_train_annotations \
--audio_val_action_pickle /path/to/perception_test_sound_validation_annotations \
--audio_train_context_pickle /path/to/perception_test_sound_train_audio_feature_intervals \
--audio_val_context_pickle /path/to/perception_test_sound_validation_audio_feature_intervals \
--audio_input_dim <channel-size-of-audio-features> \
--video_info_pickle /path/to/perception_test_video_metadata \
--dataset perception \
--feat_stride 2 \
--feat_dropout 0.1 \
--seq_dropout 0.1 \
--include_verb_noun False \
--pretrained_model /path/to/pretrained_model.pth.tar
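
As a quick pre-flight check on the features (a sketch, assuming they are stored as [num_windows, 1, channels] .npy files as in the examples above), the channel dimension must match --visual_input_dim and --audio_input_dim:

>>> import numpy as np
>>> np.load('/path/to/perception_test_visual_features/video_7723.npy').shape[-1]  # --visual_input_dim
1024
>>> np.load('/path/to/perception_test_audio_features/video_7723.npy').shape[-1]   # --audio_input_dim
2304
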
WannaSir commented 3 months ago

Thanks, but it is weird: our commands are the same, and the pretrained model I used is the one you provided, https://www.dropbox.com/scl/fi/xzt8rbl19cumgl0v3gl2d/percetion_test_action_sound.pth.tar?rlkey=qsd7vbpddnftpk4mjq4j8dpnm&dl=0. Is it possible that the pretrained models we used are different?


JacobChalk commented 3 months ago

I just ran the validation again using the Dropbox link, and the result is the same as in the paper.

This leads me to believe there has been an issue somewhere in extracting the features. I am currently running the whole pipeline again (extracting features from raw data -> validating on the newly extracted features) to see if there is an issue. Unfortunately, this may take a while, so apologies for the inconvenience and thank you for your patience.

WannaSir commented 3 months ago

Oh, thank you very much for spending your precious time helping me! Hoping for good news.


WannaSir commented 3 months ago

Oh! There is one thing I changed, in the file TIM/feature_extractors/omnivore/utils/perception_test/make_npyfiles.py:

[image: screenshot of the modified code]

The previous code led to an error, so I changed it. Could this change have made the results different?

JacobChalk commented 3 months ago

Hi,

Thank you for your patience while we looked into this, but we have now found the error.

It turns out that the pre-trained backbones were reported incorrectly. For the Perception Test, we actually used the EPIC pre-trained backbones for both the audio and visual modalities. We will update the READMEs and config files accordingly.

You can see our experiments below to validate this, where the re-extracted EPIC features clearly give comparable performance. Note that it will be incredibly difficult to replicate our exact numbers, due to variations in GPUs (the re-extracted features were computed on different GPUs from the originals) and to non-deterministic operations even at inference. However, a margin of error of a few percentage points either side is acceptable.
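
To put a rough number on that variation, here is a minimal sketch comparing the same video's features across two extraction runs (the directory names below are hypothetical):

>>> import numpy as np
>>> orig = np.load('features_original/video_7723.npy')       # hypothetical paths
>>> redo = np.load('features_reextracted/video_7723.npy')
>>> float(np.abs(orig - redo).max()), float(np.abs(orig - redo).mean())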

Original Visual + Original Audio:
    ==========================================
    Visual Views Seen: 203778
    ------------------------------------------
    Visual Action Acc@1 61.081
    Visual Action Acc@5 84.664
    ------------------------------------------
    Visual Loss 2.53442
    ==========================================
    Audio Views Seen: 193115
    ------------------------------------------
    Audio Acc@1 56.065 
    Audio Acc@5 87.363
    ------------------------------------------
    Audio Loss 2.10893
    ==========================================
    Actions Seen: 71065
    ==========================================

Original Visual + Re-Extracted Audio (EPIC-Sounds Pre-trained):
    ==========================================
    Visual Views Seen: 203778
    ------------------------------------------
    Visual Action Acc@1 60.892
    Visual Action Acc@5 84.636
    ------------------------------------------
    Visual Loss 2.53560
    ==========================================
    Audio Views Seen: 193115
    ------------------------------------------
    Audio Acc@1 56.067 
    Audio Acc@5 87.604
    ------------------------------------------
    Audio Loss 2.10833
    ==========================================
    Actions Seen: 71065
    ==========================================

Re-Extracted Visual (EPIC-100 Pre-trained) + Original Audio:
    ==========================================
    Visual Views Seen: 203778
    ------------------------------------------
    Visual Action Acc@1 60.971
    Visual Action Acc@5 84.605
    ------------------------------------------
    Visual Loss 2.53463
    ==========================================
    Audio Views Seen: 193115
    ------------------------------------------
    Audio Acc@1 56.199 
    Audio Acc@5 87.419
    ------------------------------------------
    Audio Loss 2.10929
    ==========================================
    Actions Seen: 71065
    ==========================================

Re-Extracted Visual (EPIC-100 Pre-trained) + Re-Extracted Audio (EPIC-Sounds Pre-trained):
    ==========================================
    Visual Views Seen: 203778
    ------------------------------------------
    Visual Action Acc@1 60.858
    Visual Action Acc@5 84.613
    ------------------------------------------
    Visual Loss 2.53585
    ==========================================
    Audio Views Seen: 193115
    ------------------------------------------
    Audio Acc@1 55.972 
    Audio Acc@5 87.545
    ------------------------------------------
    Audio Loss 2.10878
    ==========================================
    Actions Seen: 71065
    ==========================================

---------------------------- ERRONEOUS REPLICATION  ----------------------------

Original Visual + Re-Extracted Audio (VGG-Sound Pre-trained):
    ==========================================
    Visual Views Seen: 203778
    ------------------------------------------
    Visual Action Acc@1 48.109
    Visual Action Acc@5 78.578
    ------------------------------------------
    Visual Loss 2.73987
    ==========================================
    Audio Views Seen: 193115
    ------------------------------------------
    Audio Acc@1 21.173 
    Audio Acc@5 82.495
    ------------------------------------------
    Audio Loss 2.70152
    ==========================================
    Actions Seen: 71065
    ==========================================

Re-Extracted Visual (Kinetics+SUN RGB-D Pre-trained) + Original Audio:
    ==========================================
    Visual Views Seen: 203778
    ------------------------------------------
    Visual Action Acc@1 24.602
    Visual Action Acc@5 39.027
    ------------------------------------------
    Visual Loss 4.33575
    ==========================================
    Audio Views Seen: 193115
    ------------------------------------------
    Audio Acc@1 43.416 
    Audio Acc@5 71.815
    ------------------------------------------
    Audio Loss 2.45227
    ==========================================
    Actions Seen: 71065
    ==========================================