IMCCretrieval / ProST

Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval --ICCV2023 Oral
Apache License 2.0

About "compute_trick_metrics" #4

Closed YangBowenn closed 9 months ago

YangBowenn commented 10 months ago

Thank you very much for sharing your work. I have a question and would like to seek clarification. Why do the results differ when computed using `compute_trick_metrics` and `compute_metrics`? The former yields a result of 48.3, whereas the latter, under the same parameters, achieves only 40.3. It's worth noting that `compute_metrics` is the calculation method employed by most of the methods compared in the paper.
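
For context, `compute_metrics` in CLIP4Clip-style retrieval codebases computes rank-based recall from the text-video similarity matrix, with the ground truth on the diagonal. A minimal sketch of that convention (my own simplified re-implementation, not the exact code from this repo):

```python
import numpy as np

def compute_metrics_sketch(sim_matrix):
    """Rank-based retrieval metrics, assuming ground truth on the
    diagonal (text i matches video i), as in CLIP4Clip-style evaluation."""
    n = len(sim_matrix)
    # Sort videos by descending similarity for each text query,
    # then find the rank of the ground-truth video (0 = top-1).
    sorted_idx = np.argsort(-sim_matrix, axis=1)
    ranks = np.argmax(sorted_idx == np.arange(n)[:, None], axis=1)
    return {
        "R@1": 100.0 * np.mean(ranks < 1),
        "R@5": 100.0 * np.mean(ranks < 5),
        "R@10": 100.0 * np.mean(ranks < 10),
        "MedR": float(np.median(ranks)) + 1.0,
    }

# Identity similarity matrix = perfect retrieval: R@1 should be 100.
print(compute_metrics_sketch(np.eye(4)))
```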

IMCCretrieval commented 10 months ago

Thank you for your attention.

```python
row_ind, col_ind = linear_sum_assignment(2.0 - sim_matrix_best)
my_post = compute_trick_metrics(row_ind, col_ind)
```

The above code is our proposed Text-Video Hungarian post-processing strategy. For detailed descriptions, please refer to Section 3 and Table 3 in our supplementary material.
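
To make the effect of this strategy concrete, here is a self-contained toy example (the similarity values are made up; in the repo, `sim_matrix_best` is the real evaluation matrix). The one-to-one Hungarian assignment can recover a correct match that plain per-row argmax misses:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy 4x4 text-to-video similarity matrix (rows: texts, cols: videos);
# hypothetical values. The ground-truth video for text i is video i.
sim_matrix_best = np.array([
    [0.90, 0.10, 0.10, 0.20],
    [0.95, 0.60, 0.30, 0.10],  # per-row argmax picks video 0, a false match
    [0.20, 0.10, 0.60, 0.40],
    [0.10, 0.20, 0.50, 0.70],
])

# Standard retrieval: each text independently takes its top-ranked video.
r1_standard = 100.0 * np.mean(sim_matrix_best.argmax(axis=1) == np.arange(4))

# Hungarian post-processing: minimizing cost = 2.0 - similarity is the same
# as maximizing total similarity under a one-to-one text/video assignment.
row_ind, col_ind = linear_sum_assignment(2.0 - sim_matrix_best)
r1_hungarian = 100.0 * np.mean(col_ind == np.arange(4))

print(r1_standard, r1_hungarian)  # 75.0 100.0
```

Because video 0 can only be assigned to one text, the global assignment forces text 1 back to its ground-truth video, which is why the post-processed R@1 is higher than the standard metric.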

YangBowenn commented 10 months ago

Thank you for your patient response. According to the supplementary material, R@1 can reach 48.2 on the MSRVTT-9k dataset even without the post-processing strategy. However, when I attempted to replicate the results with the parameters mentioned in the paper, I only got around 40, which is significantly different from the reported results. The loading code for the MSRVTT dataset follows TS2-Net. I am unsure where the issue may be.

IMCCretrieval commented 10 months ago

Hello, I think you can try the following solutions:

  1. You can first check whether there are problems with the data. On the MSRVTT dataset, CLIP4Clip can reach about R@1 44.0.
  2. Then check the environment and configuration. I use 4 A100s and a batch size of 128:

```shell
DATA_PATH=/data4/datasets/videos/MSRVTT
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
  main_retrieval.py --do_train --eval_in_train --num_thread_reader=8 --seed 42 \
  --epochs=5 --batch_size=128 --n_display=50 \
  --train_csv ${DATA_PATH}/MSRVTT_train.9k.csv \
  --val_csv ${DATA_PATH}/MSRVTT_JSFUSION_test.csv \
  --data_path ${DATA_PATH}/MSRVTT_data.json \
  --features_path ${DATA_PATH}/compressed_videos_224_fps3 \
  --output_dir ckpts/reproduce/msrvtt \
  --datatype msrvtt --expand_msrvtt_sentences \
  --cross_num_hidden_layers 4 \
  --lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 4 \
  --feature_framerate 1 --coef_lr 1e-3 \
  --freeze_layer_num 0 --slice_framepos 2 \
  --loose_type --linear_patch 2d --sim_header seqTransf \
  --pretrained_clip_name ViT-B/32 --max_patch 12 --max_word_pro 28
```
YangBowenn commented 9 months ago

Thank you for your answer. The previous issue was caused by not setting `--expand_msrvtt_sentences`. Currently, the highest R@1 I get on the MSRVTT dataset is around 46.6, which still leaves a significant gap to the 48.2 reported in the paper. It does not seem to be a random-seed issue.

IMCCretrieval commented 9 months ago

Sorry, it seems this issue does exist; see for example https://github.com/yuqi657/ts2_net/issues/3. Besides the experimental environment and GPU, some settings also need to be changed. When we run experiments on the MSR-VTT dataset, the mask ratio and the random seed need to be adjusted: we use a random seed of 42 and a mask ratio of 0.5 in the frame decoder. You can also try other values to see if they help.
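
As an illustration of what the mask ratio controls, the sketch below randomly drops a fraction of frame features before decoding. This is a hypothetical helper (the function name and shapes are mine, not the actual ProST frame-decoder code):

```python
import numpy as np

def mask_frames(frame_feats, mask_ratio=0.5, seed=42):
    """Randomly keep a (1 - mask_ratio) fraction of frame features.
    Hypothetical illustration of the mask-ratio hyperparameter;
    not the actual ProST frame-decoder implementation."""
    rng = np.random.default_rng(seed)
    num_frames = frame_feats.shape[0]
    num_keep = int(num_frames * (1.0 - mask_ratio))
    # Keep a random subset of frame indices, in temporal order.
    keep_idx = np.sort(rng.permutation(num_frames)[:num_keep])
    return frame_feats[keep_idx], keep_idx

# 12 frames of 512-d features, matching --max_frames 12 above.
feats = np.random.randn(12, 512)
visible, keep_idx = mask_frames(feats, mask_ratio=0.5)
print(visible.shape)  # (6, 512)
```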