Ziyang412 / UCoFiA

Pytorch Code for "Unified Coarse-to-Fine Alignment for Video-Text Retrieval" (ICCV 2023)
https://arxiv.org/abs/2309.10091
MIT License

low R@1 performance #1

Closed Jeonghoon4 closed 7 months ago

Jeonghoon4 commented 9 months ago

When I run train_msrvtt.sh to reproduce the results, the R@1 is lower than reported in the paper. Is it common to observe significant variance relative to the original results?

[paper]
Text-to-Video: R@1: 49.4 - R@5: 72.1 - Mean R: 12.9
Video-to-Text: R@1: 47.1 - R@5: 74.3

[reproduce] (1st run)
Text-to-Video: R@1: 47.5 - R@5: 74.2 - R@10: 82.3 - Median R: 2.0 - Mean R: 13.0
Video-to-Text: R@1: 46.1 - R@5: 72.8 - R@10: 82.1 - Median R: 2.0 - Mean R: 10.2

(2nd run)
Text-to-Video: R@1: 47.3 - R@5: 73.5 - R@10: 82.0 - Median R: 2.0 - Mean R: 13.1
Video-to-Text: R@1: 45.3 - R@5: 72.1 - R@10: 81.6 - Median R: 2.0 - Mean R: 10.3
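(For context, these retrieval metrics are computed from the text-to-video similarity matrix roughly as follows; this is a minimal sketch assuming ground-truth pairs on the diagonal, not the repo's actual evaluation code.)

```python
import numpy as np

def retrieval_metrics(sim):
    # sim: [num_texts, num_videos] similarity matrix, ground truth on the diagonal
    order = np.argsort(-sim, axis=1)                                  # videos ranked best-first per text
    gt_rank = np.where(order == np.arange(sim.shape[0])[:, None])[1] + 1  # 1-based rank of the correct video
    return {
        "R@1": 100 * np.mean(gt_rank <= 1),
        "R@5": 100 * np.mean(gt_rank <= 5),
        "R@10": 100 * np.mean(gt_rank <= 10),
        "Median R": float(np.median(gt_rank)),
        "Mean R": float(np.mean(gt_rank)),
    }

print(retrieval_metrics(np.random.rand(1000, 1000)))  # placeholder similarity matrix
```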

Ziyang412 commented 9 months ago

Thank you for your attention to our work. Here are several possible reasons:

  1. The Text-to-Video results in the paper are obtained with SK-norm. So after running train_msrvtt.sh, you should run the commands in "2) Evaluate text-to-video retrieval with SK-norm on MSR-VTT dataset" under "Train and Inference" (a sketch of what SK-norm does is shown after this list).

  2. The GPU type and PyTorch version could also affect the results; we use 4 A5000 GPUs for training and PyTorch==1.12.1.
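For context, SK-norm refers to Sinkhorn-Knopp style normalization of the test-time similarity matrix. A minimal sketch (an illustration of the idea, not this repository's exact implementation):

```python
import torch

def sinkhorn_normalize(sim, n_iters=4, eps=1e-8):
    # sim: [num_texts, num_videos] similarity matrix, assumed non-negative
    # (e.g. after an exp/softmax-style transform).
    p = sim.clone()
    for _ in range(n_iters):
        p = p / (p.sum(dim=1, keepdim=True) + eps)  # normalize over videos
        p = p / (p.sum(dim=0, keepdim=True) + eps)  # normalize over texts
    return p

sim = torch.rand(1000, 1000)       # placeholder similarity matrix
sim_sk = sinkhorn_normalize(sim)   # rank with sim_sk instead of sim at test time
```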

Hope it helps!

Jeonghoon4 commented 9 months ago

Thank you for the quick response.

  1. The previous results were obtained using the eval_msrvtt.sh script in the eval_t2v folder. It gave a slight improvement over train_msrvtt.sh, but R@1 is still clearly below the paper. When evaluating with eval_v2t, however, I obtained results similar to the V2T R@1 reported in the paper.
     Text-to-Video: R@1: 45.6 - R@5: 72.7 - R@10: 81.7 - Median R: 2.0 - Mean R: 13.4
     Video-to-Text: R@1: 47.5 - R@5: 73.4 - R@10: 82.4 - Median R: 2.0 - Mean R: 12.5

  2. I conducted the experiments using four 3090 GPUs and the following environment: Python version 3.8, Torch version 1.12.1, and CUDA version 11.3.

I'm currently rechecking the code, dataset, and environment before proceeding with further experiments. If there are any other insights or issues, please do let me know. Thank you.
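A quick way to compare environments across the machines is something like this (generic PyTorch calls, not part of the UCoFiA codebase):

```python
import torch

print("torch:", torch.__version__)
print("cuda (build):", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())
print("gpus:", [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
print("deterministic:", torch.backends.cudnn.deterministic,
      "benchmark:", torch.backends.cudnn.benchmark)
```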

Ziyang412 commented 9 months ago

Thank you for letting me know more about your experiments! I have only tested on A5000 and A6000 GPUs with CUDA 11.6, and the results were similar. Here is the log from my last run:

```
2023-03-07 22:37:38,070:WARNING: Using patch shift!
2023-03-07 22:50:47,426:INFO: sim matrix size: 1000, 1000
2023-03-07 22:50:47,514:INFO: Length-T: 1000, Length-V: 1000
2023-03-07 22:50:47,514:INFO: Text-to-Video:
2023-03-07 22:50:47,515:INFO: >>> R@1: 49.4 - R@5: 72.1 - R@10: 83.5 - Median R: 2.0 - Mean R: 12.9
```

Meanwhile, I uploaded my MSR-VTT checkpoint at https://drive.google.com/file/d/1-zynuNEI1u0TwNxw2R163kX4kocWDsIu/view?usp=sharing, hope it helps! If you can provide more information, I am more than happy to help you with this reproduction issue.
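If it helps, here is a minimal way to sanity-check the downloaded checkpoint before running the eval scripts (the file name and the model builder are placeholders, not the repo's actual arguments):

```python
import torch

state = torch.load("ucofia_msrvtt.pth", map_location="cpu")  # hypothetical path
if isinstance(state, dict) and "state_dict" in state:        # some checkpoints wrap the weights
    state = state["state_dict"]
print(len(state), "tensors, e.g.:", list(state.keys())[:5])

# model = build_model(...)                                    # placeholder for the repo's model builder
# missing, unexpected = model.load_state_dict(state, strict=False)
# print("missing:", missing, "unexpected:", unexpected)
```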

Best, Ziyang

Jeonghoon4 commented 9 months ago

Thank you so much for your assistance.

When I evaluated your checkpoint, I got the following results. I noticed a slight decrease in performance, but I cannot pinpoint the exact cause. Do you have any insight into why performance might change during evaluation?

Text-to-Video: R@1: 48.6 - R@5: 73.4 - R@10: 82.7 - Median R: 2.0 - Mean R: 12.7
Video-to-Text: R@1: 44.8 - R@5: 71.4 - R@10: 81.9 - Median R: 2.0 - Mean R: 9.1

Also, the best checkpoint in my run came from epoch 7, while the paper mentions training for 8 epochs on MSR-VTT. So I reduced the number of training epochs to 8 and then to 5 and got slightly better performance. Below are the results when training with epochs=5 (best at epoch 4). I think that by adjusting the random seed a bit I could reach similar performance.

Text-to-Video: R@1: 48.2 - R@5: 72.4 - R@10: 82.7 - Median R: 2.0 - Mean R: 12.2
Video-to-Text: R@1: 45.0 - R@5: 73.2 - R@10: 82.4 - Median R: 2.0 - Mean R: 8.7
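For the seed experiments, a minimal sketch of pinning seeds so runs are more comparable (generic PyTorch seeding, not specific to this repo; exact reproducibility across GPU models is still not guaranteed):

```python
import os
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.backends.cudnn.deterministic = True   # trade speed for determinism
    torch.backends.cudnn.benchmark = False

set_seed(42)
```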

Ziyang412 commented 9 months ago

Thanks for the update!

> When I evaluated your checkpoint, I got the following results. I noticed a slight decrease in performance, but I cannot pinpoint the exact cause. Do you have any insight into why performance might change during evaluation?

Well, I am trying to understand this difference in terms of the packages and hardware we each have. I have never tested on a 3090 and don't have access to one, but I wouldn't expect that alone to cause a 0.8% R@1 gap. The CUDA version is another possible factor, since numerical computation can vary slightly across versions. Maybe try newer GPUs with a higher CUDA version if possible?

> Also, the best checkpoint in my run came from epoch 7, while the paper mentions training for 8 epochs on MSR-VTT. So I reduced the number of training epochs to 8 and then to 5 and got slightly better performance.

Well, it's expected that different GPU, CUDA, and package versions can lead to slightly different optimization. I would recommend trying more hyper-parameter tuning here (e.g., epochs=15, or a different learning rate).
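As a rough sketch, a small sweep could look like the following (the flag names --epochs and --lr are hypothetical; adapt them to whatever train_msrvtt.sh actually passes to the trainer):

```python
import itertools
import subprocess

# Hypothetical sweep over epochs and learning rate, reusing the training script.
for epochs, lr in itertools.product([5, 8, 15], [1e-4, 5e-5, 1e-5]):
    cmd = ["bash", "train_msrvtt.sh", f"--epochs={epochs}", f"--lr={lr}"]  # hypothetical args
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```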

Hope it helps!

Jeonghoon4 commented 9 months ago

Thank you very much for your help. I was able to get results similar to yours when evaluating with the original videos (not the compressed ones).

Below are the results I obtained with the provided weights on a 3090 and an A6000. Although they are different servers, I set up the same environment on both (CUDA 11.6 and the same library versions).

[train-compressed / test-compressed] (3090) Text-to-Video: R@1: 48.6 - R@5: 73.4 - R@10: 82.7 - Median R: 2.0 - Mean R: 12.7

(A6000) Text-to-Video: R@1: 48.6 - R@5: 73.4 - R@10: 82.7 - Median R: 2.0 - Mean R: 12.7

The results are the same, so this is not a GPU issue. While looking for other causes, I rechecked the GitHub code and found that both train_msrvtt.sh and eval_msrvtt.sh have their default paths set to videos/all. So when I evaluated with the original videos instead, using the same weights as above, I obtained the following results:

[train-compressed / test-original] (3090 & A6000) Text-to-Video: R@1: 49.6 - R@5: 73.5 - R@10: 82.8 - Median R: 2.0 - Mean R: 12.3

This also shows a slight difference compared to the provided log (49.4 vs. 49.6). Below are the results for the model I reproduced earlier: I trained it on compressed videos with train_msrvtt.sh and evaluated it on the original videos (epochs=15 on a 3090). I also tried training on the original-size videos, but that gave lower results than training on the compressed ones.

[train-compressed / test-original] Text-to-Video: R@1: 48.9 - R@5: 74.2 - R@10: 82.3 - Median R: 2.0 - Mean R: 12.6

[train-original / test-original] Text-to-Video: R@1: 47.8 - R@5: 72.2 - R@10: 82.7 - Median R: 2.0 - Mean R: 11.9

Using the original videos for evaluation doesn't always improve performance; in many cases it stays the same or even decreases. When you ran the experiments, did you use compressed videos for both training and testing, or compressed videos for training and original videos for testing? Could you please clarify?

Thank you.

Ziyang412 commented 9 months ago

Thanks for your update and such comprehensive results on our project!

As mentioned in the README, we compress all the videos to 3×224×224 for both the training and testing phases. As for the first issue, it's quite strange; I will rerun the code soon and try to find the cause.
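For reference, the compression can be reproduced with ffmpeg roughly as follows (paths and the exact scaling policy are assumptions; please follow the README's preprocessing instructions for the official setup):

```python
import subprocess
from pathlib import Path

src, dst = Path("videos/all"), Path("videos/compressed")
dst.mkdir(parents=True, exist_ok=True)
for video in src.glob("*.mp4"):
    subprocess.run([
        "ffmpeg", "-y", "-i", str(video),
        "-r", "3",               # 3 frames per second
        "-vf", "scale=224:224",  # resize to 224x224
        str(dst / video.name),
    ], check=True)
```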

Thank you.