Closed: kang7734 closed this issue 1 year ago.
Thanks for your attention to our work, and thanks for your question. There are indeed several factors that affect the result, such as batch size, input format, torch version, OpenCV version, GPU type, etc. Our experiments were done on 8 V100 GPUs with torch 1.7.0, and the input videos are resized. I think that in your implementation the small batch size will affect the result significantly: the video retrieval task relies on a cross-entropy loss computed over the batch, so a bigger batch size means a bigger denominator (more negatives), which helps the network learn better. In our exploration we found that a bigger batch size does improve the performance, so you could try a larger batch size. Below we show the training log of our best checkpoint on MSRVTT using ViT-B/32.
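To make the batch-size point concrete, here is a minimal sketch of the kind of symmetric cross-entropy loss commonly used for text-video retrieval; it only illustrates why a larger batch adds more negatives to the softmax denominator, and the exact loss code in this repository may differ.

```python
import torch
import torch.nn.functional as F

def retrieval_ce_loss(text_emb: torch.Tensor, video_emb: torch.Tensor,
                      logit_scale: float = 100.0) -> torch.Tensor:
    """Symmetric cross-entropy over the in-batch similarity matrix.

    text_emb, video_emb: [batch_size, dim] embeddings. The positive pair for
    text i is video i; every other video in the batch is a negative, so a
    larger batch directly enlarges the softmax denominator.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    sims = logit_scale * text_emb @ video_emb.t()      # [B, B] similarity matrix
    labels = torch.arange(sims.size(0), device=sims.device)
    loss_t2v = F.cross_entropy(sims, labels)           # text -> video direction
    loss_v2t = F.cross_entropy(sims.t(), labels)       # video -> text direction
    return 0.5 * (loss_t2v + loss_v2t)
```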
Hello! First of all, thank you for sharing your good research. Currently, I tested the model on 4 GPUs with a batch size of 64 and top_k = 3, and R@1 came out to 45.9.
Other than changing the batch_size to 128, I ran the code as-is after git clone, with top_k left unchanged. If I change top_k to 4 as in the paper, can I get the reported performance?
Also, does TS2-Net require pre-processing the raw videos (resizing to 224x224), or can I just input the raw videos?
Please refer to the comment above. We report the best performance without DSL.
Currently, we have trained the model with 8 GPUs and a batch size of 128. After the pre-processing, R@1 reaches 46.1, even though we trained from the repository in its raw git-clone state. Is there a problem?
There is no need to change top-K to 4 because, as we said in the comment, "in practice, we manually set CLS as the most important". About the reimplementation: what are your GPU type and torch version? Our experiment environment is V100 + torch 1.7.0, and we also ran experiments on A100 + torch 1.10.0. The R@1 results are almost always around 47.0 (the A100 run is higher and reaches 47.4 R@1). I think you can compare our environment with yours and find the difference. Maybe the pre-processing affects the performance? Besides, GPU type does affect parameter initialization; you can refer to https://pytorch.org/docs/stable/notes/randomness.html
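On the randomness point, one generic way to reduce run-to-run variation in PyTorch is to fix all seeds and force deterministic cuDNN kernels. This is a general sketch (not code from this repository), and it will not remove differences caused by the GPU type itself:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Fix the Python, NumPy and PyTorch RNGs for more reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```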
Currently, our experimental environment uses 8 Nvidia 2080Ti GPUs + torch 1.7.1. There seems to be either a difference in GPU type or a mistake in our pre-processing.
Lastly, you said that I don't need to change top-k, but I do see a performance change when I change top-k manually. How should I interpret this?
Thank you very much for your reply. Once again, thank you for sharing your good research.
Thanks for your work! I tried to reproduce the result on MSRVTT but still get R@1 45.6 with ViT-B/32. My experiment environment is V100 + torch 1.7.1 with the default settings in /script/run_msrvtt.sh. Is it possible that there is some difference in the hyperparameter settings between us? Thanks again.
Hi, we can get 46.4 with ViT-B/32. It works after pre-processing the MSR-VTT dataset following https://github.com/yuqi657/ts2_net/blob/master/preprocess/compress_video.py.
However, we still cannot reach 47.0. If you can get 47.0, please reply to me. Thanks.
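For context, the pre-processing essentially re-encodes each video so that its shorter side is 224 pixels. The snippet below is only a rough sketch of that idea; the fps value and exact ffmpeg flags are assumptions, not necessarily what preprocess/compress_video.py actually does.

```python
import os
import subprocess

def compress_video(input_path: str, output_path: str,
                   short_side: int = 224, fps: int = 3):
    """Re-encode a video so its shorter side is `short_side` pixels.

    Illustrative only: the repo's preprocess/compress_video.py may use
    different ffmpeg flags, but the goal is the same kind of resizing.
    """
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    # Scale the shorter side to `short_side`, keep aspect ratio,
    # and let ffmpeg pick an even value (-2) for the other side.
    vf = f"scale='if(gt(iw,ih),-2,{short_side})':'if(gt(iw,ih),{short_side},-2)'"
    cmd = [
        "ffmpeg", "-y", "-i", input_path,
        "-vf", vf,
        "-r", str(fps),   # assumed output frame rate; adjust to your pipeline
        "-an",            # drop audio, it is not used for retrieval
        output_path,
    ]
    subprocess.run(cmd, check=True)
```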
Hi, I tried the pre-processing you mentioned and got R@1 46.3%, still less than 47%.
Me too, I got 46.4%. I'm waiting for the authors' answer. Thank you.
Waiting for the authors' answer too.
Sorry for the late reply; I have been tied up with other things these days. Since I cannot access your environment, it is hard for me to pinpoint where the problem is. What I can do is provide my environment details: I will update the project with my training logs and checkpoints, along with a Dockerfile for our environment. Sorry, I am busy these days and it takes time to organize them. You can then compare those details with your own environment and find the difference.
We used 4 3090Ti GPUs on MSR-VTT and got the following result:
2022-09-16 14:56:15,685:INFO: Text-to-Video:
2022-09-16 14:56:15,685:INFO: >>> R@1: 46.1 - R@5: 73.1 - R@10: 83.0 - R@50: 96.1 - Median R: 2.0 - Mean R: 12.8
2022-09-16 14:56:15,685:INFO: Video-to-Text:
2022-09-16 14:56:15,685:INFO: >>> V2T$R@1: 46.0 - V2T$R@5: 73.8 - V2T$R@10: 83.9 - V2T$R@50: 96.4 - V2T$Median R: 2.0 - V2T$Mean R: 8.7
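For anyone double-checking their own evaluation, numbers like R@K and Median R in logs like this are typically computed from the text-video similarity matrix, roughly as in the generic sketch below (not the repo's exact metric code):

```python
import numpy as np

def retrieval_metrics(sim_matrix: np.ndarray) -> dict:
    """Compute text-to-video R@K, Median R and Mean R.

    sim_matrix: [num_texts, num_videos]; the ground-truth video for text i
    is assumed to be at column i.
    """
    # Rank of the ground-truth item for each query (0 = retrieved first).
    order = np.argsort(-sim_matrix, axis=1)
    gt = np.arange(sim_matrix.shape[0])[:, None]
    ranks = np.where(order == gt)[1]
    metrics = {f"R@{k}": 100.0 * np.mean(ranks < k) for k in (1, 5, 10, 50)}
    metrics["Median R"] = float(np.median(ranks) + 1)
    metrics["Mean R"] = float(np.mean(ranks) + 1)
    return metrics
```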
Sorry, I missed the earlier question about top-K. Changing top-K does affect the performance; we found that increasing K within a proper range does improve the results.
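For intuition, top-K here controls how many patch tokens are kept per frame in addition to CLS. Below is a rough, hypothetical sketch of score-based top-K token selection; it only illustrates the idea and is not TS2-Net's actual token selection module.

```python
import torch
import torch.nn as nn

class TopKTokenSelector(nn.Module):
    """Keep CLS plus the top-K highest-scoring patch tokens per frame."""

    def __init__(self, dim: int, top_k: int = 3):
        super().__init__()
        self.top_k = top_k
        self.scorer = nn.Linear(dim, 1)   # predicts an importance score per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, num_tokens, dim]; token 0 is CLS.
        cls_tok, patch_toks = tokens[:, :1], tokens[:, 1:]
        scores = self.scorer(patch_toks).squeeze(-1)          # [batch, num_tokens - 1]
        topk_idx = scores.topk(self.top_k, dim=1).indices     # [batch, K]
        idx = topk_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        selected = patch_toks.gather(1, idx)                  # [batch, K, dim]
        return torch.cat([cls_tok, selected], dim=1)          # [batch, K + 1, dim]
```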
For the docker image and dependencies, please refer to issue 3. Thanks for your attention. If you still cannot fully reproduce the work, it may have something to do with the video format, etc. One practice is to report, in your own work, the number you reproduce in your environment and compare your gain against that.
I got similar results to the 3090Ti run above, using 4 A6000 GPUs: R@1: 46.1 - R@5: 73.9 - R@10: 82.8 - R@50: 94.5 - Median R: 2.0 - Mean R: 13.9. I tried to use exactly the same settings as the original ones, so could the problem be the GPU type?