smuelpeng closed this issue 7 months ago
Hi, thanks for your attention to our work. Could you provide more detail about your CUDA/PyTorch versions and your GPU type? I used 4× A5000 GPUs for my best experiments (with the PyTorch version provided in the repo). Also, please check whether your videos are compressed to 3 FPS at 224×224. Hope it helps!
-Ziyang
I compressed the videos with the script at https://github.com/Ziyang412/UCoFiA/blob/main/train/preprocess/compress_video.py#L26. This script scales each video so that its shorter side is 224 pixels, rather than resizing it to a uniform resolution of 224x224. I'm not sure whether this method is appropriate.
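To make the distinction in the post above concrete, here is a minimal sketch of the two resizing policies. It is not taken from compress_video.py; the function names are hypothetical, and rounding the longer side to an even number is an assumption (many video encoders require even dimensions).

```python
import math


def shorter_side_size(width: int, height: int, target: int = 224) -> tuple[int, int]:
    """Output size when the SHORTER side is scaled to `target`, keeping aspect ratio.

    Hypothetical helper for illustration; the longer side is rounded to the
    nearest even number (an assumption, not something the thread specifies).
    """
    def round_even(x: float) -> int:
        return int(math.floor(x / 2 + 0.5)) * 2

    if width >= height:
        # Landscape or square: height is the shorter side.
        return round_even(width * target / height), target
    # Portrait: width is the shorter side.
    return target, round_even(height * target / width)


def fixed_square_size(width: int, height: int, target: int = 224) -> tuple[int, int]:
    """Output size when every video is forced to target x target (aspect ratio not kept)."""
    return target, target


if __name__ == "__main__":
    for w, h in [(1280, 720), (720, 1280), (320, 240)]:
        print((w, h), "shorter-side:", shorter_side_size(w, h),
              "fixed:", fixed_square_size(w, h))
```

With shorter-side scaling, a 1280x720 clip comes out at 398x224 rather than 224x224, so whether the two policies give the model the same input depends on any cropping or resizing done later in the data loader.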
For my experiments, I worked with four 3090 GPUs in an environment that included Python version 3.8, Torch version 1.12.1, and CUDA version 11.6.
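For anyone reporting their setup in a thread like this, a small script along the following lines can collect the details asked for above (Python, PyTorch, and CUDA versions plus GPU names). This is only a sketch: the function name `describe_environment` is hypothetical, and it falls back gracefully if PyTorch is not installed.

```python
import platform


def describe_environment() -> dict:
    """Collect Python / PyTorch / CUDA / GPU details for a bug report (illustrative helper)."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter; report that and stop.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    if torch.cuda.is_available():
        info["gpus"] = [torch.cuda.get_device_name(i)
                        for i in range(torch.cuda.device_count())]
    else:
        info["gpus"] = []
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```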
I think the problem is the GPU. I saw in another repo that 20xx/30xx cards give lower performance than A-series cards. I am using 4× NVIDIA A5000 GPUs for all my experiments. I highly recommend checking on other servers. Sorry about that, and hope it helps.
I tried again on a configuration of 4 A800 GPUs and obtained a result of 48.3, but there is still a gap from 49.4. Additionally, while testing the provided pytorch_model.bin.5, I encountered several error messages, which leads me to suspect that some algorithmic details might be missing from the released code. I have attached a screenshot for your reference. I will try again on A5000 GPUs. I remain grateful for your contribution to the community!
Hey, thanks for the detailed investigation of our work. I think I know where the problem is, and I have updated differential_topk.py under both the /train and /eval_v2t roots. Please try the latest version and let me know whether it fixes the problem. Thanks.
Dear Ziyang,
Firstly, I'd like to express my appreciation for your work on UCoFiA, which has been highly inspiring. However, I've encountered a discrepancy in the reproduction performance, particularly with the R@1 results, which are lower than those reported in your paper.
Using the provided weights (link), my Text-to-Video R@1 result on original videos is 49.6%. This contrasts with the results obtained during training, where the best checkpoint achieved an R@1 of 47.6%, and testing on this checkpoint at the end of training yielded 46.6%. Further testing using the eval_msrvtt.sh script from the eval_t2v folder resulted in an R@1 of 47.5%.
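For readers comparing the numbers above, the following is a rough sketch of how text-to-video R@K is commonly computed from a text-by-video similarity matrix. It is not the UCoFiA evaluation code. It assumes that text query i matches video i (ground truth on the diagonal) and that ties are resolved in favor of the ground truth; the function name `recall_at_k` is hypothetical.

```python
def recall_at_k(sim: list[list[float]], k: int) -> float:
    """Percentage of text queries whose ground-truth video ranks in the top k.

    sim[i][j] is the similarity between text i and video j.
    Assumption: the matching video for text i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        gt_score = row[i]
        # Rank = number of videos scored strictly higher than the ground truth.
        rank = sum(1 for score in row if score > gt_score)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


if __name__ == "__main__":
    # Toy 3x3 example: texts 0 and 2 retrieve their own video first;
    # text 1 ranks its own video last.
    toy = [
        [0.9, 0.1, 0.2],
        [0.3, 0.2, 0.8],
        [0.1, 0.2, 0.7],
    ]
    for k in (1, 2, 3):
        print(f"R@{k}: {recall_at_k(toy, k):.1f}")
```

Because R@1 is a count of correctly ranked queries divided by the test-set size, a difference of a point or two corresponds to a fairly small number of queries changing rank, which is one reason results can shift with hardware, preprocessing, or random seeds.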
I have attempted to address this issue by following the recommendations in this GitHub issue, including using the 'test-original' setting and adjusting the training epochs to 5, 8, and 15. Unfortunately, these adjustments did not resolve the discrepancy.
Could you provide any guidance or suggestions on how to align the reproduction results more closely with those reported in your paper? Any advice or additional troubleshooting steps would be greatly appreciated.
Thank you for your time and assistance.
Best regards,
Zhipeng