LiuRicky / ts2_net

[ECCV2022] A PyTorch implementation of TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

Cannot replicate text-to-video R@1 results on MSR-VTT dataset #3

Closed jzk66596 closed 1 year ago

jzk66596 commented 2 years ago

Dear authors,

First of all, thanks a lot for the great work and for sharing the code. I know there has been a long ongoing discussion about replicating the experimental results on MSR-VTT, but I want to create a new thread to make things clear, especially regarding the configurations needed to replicate the results reported in the paper.

So far we have tried many different hyper-parameter settings (changing the batch size, compressing videos at different fps, different top-k values for frame token selection, random seeds, etc.), and the best text-to-video R@1 results we can get are 46.5 and 47.9 for ViT-B/32 and ViT-B/16 respectively, which fall short of the 47.0 and 49.4 reported in the paper.
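(For anyone comparing numbers: text-to-video R@1 is the percentage of text queries whose ground-truth video is ranked first by similarity. A minimal sketch, assuming ground-truth pairs lie on the diagonal of the similarity matrix; the function and variable names are illustrative, not from this repo:)

```python
def recall_at_k(sim, k=1):
    """Compute Recall@k (in percent) for a square similarity matrix.

    sim[i][j] is the similarity between text i and video j; the
    ground-truth video for text i is assumed to be video i (diagonal).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the ground-truth video = number of videos scored strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)

sim = [[0.9, 0.1], [0.2, 0.8]]
print(recall_at_k(sim, k=1))  # 100.0: both texts rank their own video first
```

Small differences in this metric (e.g. how ties are broken, or whether all captions per video are used) can shift R@1 by a few tenths, which matters at this scale of comparison.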

I understand there are many factors that can affect model training and hence the final evaluation results. Still, could you please share your configs, including the shell scripts with the hyper-parameters for the best R@1, the video compression commands, and, if possible, the model checkpoint(s) for the best evaluation epoch(s)? We would like to use the exact same setting to understand where the problem could be.

Thanks again for your help, and looking forward to your reply.

LiuRicky commented 2 years ago

Part of Dockerfile 1, for CUDA 11.1:

```dockerfile
RUN source ~/.bashrc && conda activate env-3.8.8 \
    && pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html \
    && pip install timm==0.4.12 transformers==4.15.0 fairscale==0.4.4 pycocoevalcap decord \
    && conda install -y ruamel_yaml \
    && pip install numpy opencv-python Pillow pyyaml requests scikit-image scipy tqdm regex easydict scikit-learn \
    && pip install mmcv terminaltables tensorboardX python-magic faiss-gpu imageio-ffmpeg \
    && pip install yacs Cython tensorboard gdown termcolor tabulate xlrd==1.2.0 \
    && pip install ffmpeg-python librosa pydub pytorch_lightning torchlibrosa \
    && pip install gpustat einops ftfy boto3 pandas \
    && pip install git+https://github.com/openai/CLIP.git
```

LiuRicky commented 2 years ago

Part of Dockerfile 2, for CUDA 10.2:

```dockerfile
RUN source ~/.bashrc && conda activate env-3.6.8 \
    && pip install torch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 tensorflow==2.3.0 transformers==4.15.0 mxnet==1.9.0 \
    && pip install numpy opencv-python Pillow pyyaml requests scikit-image scipy tqdm regex easydict scikit-learn \
    && pip install mmcv terminaltables tensorboardX python-magic faiss-gpu imageio-ffmpeg \
    && pip install yacs Cython tensorboard gdown termcolor tabulate xlrd==1.2.0 \
    && pip install ffmpeg-python librosa pydub pytorch_lightning torchlibrosa \
    && pip install line_profiler imagehash cos-python-sdk-v5 thop einops timm pycm \
    && pip install moviepy openpyxl lmdb \
    && pip install qqseg==1.14.1 jieba \
    && pip install ftfy regex tqdm scipy opencv-python boto3 requests pandas
```

Both images listed above are tested and get around 47.0 on MSR-VTT R@1 metric (ViT-B/32).

jzk66596 commented 2 years ago


Thanks a lot for your response and for sharing the setting. Really appreciate your help. In addition, could you please also share the config for the ViT-B/16 experiments on MSR-VTT?

Specifically, (1) what is the video compression fps? (2) What are the training batch size, max frames, and max words? (3) Did you use multiple compute nodes with >8 GPUs to finish ViT-B/16 training? It looks like ViT-B/16 consumes a lot more GPU memory, which could make it very difficult to fit on a single node with 8 GPUs.

Thanks again and looking forward to your reply.

tiesanguaixia commented 1 year ago


Hi! I also want to ask this question. Did you ever get an answer?

LiuRicky commented 1 year ago


Thanks for your attention. (1) Video compression fps = 3. (2) I think all the settings you asked about can be found in the implementation details section of the paper; please check that part for more details. For example, the training batch size is 128. (3) Sure, you can use multiple nodes to train ViT-B/16, but a single node with 8 large-memory GPUs can also finish the training.
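(For reference, fps = 3 re-encoding is commonly done with ffmpeg's `fps` filter. The authors' exact compression commands were not shared in this thread, so the wrapper below is only a sketch under that assumption; the file names and the absence of extra quality flags are illustrative.)

```python
import subprocess

def compress_cmd(src, dst, fps=3):
    """Build an ffmpeg command that re-encodes a video at a fixed frame rate.

    fps=3 matches the value mentioned above; any additional encoding flags
    (codec, bitrate, resolution) used by the authors are unknown.
    """
    return [
        "ffmpeg", "-y",          # overwrite output without prompting
        "-i", src,               # input video
        "-filter:v", f"fps={fps}",  # resample the video stream to `fps` frames/s
        dst,                     # output path
    ]

# Example invocation (requires ffmpeg on PATH):
# subprocess.run(compress_cmd("video.mp4", "video_fps3.mp4"), check=True)
```

Note that with batch size 128 on a single 8-GPU node, the per-GPU batch would be 16, which is consistent with ViT-B/16 fitting only on large-memory cards.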