IMCCretrieval / ProST

Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval (ICCV 2023 Oral)
Apache License 2.0

Reproducing results for different datasets #2

Closed. Jeonghoon4 closed this issue 9 months ago.

Jeonghoon4 commented 10 months ago

Hello. Thank you for sharing your interesting research. I was able to reproduce the results on the DiDeMo dataset easily with the demo script (run_didemo.sh). However, when attempting to reproduce the MSR-VTT-9k results, I used the bash script below with dataloader_msrvtt_retrieval.py, following CLIP4Clip and TS2-Net. Unfortunately, I obtained results lower than the values reported in the paper: after experimenting with five different seed values, I achieved R@1 scores ranging from 44.8 to 47.0, while the paper reports 48.2. Could you provide demo scripts for the other datasets?

DATA_PATH=/data4/datasets/videos/MSRVTT
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
main_retrieval.py --do_train --eval_in_train --num_thread_reader=8 --seed 0  \
--epochs=5 --batch_size=128 --n_display=50 \
--train_csv ${DATA_PATH}/MSRVTT_train.9k.csv \
--val_csv ${DATA_PATH}/MSRVTT_JSFUSION_test.csv \
--data_path ${DATA_PATH}/MSRVTT_data.json \
--features_path ${DATA_PATH}/compressed_videos_224_fps3 \
--output_dir ckpts/reproduce/msrvtt \
--datatype msrvtt --expand_msrvtt_sentences \
--cross_num_hidden_layers 4 \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 4 \
--feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0  --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header seqTransf \
--pretrained_clip_name ViT-B/32 --max_patch 12 --max_word_pro 28
IMCCretrieval commented 9 months ago

Thank you for your attention. It is true that factors such as the random seed and the experimental environment affect the model's results. For our MSR-VTT experiments, the random seed was 42 and the mask ratio in the frame decoder was 0.5.
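
For anyone trying to align runs: seeding usually has to cover Python's random module, NumPy, and all CUDA devices. Below is a minimal sketch of the usual CLIP4Clip-style helper; check main_retrieval.py for the repository's actual implementation.

import os
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Seed every RNG the training loop touches. Note that cuDNN can still
    # introduce nondeterminism unless deterministic algorithms are forced.
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)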

lucas0214 commented 5 months ago

(Quoting Jeonghoon4's original question and MSR-VTT training script from above.)

Hello, why can't I find these JSON files?

video_json_path_dict["train"] = os.path.join(self.data_path, "train_data_mp4.json")
video_json_path_dict["val"] = os.path.join(self.data_path, "test_data_mp4.json")
video_json_path_dict["test"] = os.path.join(self.data_path, "test_data_mp4.json")

Jeonghoon4 commented 5 months ago

Hello, I obtained the JSON files with the DiDeMo data. There are three of them: train_data.json, val_data.json, and test_data.json.

Following the note "According to compress_video.py, we convert the original video to mp4 format and set fps to 3.", I compressed the videos to mp4 and replaced every video name in the JSON files (e.g., asdf1234.avi -> asdf1234.mp4). I also renamed the JSON files accordingly (e.g., train_data.json -> train_data_mp4.json).
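
For concreteness, a minimal sketch of that renaming step (assumed file locations; it does a raw text substitution, which is safe only if ".avi" appears nowhere but in video names):

from pathlib import Path

# Rewrite each DiDeMo annotation file so video names end in .mp4, then save it
# under the *_mp4.json name the dataloader expects.
for split in ("train", "val", "test"):
    src = Path(f"{split}_data.json")  # assumed to sit in the working directory
    dst = src.with_name(f"{split}_data_mp4.json")
    dst.write_text(src.read_text().replace(".avi", ".mp4"))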

lucas0214 commented 5 months ago

OK. Thank you.

musicman217 commented 5 months ago

(Quoting Jeonghoon4's original question and MSR-VTT training script from above.)

Hello, can you share which seeds you used on MSR-VTT? I reproduced this work with seeds 0 and 42 but reached at best R@1 = 46.3, across different versions of PyTorch, NumPy, and Transformers.

Jeonghoon4 commented 4 months ago

(Quoting the original question and musicman217's follow-up about seeds from above.)

Hello, I previously reproduced the code but did not reach the R@1 reported in the paper. In my experience reproducing other papers, matching the exact R@1 is difficult, so I usually check whether the reproduced results roughly align with the reported R@5, R@10, and MeanR. By changing the seed, I recall reaching similar or slightly lower performance; however, I no longer have the experimental results, as they were deleted.
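
For sanity-checking a run against those metrics, they can be recomputed from the text-to-video similarity matrix; below is a generic sketch, not ProST's exact evaluation code.

import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    # sim[i, j] = similarity of text i to video j; ground truth is the diagonal.
    order = np.argsort(-sim, axis=1)            # videos ranked best-first per text
    gt = np.arange(sim.shape[0])[:, None]
    rank = np.where(order == gt)[1] + 1         # 1-indexed rank of the true video
    return {
        "R@1": 100.0 * float(np.mean(rank <= 1)),
        "R@5": 100.0 * float(np.mean(rank <= 5)),
        "R@10": 100.0 * float(np.mean(rank <= 10)),
        "MeanR": float(np.mean(rank)),
    }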

musicman217 commented 4 months ago

(Quoting the earlier exchange about seeds and reproduced R@1 from above.)

Thanks for your reply. I recently reproduced this work again: I adjusted the mask ratio in the frame decoder to 1.0 on MSR-VTT, which means each frame attends to essentially all object prototypes in the whole video. It reached 47.4 R@1 on 2x RTX 4090, though that is still short of the paper's 48.2.

musicman217 commented 1 month ago

Hello, I recently reproduced this code again and found that R@1 can reach 48.0, close to the 48.2 reported in the paper, with batch size 64 and mask ratio 1.0 on MSR-VTT, running on 2 RTX 4090 GPUs. Note that the multi-GPU setup significantly influences performance.
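
One likely reason the GPU count matters: CLIP-style retrieval losses contrast each text against every video in the global batch, so implementations typically all-gather features across ranks before building the similarity matrix, and changing the number of GPUs changes the pool of negatives. A generic sketch of that gathering step (not ProST's actual code):

import torch
import torch.distributed as dist

def gather_features(feats: torch.Tensor) -> torch.Tensor:
    # All-gather features from every rank so the loss sees the global batch.
    world_size = dist.get_world_size()
    bucket = [torch.zeros_like(feats) for _ in range(world_size)]
    dist.all_gather(bucket, feats)
    bucket[dist.get_rank()] = feats  # keep the autograd graph for the local shard
    return torch.cat(bucket, dim=0)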