farewellthree / STAN

Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring"
Apache License 2.0

Reproduce on DiDeMo dataset #18

Open dengrui-64 opened 10 months ago

dengrui-64 commented 10 months ago

Hi, we appreciate your two papers and have thoroughly examined them.

The replication process for the MSRVTT results on Mug-STAN was successful, yielding outcomes that closely align with the paper's findings.

However, we ran into difficulties reproducing the results on the DiDeMo dataset. We only reached 46.3% R@1 and 72.4% R@5, both of which fall short of the numbers reported in the paper (49.6% R@1 and 75.3% R@5).

Our reproduced results are shown below. Could you give us some advice on how to reach the reported numbers?

Results:

| Metric | Reproduced | Paper |
| --- | --- | --- |
| R@1 | 46.3 | 49.6 |
| R@5 | 72.4 | 75.3 |
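
For reference, R@K here is the standard text-to-video retrieval recall: the fraction of text queries whose ground-truth video appears among the top-K retrieved videos. A minimal sketch of the computation (illustrative only, not the repo's evaluation code), assuming the ground-truth pair sits on the diagonal of a text × video similarity matrix:

```python
import numpy as np

def recall_at_k(sim, k):
    """sim: [num_texts, num_videos] similarity matrix; ground truth is the diagonal."""
    ranks = (-sim).argsort(axis=1)             # videos sorted by similarity, best first
    gt = np.arange(sim.shape[0])[:, None]      # index of the correct video for each text
    hits = (ranks[:, :k] == gt).any(axis=1)    # is the correct video within the top k?
    return 100.0 * hits.mean()

sim = np.random.randn(1000, 1000)              # dummy scores, just for illustration
print(recall_at_k(sim, 1), recall_at_k(sim, 5))
```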

farewellthree commented 10 months ago

DiDeMo may need more GPUs to keep the batch size at 128. Are both the frame number (64) and the batch size (128) set correctly?

dengrui-64 commented 9 months ago

Thank you for your reply. I have double-checked my DiDeMo experiment configuration and confirmed an effective batch size of 128: the per-GPU training batch size is set to 16 across 8 GPUs. I also noticed that gradient_checkpointing is set to True in mugstan_didemo_b32_hf.py (line 6). Could this parameter affect the results?
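
For reference, a minimal sketch of how the settings discussed above relate to each other (variable names are illustrative assumptions, not the actual fields of mugstan_didemo_b32_hf.py; check the real config in the repo):

```python
# Illustrative only: names and structure are assumptions, not the repo's config fields.
gradient_checkpointing = True   # trades extra compute for lower memory; should not change accuracy
num_frames = 64                 # frames sampled per video for DiDeMo
batch_size_per_gpu = 16
num_gpus = 8

effective_batch_size = batch_size_per_gpu * num_gpus   # 16 * 8 = 128, matching the paper's setting
assert effective_batch_size == 128
```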

farewellthree commented 9 months ago

Theoretically, it should have no effect. Is the testing split correct? In previous works, fine-tuning and zero-shot testing seem to use different splits; see https://github.com/OpenGVLab/unmasked_teacher/blob/main/multi_modality/DATASET.md
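
For anyone hitting the same issue, one quick way to sanity-check which split an annotation file covers is to count its unique videos and compare against the split sizes listed in the linked DATASET.md. A minimal sketch (the file name and JSON structure below are assumptions; adapt them to the annotation format actually used):

```python
import json

# Hypothetical path; substitute the DiDeMo test annotation used for evaluation.
with open("didemo_ret_test.json") as f:
    anns = json.load(f)

# Assumes a list of dicts, each with a "video" field; adjust if the format differs.
video_ids = {a["video"] for a in anns}
print(f"{len(video_ids)} unique videos in this split")
```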

dengrui-64 commented 9 months ago

Thank you for your prompt response. Would it be possible for you to share your annotation files, so that we can align our results exactly?