farewellthree / STAN

Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring"
Apache License 2.0

Reproduce on DiDeMo dataset #18

Open dengrui-64 opened 6 months ago

dengrui-64 commented 6 months ago

Hi, we appreciate your two papers and have thoroughly examined them.

The replication process for the MSRVTT results on Mug-STAN was successful, yielding outcomes that closely align with the paper's findings.

However, we encountered some difficulties while attempting to replicate the DiDeMo dataset. Our achieved scores were only 46.3% on R@1 and 72.4% on R@5, both of which fall short of the reported results in the paper (49.6% on R@1 and 75.3% on R@5).

Can you give us some advice on how to attain the reported results?

farewellthree commented 6 months ago

DiDeMo may need more GPUs to keep the batch size at 128. Are the frame number (64) and batch size (128) both correct?

dengrui-64 commented 6 months ago

Thank you for your reply. I have reviewed my experiment configuration on DiDeMo and confirmed that batch_size=128 is used: specifically, I set the per-GPU batch size to 16 and trained on 8 GPUs. I also noticed that gradient_checkpointing is set to True in mugstan_didemo_b32_hf.py (Line6). Could this parameter have an impact on the results?
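
For clarity, this is how I understand the effective batch size; the variable names below are my own sketch and may not match the actual keys in mugstan_didemo_b32_hf.py:

```python
# Hedged sketch of the settings under discussion; key names are assumptions,
# not the real config fields.
num_frames = 64                # frames sampled per DiDeMo video
per_gpu_batch_size = 16        # samples per GPU
num_gpus = 8                   # effective batch size = 16 * 8 = 128
gradient_checkpointing = True  # recomputes activations in backward to save memory
```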

farewellthree commented 6 months ago

Theoretically it has no effect. Is the testing split correct? In previous work, fine-tuning and zero-shot testing appear to use different splits. See https://github.com/OpenGVLab/unmasked_teacher/blob/main/multi_modality/DATASET.md
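
As a quick, hypothetical sanity check (the file name and JSON layout below are only placeholders), you could count the entries in your local test annotation and compare the number against the split referenced in that DATASET.md:

```python
# Hypothetical check: file name and annotation format are assumptions.
import json

with open("didemo_ret_test.json") as f:
    anns = json.load(f)
print(f"{len(anns)} entries in the local DiDeMo test annotation file")
```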

dengrui-64 commented 6 months ago

Thank you for your prompt response. Would it be possible for you to provide us with your annotation files? This would allow us to align our results with yours accurately.