PKU-YuanGroup / Chat-UniVi

[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
https://arxiv.org/abs/2311.08046
Apache License 2.0
755 stars · 41 forks

Inference Time Issue #32

Closed HYUNJS closed 4 months ago

HYUNJS commented 5 months ago

Appreciate your efforts in maintaining this project!

When I ran zero-shot VQA inference (generating results) on the MSRVTT dataset, it took 28 hours (using 4 A5000 GPUs) to finish. I understand this is caused by the large number of video-question pairs (~70K), but have you addressed this with a faster dataloader? Alternatively, did you experiment with a small subset during development?
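For the subset idea above, a minimal sketch of what I mean (names like `sample_eval_subset` are mine, not from this repo): draw a fixed, seeded random subset of the video-question pairs so a dev-time evaluation finishes quickly and is reproducible across runs.

```python
import random

def sample_eval_subset(qa_pairs, k=5000, seed=0):
    """Return a reproducible random subset of video-question pairs.

    A seeded Random instance keeps the subset identical across runs,
    so dev-time metrics stay comparable while evaluation is fast.
    """
    rng = random.Random(seed)
    k = min(k, len(qa_pairs))
    return rng.sample(qa_pairs, k)

# Illustrative usage with dummy (video_id, question) tuples:
pairs = [(f"video{i}", f"q{i}") for i in range(70000)]
subset = sample_eval_subset(pairs, k=500)
```

Evaluating on such a subset during development would avoid the 28-hour full pass until a final run is needed.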

Also, a minor question: why is the zero2 setting used for fine-tuning instead of zero3, while the pre-training stage uses zero3? This is the reverse of LLaVA's setting, which uses zero2 for pre-training and zero3 for fine-tuning.
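For context (a sketch of the standard DeepSpeed fields, not this repo's actual config files): the stages differ in what they shard across GPUs, which is why the choice affects fine-tuning memory.

```python
# ZeRO stage 2 shards optimizer state and gradients across GPUs;
# stage 3 additionally shards the model parameters themselves,
# trading extra communication for a lower per-GPU memory footprint.
zero2 = {"zero_optimization": {"stage": 2}}
zero3 = {"zero_optimization": {"stage": 3}}
```

That lower footprint is why zero3 is often the stage people reach for when fine-tuning runs out of memory.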

And may I ask about memory consumption when fine-tuning the 7B model? Even a batch size of 1 runs out of memory on 4 A100 40GB GPUs. If you used LoRA for fine-tuning, could you share the configuration (e.g., lora_r, lora_alpha, etc., and whether the same learning rate was used for mm_projector)?
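To make the mm_projector question concrete, here is a hedged sketch of what I mean by a separate learning rate (the helper name and the default values are illustrative, not taken from this repo): split the model's named parameters into two optimizer groups so the projector can be trained at its own rate while LoRA adapters use the base rate.

```python
def build_param_groups(named_params, base_lr=2e-4, projector_lr=2e-5):
    """Split (name, param) pairs into two optimizer parameter groups.

    Parameters whose name contains "mm_projector" get projector_lr;
    everything else (e.g. LoRA adapter weights) gets base_lr. The
    returned list is in the format torch optimizers accept.
    """
    projector, rest = [], []
    for name, p in named_params:
        (projector if "mm_projector" in name else rest).append(p)
    return [
        {"params": rest, "lr": base_lr},
        {"params": projector, "lr": projector_lr},
    ]
```

The returned list could be passed straight to an optimizer constructor, e.g. `torch.optim.AdamW(build_param_groups(model.named_parameters()))`.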

Thanks!

jpthu17 commented 4 months ago
HYUNJS commented 4 months ago

I see. Thank you for your answer!