TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models
https://arxiv.org/abs/2402.14289
Apache License 2.0

Time consumed for each VLM pretrain/finetune #67

Closed by vision-time 5 months ago

vision-time commented 5 months ago

Could you please provide the time consumed for each VLM pretrain/finetune? For example, assuming all experiments are conducted on 8x A100 (40G). Thanks a lot.

YingHuTsing commented 5 months ago

Hi. The running time varies a bit across language models and visual encoders: SigLIP takes longer than CLIP-vit-L-p14, and larger LLMs take longer.

Take 4x A100-40G with flash-attention as an example, tuning only the connector in pretraining and the connector + LLM in finetuning:

- LLaVA-1.5 (base) dataset, tinyllama-1.1B + siglip-384: about 2.5 hours pretraining, 6.5 hours finetuning.
- LLaVA-1.5 (base) dataset, stablelm-2-1.6B + siglip-384: about 2.6 hours pretraining, 7.5 hours finetuning.
- LLaVA-1.5 (base) dataset, phi2-2.7B + siglip-384: about 5 hours pretraining, 14 hours finetuning.
- LLaVA-1.5 (base) dataset, phi2-2.7B + clip-l-336: about 4 hours pretraining, 13 hours finetuning.
- Share dataset, phi2-2.7B + siglip: about a day and a half in total.
- openelm-450M is a bit different because it does not support flash-attention; it takes around the same time as tinyllama.
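Since the original question mentioned 8x A100 (40G), here is a minimal sketch (not part of TinyLLaVA_Factory) that tabulates the 4x A100-40G numbers above and naively extrapolates to other GPU counts under a linear-scaling assumption; actual throughput depends on batch size, interconnect, and whether flash-attention is available, so treat the output as a rough ballpark only.

```python
# Rough ballpark of pretrain/finetune wall-clock time on other GPU counts,
# based on the 4x A100-40G timings reported in this thread.
# Assumption (ours, not the authors'): time scales roughly linearly with GPU count.

# (LLM, vision encoder) -> (pretrain hours, finetune hours) on the LLaVA-1.5 (base) dataset
REPORTED_HOURS_4X_A100_40G = {
    ("tinyllama-1.1B", "siglip-384"): (2.5, 6.5),
    ("stablelm-2-1.6B", "siglip-384"): (2.6, 7.5),
    ("phi2-2.7B", "siglip-384"): (5.0, 14.0),
    ("phi2-2.7B", "clip-l-336"): (4.0, 13.0),
}

def estimate_hours(pretrain_h, finetune_h, num_gpus, baseline_gpus=4):
    """Scale the reported hours from baseline_gpus to num_gpus, ignoring communication overhead."""
    scale = baseline_gpus / num_gpus
    return pretrain_h * scale, finetune_h * scale

if __name__ == "__main__":
    for (llm, encoder), (pre_h, ft_h) in REPORTED_HOURS_4X_A100_40G.items():
        est_pre, est_ft = estimate_hours(pre_h, ft_h, num_gpus=8)
        print(f"{llm} + {encoder}: ~{est_pre:.1f} h pretrain, ~{est_ft:.1f} h finetune on 8 GPUs (estimate)")
```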

vision-time commented 5 months ago

Thanks for your explanation.