FoundationVision / Groma

[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
https://groma-mllm.github.io/
Apache License 2.0

Is it possible to fine-tune the third stage using only one 24GB GPU? #28

Closed Aoihigashi closed 4 weeks ago

Aoihigashi commented 4 weeks ago

Thank you very much for sharing this work. I would like to run the third-stage fine-tuning of the LLM with my own instruction dataset on a single 24GB RTX 4090. Is there any way to achieve this, or is it simply not feasible? Thank you.

machuofan commented 4 weeks ago

Hi, thanks for your interest in our work. I'm afraid a single 24GB 4090 is not enough for fine-tuning. Fine-tuning an MLLM typically relies on FSDP or DeepSpeed to reduce per-GPU memory usage, so I suggest using multiple GPUs if possible. For example, a 7B LLaVA model can be fine-tuned on 8 RTX 3090s with LoRA.
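
For reference, a minimal sketch of LoRA-based fine-tuning with Hugging Face PEFT, the kind of setup the reply refers to. The checkpoint path, target modules, and hyperparameters below are illustrative assumptions, not Groma's actual training configuration:

```python
# Minimal LoRA fine-tuning sketch with Hugging Face PEFT.
# Assumes a LLaMA-style 7B causal LM backbone; all names and
# hyperparameters here are illustrative, not Groma's settings.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "path/to/your-7b-checkpoint",  # hypothetical checkpoint path
    torch_dtype=torch.float16,     # half precision to cut weight memory
)

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension (assumption)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # typical LLaMA attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Freeze the base model and attach trainable low-rank adapters,
# so only a small fraction of parameters receive gradients.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Even with LoRA, the frozen base weights, activations, and optimizer states of a 7B model generally exceed 24GB at useful batch sizes, which is why the reply recommends sharding across multiple GPUs with FSDP or DeepSpeed rather than a single card.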