LLaVA-VL / LLaVA-NeXT


LoRA Training Script for Model Based on finetune_onevision #203

Closed: YoungjaeDev closed this issue 5 days ago

YoungjaeDev commented 1 week ago

I want to train a custom model using LoRA based on the model trained with the finetune_onevision script.

  1. Is there a separate LoRA training script available for this purpose?
  2. If not, is there any guidance on how to modify the finetune_onevision script for LoRA training (including tips on training parameters)?

Thank you!

YerongLi commented 1 week ago

I think it is not difficult to add a LoraConfig. By the way, did you get the finetuning script to work with its data.yaml file? https://github.com/LLaVA-VL/LLaVA-NeXT/issues/182#issuecomment-2311573430
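
For what it's worth, a minimal sketch of what adding a `LoraConfig` through PEFT looks like; the checkpoint, rank, and target module names below are illustrative assumptions rather than values taken from the repo's scripts:

```python
# Minimal sketch: wrapping a causal LM with a LoRA adapter via PEFT.
# The checkpoint, rank/alpha, and target_modules are illustrative assumptions
# (OneVision's 7B variant uses a Qwen2-based language tower, whose attention
# projections are named q_proj/k_proj/v_proj/o_proj).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # prints a "trainable params: ... || all params: ..." line
```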

YoungjaeDev commented 1 week ago

@YerongLi

A recent commit seems to have added a yaml file to the script folder.

YerongLi commented 1 week ago

I found it as well, thanks. I find that --lora_enable True will enable LoRA; there is a LoRA branch in train.py.
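
In case it helps, a rough sketch of a LoRA launch assembled from the flags mentioned in this thread; the entry point, checkpoint, data yaml, and output paths are placeholders/assumptions, and the real finetune_onevision script passes many more arguments than shown here:

```python
# Sketch only: a LoRA launch built from the flags discussed in this thread
# (--lora_enable, --lora_r, --lora_alpha, --deepspeed). The entry point,
# model path, data yaml, and output dir are placeholders/assumptions.
import subprocess

cmd = [
    "python", "llava/train/train_mem.py",                              # assumed entry point
    "--deepspeed", "scripts/zero3_offload.json",
    "--lora_enable", "True",
    "--lora_r", "128",
    "--lora_alpha", "256",
    "--model_name_or_path", "lmms-lab/llava-onevision-qwen2-7b-ov",    # assumed base checkpoint
    "--data_path", "path/to/data.yaml",                                # placeholder
    "--output_dir", "./checkpoints/llava-ov-lora",                     # placeholder
]
subprocess.run(cmd, check=True)
```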

YerongLi commented 1 week ago

I find that even with LoRA, the training script runs into OOM with 48 GB of memory (--lora_r 4, --max_length 128).

trainable params: 10,811,392 || all params: 8,041,160,224 || trainable%: 0.1344506476532061

YoungjaeDev commented 1 week ago

> I find that even with LoRA, the training script runs into OOM with 48 GB of memory (--lora_r 4, --max_length 128).
>
> trainable params: 10,811,392 || all params: 8,041,160,224 || trainable%: 0.1344506476532061

Does SWIFT not support LLaVA-OV LoRA fine-tuning?

YerongLi commented 1 week ago

> I find that even with LoRA, the training script runs into OOM with 48 GB of memory (--lora_r 4, --max_length 128).
>
> trainable params: 10,811,392 || all params: 8,041,160,224 || trainable%: 0.1344506476532061
>
> Does SWIFT not support LLaVA-OV LoRA fine-tuning?

Their original code only uses PEFT for LoRA SFT. Let me try SWIFT; this is very new to me.

YoungjaeDev commented 1 week ago

@YerongLi Is this done by customizing the dataset, and if so, how did you configure it?

YerongLi commented 1 week ago

> @YerongLi Is this done by customizing the dataset, and if so, how did you configure it?

Which step are you talking about? I used a subset that their training flow uses: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data

trainable params: 10,811,392 || all params: 8,041,160,224 || trainable%: 0.1344506476532061

One thing I don't understand is why, with 1B trainable parameters, or even 3 million trainable parameters, I still run into an OOM error.
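
For anyone else looking at that dataset, a minimal sketch of pulling a single subset with the `datasets` library; the config name is a placeholder, since the actual subset names are listed on the dataset card:

```python
# Sketch: loading one subset of lmms-lab/LLaVA-OneVision-Data. The dataset is
# organized into named configs; "SUBSET_NAME" is a placeholder to replace with
# a real config name from the dataset card.
from datasets import load_dataset

ds = load_dataset("lmms-lab/LLaVA-OneVision-Data", "SUBSET_NAME")
print(ds)  # shows the available splits and column names
```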

hcwei13 commented 1 week ago

> @YerongLi Is this done by customizing the dataset, and if so, how did you configure it?
>
> Which step are you talking about? I used a subset that their training flow uses: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data
>
> trainable params: 10,811,392 || all params: 8,041,160,224 || trainable%: 0.1344506476532061
>
> One thing I don't understand is why, with 1B trainable parameters, or even 3 million trainable parameters, I still run into an OOM error.

I'm still clueless about how to fine-tune LLaVA-OneVision. Could you share a training script with me? Thanks.

countytown commented 1 week ago

> I find that even with LoRA, the training script runs into OOM with 48 GB of memory (--lora_r 4, --max_length 128).
>
> trainable params: 10,811,392 || all params: 8,041,160,224 || trainable%: 0.1344506476532061

I used --deepspeed scripts/zero3_offload.json with LoRA tuning (--lora_r 128 --lora_alpha 256). The OOM issue is resolved and training can proceed, but there is an error when saving the final model; I am still working on it. BTW, I train on a single A100 40G GPU.
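
On the save error, for what it's worth: a minimal sketch of saving only the LoRA adapter weights with PEFT. The tiny checkpoint below is just a stand-in so the snippet runs; in the OneVision setup the wrapped model would be the full multimodal model, and under ZeRO-3 the sharded weights still need to be gathered on save (e.g. stage3_gather_16bit_weights_on_model_save in the DeepSpeed config).

```python
# Sketch: persisting only the LoRA adapter, not the full base model.
# "facebook/opt-125m" is a small stand-in checkpoint for illustration.
import torch
from peft import LoraConfig, get_peft_model, get_peft_model_state_dict
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model = get_peft_model(base, LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]))

# save_pretrained on a PeftModel writes adapter_config.json plus the adapter
# weights only (a few MB), which sidesteps serializing the frozen base model.
model.save_pretrained("./lora_adapter")

# Or grab the adapter tensors directly, e.g. to checkpoint them manually:
adapter_state = get_peft_model_state_dict(model)
torch.save(adapter_state, "./lora_adapter/adapter_state.bin")
```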

YerongLi commented 6 days ago

> I find that even with LoRA, the training script runs into OOM with 48 GB of memory (--lora_r 4, --max_length 128).
>
> trainable params: 10,811,392 || all params: 8,041,160,224 || trainable%: 0.1344506476532061
>
> I used --deepspeed scripts/zero3_offload.json with LoRA tuning (--lora_r 128 --lora_alpha 256). The OOM issue is resolved and training can proceed, but there is an error when saving the final model; I am still working on it. BTW, I train on a single A100 40G GPU.

Do we really have to use zero3_offload.json? That must be very slow. One thing that is strange to me is that our total number of trainables here is 1B; I even managed to reduce the number of trainables to 300M, and the OOM persists.
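
On the OOM question, a rough back-of-the-envelope sketch (illustrative numbers, not measurements) of why a small trainable-parameter count barely moves the memory needle: the frozen ~8B weights and the activations still have to sit on the GPU, and only the small gradient/optimizer state scales with the LoRA parameters.

```python
# Rough memory accounting for LoRA fine-tuning of an ~8B-parameter model.
# Figures are back-of-the-envelope; the point is that frozen weights and
# activations dominate, not the number of trainable parameters.

GIB = 1024 ** 3

total_params     = 8_041_160_224   # "all params" from the log above
trainable_params = 10_811_392      # "trainable params" from the log above

frozen_weights_bf16 = total_params * 2 / GIB        # 2 bytes per bf16 weight -> ~15 GiB
lora_weights        = trainable_params * 2 / GIB
lora_grads          = trainable_params * 2 / GIB    # gradients only for LoRA params
adam_states         = trainable_params * 8 / GIB    # fp32 m and v for LoRA params

print(f"frozen weights (bf16):        {frozen_weights_bf16:5.1f} GiB")
print(f"LoRA weights + grads + Adam:  {lora_weights + lora_grads + adam_states:5.2f} GiB")

# What LoRA does NOT shrink: activation memory, which grows with
# batch size x sequence length x hidden size x depth. With AnyRes image
# tiling, a single high-resolution sample can contribute thousands of
# vision tokens, so activations (plus framework overhead) can still blow
# past 40-48 GB even when almost nothing is trainable.
```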