Closed: yusuke-coder closed this issue 6 months ago
Could you tell me which `target_modules` we should train when we want to fine-tune the model? I want to fine-tune the CLIP visual encoder part too.
I'm writing a script to train the model, using the `MLlavaProcessor()` implementation in this repo, and I got this message: `UserWarning: None of the inputs have requires_grad=True. Gradients will be None`. Could someone help with this?
I think you are training with LoRA? Please refer to this issue for the warning; I think you can just ignore it. https://github.com/kohya-ss/sd-scripts/issues/323
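Not from the thread, but for context: this warning typically appears when gradient checkpointing is combined with LoRA, because the frozen base model's embedding outputs carry no gradient. If you'd rather silence it than ignore it, a minimal sketch using the standard `transformers` workaround (`enable_input_require_grads` is a real `PreTrainedModel` method):

```python
# Sketch: avoid "None of the inputs have requires_grad=True" when using
# gradient checkpointing with LoRA. The base weights are frozen, so we
# ask the model to mark its input embedding outputs as requiring grad.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
```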
> Could you tell me which `target_modules` we should train when we want to fine-tune the model? I want to fine-tune the CLIP visual encoder part too.
We disable tuning of the `vision_tower` at this line:
You can try modifying it to enable the tuning.
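For illustration only, here is a sketch of what enabling vision-tower tuning could look like after loading the model. The attribute name `model.vision_tower` is an assumption based on LLaVA-style architectures; check the actual module path in this repo before using it:

```python
# Sketch: unfreeze the CLIP visual encoder so it is trained as well.
# `model.vision_tower` is an assumed attribute name (LLaVA-style models).
for param in model.vision_tower.parameters():
    param.requires_grad = True
```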
As for which `target_modules` of LoRA to train, we normally apply LoRA to all the linear layers; these layers are found via a function here.
Hope this information is helpful!
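A minimal sketch of what such a helper typically looks like. This is not the repo's exact function, just the common pattern of collecting every `nn.Linear` module name and excluding the output head; `r`, `lora_alpha`, and `lora_dropout` values below are illustrative, not the repo's settings:

```python
import torch.nn as nn
from peft import LoraConfig

def find_all_linear_names(model):
    # Collect the leaf names of every nn.Linear module in the model.
    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            names.add(full_name.split(".")[-1])
    names.discard("lm_head")  # the LM head is usually kept out of LoRA
    return sorted(names)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=find_all_linear_names(model),
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```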
Thank you so much for your help, @jdf-prog.
Please tell me your thoughts on the following papers, which are not mentioned in your paper.
- VILA: On Pre-training for Visual Language Models. This paper also has an interleaved concept; what is the difference from Mantis?
- List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs. Why does Mantis not concatenate multiple images and give them "marks"?
I have another new issue from `trainer.train()` with your latest Hugging Face model when I fine-tune your pre-trained model with LoRA.
```
TypeError: device() received an invalid combination of arguments - got (NoneType), but expected one of:
 * (torch.device device)
      didn't match because some of the arguments have invalid types: (!NoneType!)
 * (str type, int index)
```
I have the following settings for LoRA:
```python
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration
from trl import SFTTrainer

# Load the model in 4-bit with flash attention
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
)

torch.cuda.empty_cache()
torch.backends.cuda.enable_mem_efficient_sdp(False)

model = LlavaForConditionalGeneration.from_pretrained(
    tokenizer_model_id,
    quantization_config=quantization_config,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    # use_flash_attn=True,  # bool
    token=access_token,
    offload_state_dict=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
    peft_config=lora_config,
    dataset_text_field="text",  # need a dummy field
    tokenizer=tokenizer,
    data_collator=data_collator,
    dataset_kwargs={"skip_prepare_dataset": True},
    max_seq_length=512,
)
```
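Not an answer from the thread, but a hedged note on this setup: when LoRA-tuning a 4-bit quantized model, `peft`'s `prepare_model_for_kbit_training` is the usual preparation step before handing the model to the trainer. Whether it resolves this particular `device()` error is untested here:

```python
from peft import prepare_model_for_kbit_training

# Standard preparation for k-bit (4-bit/8-bit) LoRA training: casts layer
# norms to fp32 and enables input gradients for checkpointing. Run this
# before constructing the SFTTrainer.
model = prepare_model_for_kbit_training(model)
```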
VILA is based on pre-training on MMC4, which is 1000x larger than our dataset. Our paper describes quite clearly that our method is far more efficient than the others.