TIGER-AI-Lab / Mantis

Official code for the paper "Mantis: Multi-Image Instruction Tuning" (TMLR 2024)
https://tiger-ai-lab.github.io/Mantis/
Apache License 2.0

UserWarning: None of the inputs have requires_grad=True. Gradients will be None #2

Closed: yusuke-coder closed this issue 6 months ago

yusuke-coder commented 7 months ago

I'm writing a script to train the model, using the MLlavaProcessor() implementation from this repo, and I got this message: UserWarning: None of the inputs have requires_grad=True. Gradients will be None. Could someone help with this?

yusuke-coder commented 7 months ago

Could you tell me which target_modules we should train when we want to fine-tune the model? I want to fine-tune the CLIP visual encoder part too.

jdf-prog commented 7 months ago

I'm writing a script to train the model, using the MLlavaProcessor() implementation from this repo, and I got this message: UserWarning: None of the inputs have requires_grad=True. Gradients will be None. Could someone help with this?

I think you are training with LoRA? Please refer to this issue for the warning; I think you can just ignore it. https://github.com/kohya-ss/sd-scripts/issues/323
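
If you want to silence it rather than ignore it, here is a minimal sketch of the usual workaround (assuming gradient checkpointing is on and the base weights are frozen for LoRA; the checkpoint name is a placeholder, not necessarily the Mantis one):

from transformers import LlavaForConditionalGeneration

# Placeholder checkpoint for illustration; substitute the model you are actually training.
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
model.gradient_checkpointing_enable()

# Registers a forward hook on the input embeddings so their outputs have
# requires_grad=True even though the base weights are frozen, which removes
# the warning and lets gradients reach the LoRA adapters.
model.enable_input_require_grads()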

jdf-prog commented 7 months ago

Could you tell me which target_modules we should train when we want to fine-tune the model? I want to fine-tune the CLIP visual encoder part too.

We disable the tuning of vision_tower at this line:

You can try modifying it to enable the tuning.
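
For illustration, a rough sketch of what enabling it could look like after loading (this assumes the model exposes the encoder as model.vision_tower; the attribute name in the actual training code may differ):

# Unfreeze the vision encoder so it is updated during fine-tuning.
# `model.vision_tower` is an assumed attribute path.
for param in model.vision_tower.parameters():
    param.requires_grad = True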

As for which target_modules of LoRA to train, we normally apply LoRA to all the linear layers; these layers are found via a function here
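
As a rough illustration of what such a function typically does (a hypothetical re-implementation, not the repo's actual code):

import torch.nn as nn

def find_all_linear_names(model, exclude=("lm_head",)):
    # Collect the leaf names of every nn.Linear module so they can be
    # passed as LoraConfig(target_modules=...).
    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            leaf = full_name.split(".")[-1]
            if leaf not in exclude:
                names.add(leaf)
    return sorted(names)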

Hope this information is helpful!

yusuke-coder commented 7 months ago

Thank you for all of your help, @jdf-prog.

Please tell me your thoughts on the following papers, which are not mentioned in your paper.

VILA: On Pre-training for Visual Language Models - This paper also has an interleaved-image concept; what is the difference from Mantis?

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs - Why does Mantis not concatenate multiple images and give them "marks"?

I also hit a new issue from trainer.train() with your latest Hugging Face model when I fine-tune your pre-trained model with LoRA.

TypeError: device() received an invalid combination of arguments - got (NoneType), but expected one of:
 * (torch.device device)
      didn't match because some of the arguments have invalid types: (!NoneType!)
 * (str type, int index)

I have the following settings for LoRA.

import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

# Load the base model in 4-bit for LoRA fine-tuning.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
)

torch.cuda.empty_cache()
torch.backends.cuda.enable_mem_efficient_sdp(False)
model = LlavaForConditionalGeneration.from_pretrained(
    tokenizer_model_id,
    quantization_config=quantization_config,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    # use_flash_attn=True,  # bool
    token=access_token,
    offload_state_dict=True,
)

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
    peft_config=lora_config,
    dataset_text_field="text",  # need a dummy field
    tokenizer=tokenizer,
    data_collator=data_collator,
    dataset_kwargs={"skip_prepare_dataset": True},
    max_seq_length=512,
)
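
For reference, a minimal sketch of a lora_config that could pair with the trainer call above (representative, assumed values for illustration):

from peft import LoraConfig

# Hypothetical LoRA settings; adjust rank, alpha, dropout and target_modules as needed.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
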
wenhuchen commented 6 months ago

VILA is based on pre-training on MM-C4, which is 1000x larger than our dataset. Our paper describes quite clearly that our method is far more efficient than the others.