NVlabs / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

Fine tuning and --evaluation_strategy argument #122

Open lyluh opened 2 months ago

lyluh commented 2 months ago

I'm trying to get fine-tuning working through the 3_sft.sh script but am encountering an error:

Traceback (most recent call last):
  File "/root/VILA/llava/train/train_mem.py", line 36, in <module>
    train()
  File "/root/VILA/llava/train/train.py", line 436, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2738, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2761, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1735, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/VILA/llava/model/language_model/llava_llama.py", line 133, in forward
    outputs = self.llm.forward(
TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'seqlens_in_batch'

Commenting out the seqlens_in_batch argument where self.llm.forward() is called lets the script run, but when I try to get validation scores by setting --evaluation_strategy to something other than "no", I get errors related to the dataloader and the dataset 'inputs':

Traceback (most recent call last):
  File "/root/VILA/llava/train/train_mem.py", line 36, in <module>
    train()
  File "/root/VILA/llava/train/train.py", line 436, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2262, in _maybe_log_save_evaluate
    dataset_metrics = self.evaluate(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3022, in evaluate
    output = eval_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3212, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3429, in prediction_step
    loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2761, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/VILA/llava/model/language_model/llava_llama.py", line 102, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/root/VILA/llava/model/llava_arch.py", line 261, in prepare_inputs_labels_for_multimodal
    if vision_tower is None or images is None or input_ids.shape[1] == 1:
IndexError: tuple index out of range

Any suggestions?
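
For reference, the last frame reads input_ids.shape[1], so the IndexError suggests that input_ids reached that line with fewer than two dimensions during evaluation (for example a 1-D tensor with no batch axis). Below is a minimal sketch of that failure mode. It uses plain tuples in place of torch.Size, which is a tuple subclass. The helper name has_sequence_dim is hypothetical and not part of VILA, and the shapes are made-up illustrations, not values from this run.

```python
def has_sequence_dim(shape):
    """Return True if `shape` has a second axis, i.e. shape[1] is safe to read."""
    return len(shape) >= 2


# Illustrative shapes only (assumed, not taken from the run above).
batched = (4, 128)   # (batch, sequence): shape[1] exists
unbatched = (128,)   # single 1-D sequence: there is no shape[1]

try:
    unbatched[1]
    message = None
except IndexError as err:
    # Same failure as in the traceback: indexing past the end of the shape tuple.
    message = str(err)
```

Logging input_ids.shape just before the failing line in llava_arch.py would confirm whether the evaluation batches are missing an axis.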

Lyken17 commented 2 months ago

TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'seqlens_in_batch'

This error is usually caused by an incomplete environment install. Please follow the instructions in environment_setup.sh to set up the environment.
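
One way to check whether the installed environment matches what the training code expects is to inspect the signature of LlamaForCausalLM.forward for the seqlens_in_batch parameter. The sketch below is a hedged diagnostic, not part of VILA: the helper names accepts_kwarg and check_llama_forward are hypothetical, and it assumes only that transformers exposes LlamaForCausalLM at the top level.

```python
import inspect


def accepts_kwarg(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    for param in inspect.signature(func).parameters.values():
        # A **kwargs catch-all accepts any keyword name.
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return True
        if param.name == name and param.kind in keyword_kinds:
            return True
    return False


def check_llama_forward(kwarg="seqlens_in_batch"):
    """Check the installed LlamaForCausalLM.forward for `kwarg`.

    Returns None when transformers cannot be imported.
    """
    try:
        from transformers import LlamaForCausalLM
    except ImportError:
        return None
    return accepts_kwarg(LlamaForCausalLM.forward, kwarg)
```

Calling check_llama_forward() inside the training environment should return True once the setup script has been applied. A result of False would be consistent with the TypeError above.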