Open hill2hill opened 2 weeks ago
Thank you for your achievements. However, when image-text pair data and text-only data were included in the same batch, the following error occurred when running the code.
'''
Traceback (most recent call last):
File "/workspace/VLM/Mars/finetune/finetune.py", line 250, in
Here's the situation: whenever someone updates the code on GitHub, this error inevitably occurs, because text-only data has no corresponding tgt_sizes and cannot take part in image feature extraction. That part is defined in the Hugging Face model code, not in this repository. We need to add an additional precondition, as I mentioned here: two lines should be added there.
As the Hugging Face merge has not been accepted officially yet, we can only modify the code locally.
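To make the precondition idea concrete, here is a minimal sketch of the kind of check meant above. It is not the actual two-line patch in the Hugging Face model code; the helper names `sample_has_images` and `split_batch_by_modality` are hypothetical, and the only assumption is that each sample's tgt_sizes is either missing (None), empty, or a non-empty sequence.

```python
from typing import Any, List, Optional, Sequence, Tuple


def sample_has_images(tgt_sizes: Optional[Sequence[Any]]) -> bool:
    """Return True only when a sample carries at least one image size entry."""
    # Hypothetical helper: text-only samples are assumed to have None or [] here.
    return tgt_sizes is not None and len(tgt_sizes) > 0


def split_batch_by_modality(
    batch_tgt_sizes: Sequence[Optional[Sequence[Any]]],
) -> Tuple[List[int], List[int]]:
    """Split batch indices into (samples with images, text-only samples)."""
    image_idx: List[int] = []
    text_idx: List[int] = []
    for i, sizes in enumerate(batch_tgt_sizes):
        if sample_has_images(sizes):
            image_idx.append(i)
        else:
            text_idx.append(i)
    return image_idx, text_idx


# A mixed batch: samples 0 and 3 have images, samples 1 and 2 are text-only.
image_idx, text_idx = split_batch_by_modality(
    [[(14, 14)], [], None, [(7, 7), (7, 7)]]
)
```

With a check like this, image feature extraction would only run for the indices in `image_idx`, and the text-only samples would go straight to the text embedding path.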
Thanks for your fast reply, but I got the same error. Here is my local compute_loss code; maybe it is an old version.
def compute_loss(self, model, inputs, return_outputs=False):
    if "labels" in inputs:
        labels = inputs.pop("labels")
    else:
        labels = None
    vllm_embedding, vision_hidden_states = self.model.get_vllm_embedding(inputs)
    outputs = self.model.llm(
        inputs_embeds=vllm_embedding,
        use_cache=False,
    )
    if labels is not None:
        loss_fct = nn.CrossEntropyLoss()
        logits = outputs.logits.view(-1, self.model.config.vocab_size).contiguous()
        labels = labels.view(-1).long().contiguous()
        labels = labels.to(logits.device)
        loss = loss_fct(logits, labels)
    else:
        if isinstance(outputs, dict) and "loss" not in outputs:
            raise ValueError(
                "The model did not return a loss from the inputs, only the following keys: "
                f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
            )
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
    return (loss, outputs) if return_outputs else loss
Sorry, I'm actually not familiar with the previous version of the code; maybe you can try the current version. But your compute_loss function looks fine; the error should only happen inside self.model.get_vllm_embedding(inputs).
Oh, I just noticed that you load the model from the cache? Maybe git clone the model first and then use your local model_path; that will make it easier to modify the code and debug.
Oh, that's OK. Thank you so much for your help.
The current logic assumes that every input sample includes image inputs, so data['pixel_values'] must match the training samples; for text-only inputs, however, 'pixel_values' does not exist.
Here, we simply need to process the dataset so that it is compatible with text-only input; at the same time, an additional merge is needed on the Hugging Face model side. This addresses the following two issues, which I understand are essentially the same problem as the one mentioned here.
221 #250
@Cuiunbo
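For illustration, the dataset-side change described in this comment could look roughly like the sketch below. It is not the repository's actual preprocessing code; `normalize_sample` and `normalize_batch` are hypothetical helpers, and using empty lists as the placeholder for missing image fields is an assumption. The model-side check still has to accept those empty placeholders.

```python
from typing import Any, Dict, List

# Keys named in this thread that only exist for image-text samples.
IMAGE_KEYS = ("pixel_values", "tgt_sizes")


def normalize_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the sample in which every image key is present.

    Text-only samples get empty lists (an assumed placeholder), so later
    code can rely on the keys existing instead of raising a KeyError.
    """
    out = dict(sample)  # shallow copy, the input sample is not mutated
    for key in IMAGE_KEYS:
        if out.get(key) is None:
            out[key] = []
    return out


def normalize_batch(samples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Apply normalize_sample to every sample of a mixed batch."""
    return [normalize_sample(s) for s in samples]


# One image-text sample and one text-only sample in the same batch.
batch = normalize_batch(
    [
        {"input_ids": [1, 2], "pixel_values": [[0.1]], "tgt_sizes": [(1, 1)]},
        {"input_ids": [3]},
    ]
)
```

After this step both samples expose the same keys, so a mixed batch no longer fails just because 'pixel_values' is missing from the text-only sample.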