OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

feat: Added judgment logic to support training with plain text data. #281

Open hill2hill opened 2 weeks ago

hill2hill commented 2 weeks ago

The current logic assumes that all input data include images, so data['pixel_values'] must exist for every training sample; however, for purely text inputs, 'pixel_values' does not exist.

Here we simply need to process the dataset so that it is also compatible with text-only input; at the same time, an additional merge is needed on the Hugging Face model side. This addresses the following two issues, which as I understand it are essentially the same problem (a rough sketch of the dataset-side handling follows below).

#221 #250

@Cuiunbo
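
For illustration, a minimal sketch of the kind of dataset-side judgment described above (this is my own assumption, not the PR's actual diff; the collator name and dict layout are hypothetical):

import torch

def data_collator(examples):
    # Stack the text fields that every sample is guaranteed to have.
    batch = {
        "input_ids": torch.stack([e["input_ids"] for e in examples]),
        "labels": torch.stack([e["labels"] for e in examples]),
    }
    # Text-only samples carry no "pixel_values"/"tgt_sizes"; substitute
    # empty lists so the model code can detect them and skip the vision tower.
    batch["pixel_values"] = [e.get("pixel_values", []) for e in examples]
    batch["tgt_sizes"] = [e.get("tgt_sizes", []) for e in examples]
    return batch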

univa-JASON commented 2 weeks ago

Thank you for your work on this. However, when image-text pair data and text-only data are included in the same batch, running the code produces the following error:

Traceback (most recent call last):
  File "/workspace/VLM/Mars/finetune/finetune.py", line 250, in <module>
    train()
  File "/workspace/VLM/Mars/finetune/finetune.py", line 236, in train
    trainer.train()
  File "/opt/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/opt/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/transformers/trainer.py", line 3138, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/VLM/Mars/finetune/trainer.py", line 20, in compute_loss
    vllm_embedding, vision_hidden_states = self.model.get_vllm_embedding(inputs)
  File "/root/.cache/huggingface/modules/transformers_modules/model/modeling_minicpmv.py", line 85, in get_vllm_embedding
    tgt_sizes = torch.vstack(tgt_sizes).type(torch.int32)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 0 but got size 2 for tensor number 1 in the list.

hill2hill commented 2 weeks ago

Here's the situation: anyone who updates to the current GitHub code will inevitably hit this error, because text-only data lacks corresponding tgt_sizes and cannot participate in the image-feature extraction step. That part is defined inside the Hugging Face model code, not in this repository. We need to add an additional precondition there, as I mentioned above; adding two lines is enough.

Since the Hugging Face merge has not been accepted officially yet, we can only modify the code locally, along the lines of the sketch below.
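
For reference, a hedged sketch of the precondition in get_vllm_embedding in modeling_minicpmv.py (variable names follow the traceback above; this illustrates the local patch, it is not the merged code):

# Drop the empty tgt_sizes entries contributed by text-only samples
# before stacking, so torch.vstack only sees image samples.
tgt_sizes = [s for s in tgt_sizes if isinstance(s, torch.Tensor) and s.numel() > 0]
if len(tgt_sizes) > 0:
    tgt_sizes = torch.vstack(tgt_sizes).type(torch.int32)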

univa-JASON commented 2 weeks ago

Thanks for your fast reply, but I got the same error. Here is my local compute_loss code; maybe it is an old version.

def compute_loss(self, model, inputs, return_outputs=False):
    # Pop the labels so they are not passed into the embedding step.
    if "labels" in inputs:
        labels = inputs.pop("labels")
    else:
        labels = None

    # Fuse text and image features; this is where the tgt_sizes error is raised.
    vllm_embedding, vision_hidden_states = self.model.get_vllm_embedding(inputs)
    outputs = self.model.llm(
        inputs_embeds=vllm_embedding,
        use_cache=False,
    )

    if labels is not None:
        # Standard cross-entropy over the flattened logits.
        loss_fct = nn.CrossEntropyLoss()
        logits = outputs.logits.view(-1, self.model.config.vocab_size).contiguous()
        labels = labels.view(-1).long().contiguous()
        labels = labels.to(logits.device)
        loss = loss_fct(logits, labels)
    else:
        if isinstance(outputs, dict) and "loss" not in outputs:
            raise ValueError(
                "The model did not return a loss from the inputs, only the following keys: "
                f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
            )
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

    return (loss, outputs) if return_outputs else loss

hill2hill commented 2 weeks ago

Sorry, I'm actually not familiar with the previous version of the code; maybe you can try the current version. But your compute_loss function looks fine, so the error should only happen inside self.model.get_vllm_embedding(inputs).

Oh, I just noticed that you load the model from the cache? Maybe git clone the model first and then use your local model_path; that will make it easier to modify the code and debug.
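
For example (the local path below is a placeholder; trust_remote_code makes transformers use the modeling_minicpmv.py from that directory, so your local edits take effect):

# First clone the model repo, e.g.:
#   git clone https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5
from transformers import AutoModel, AutoTokenizer

model_path = "./MiniCPM-Llama3-V-2_5"  # local, editable clone
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)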

univa-JASON commented 2 weeks ago

Oh, that's OK. Thank you so much for your help.