QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Apache License 2.0

QQ: Use Qwen2VLModel and Qwen2VLProcessor #166

Open omaraflak opened 2 months ago

omaraflak commented 2 months ago

Hi, thank you for your work!

I'd like to add a regression head on top of the model's output hidden states. From the docs I see:

@add_start_docstrings(
    "The bare Qwen2VL Model outputting raw hidden-states without any specific head on top.",
    QWEN2VL_START_DOCSTRING,
)
class Qwen2VLModel(Qwen2VLPreTrainedModel):

So that's what I need. However, when I try to pass the encoded image and text (from Qwen2VLProcessor) to the model, I get an error that pixel_values is not supported. Indeed, the arguments of the forward method of Qwen2VLModel include input_ids but no pixel_values.

How do I pass the result of Qwen2VLProcessor to Qwen2VLModel?

Thanks

mearcstapa-gqz commented 2 months ago

Hi, have you solved the problem yet? I have exactly the same problem when using Qwen2VLModel.

omaraflak commented 2 months ago

No, I haven't.

mearcstapa-gqz commented 2 months ago

Looking at the source code of Qwen2VLForConditionalGeneration (https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L1690-L1706), it looks like the pixel_values (and image_grid_thw) from Qwen2VLProcessor should first be encoded by Qwen2VisionTransformerPretrainedModel, the resulting image embeddings merged into inputs_embeds, and inputs_embeds then passed to Qwen2VLModel.

But I haven't figured out the particulars yet.
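For what it's worth, the merging step the comment above describes can be sketched in isolation. The idea (based on reading modeling_qwen2_vl.py, not on a confirmed public API) is that the processor fills the image slots in input_ids with a special image pad token, and the model scatters the vision tower's output embeddings into exactly those positions before running the language model. A minimal sketch with dummy tensors, where IMAGE_TOKEN_ID is a placeholder for config.image_token_id:

```python
import torch

# Placeholder: in practice read this from the model config (config.image_token_id).
IMAGE_TOKEN_ID = 151655


def merge_image_embeds(
    inputs_embeds: torch.Tensor,   # (batch, seq_len, hidden) from the embedding layer
    input_ids: torch.Tensor,       # (batch, seq_len) token ids from the processor
    image_embeds: torch.Tensor,    # (num_image_tokens, hidden) from the vision tower
) -> torch.Tensor:
    # Boolean mask of the positions the processor reserved for image features.
    mask = input_ids == IMAGE_TOKEN_ID
    merged = inputs_embeds.clone()
    # Scatter the visual features into those positions, row for row.
    merged[mask] = image_embeds.to(merged.dtype)
    return merged


# Dummy example: a 4-token sequence with two image slots.
inputs_embeds = torch.zeros(1, 4, 8)
input_ids = torch.tensor([[1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 2]])
image_embeds = torch.ones(2, 8)
merged = merge_image_embeds(inputs_embeds, input_ids, image_embeds)
```

The resulting merged tensor would then be passed as inputs_embeds (instead of input_ids) to Qwen2VLModel. Whether the real model applies extra reshaping around the vision tower (e.g. for image_grid_thw) is not covered by this sketch.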