TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models
https://arxiv.org/abs/2402.14289
Apache License 2.0
658 stars 68 forks

Error when inferencing TinyLLaVA-1.5B #8

Closed eternalding closed 8 months ago

eternalding commented 8 months ago

Greetings. When running inference with TinyLLaVA-1.5B, a runtime error occurs with the following message: "RuntimeError: The size of tensor a (2) must match the size of tensor b (7) at non-singleton dimension 0"

I think the root cause is the implementation of KeywordsStoppingCriteria: https://github.com/DLCV-BUAA/TinyLLaVABench/blob/main/tinyllava/mm_utils.py#L240

When matching output_ids against keyword_id, if the current output is shorter than keyword_id, the check if (output_ids[0, -keyword_id.shape[0]:] == keyword_id).all(): compares tensors of unequal size, which raises the error above.

A way to fix this is: if (output_ids.shape[1] >= keyword_id.shape[0]) and (output_ids[0, -keyword_id.shape[0]:] == keyword_id).all():
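The guard can be sketched like this (a list-based stand-in for the tensor logic, so it runs without torch; keyword_matched is an illustrative name, not the repo's API):

```python
def keyword_matched(output_ids, keyword_id):
    """Return True if the generated sequence ends with the keyword.

    output_ids: list of generated token ids for one sequence
    keyword_id: list of token ids making up one stop keyword
    """
    # Without this length guard, torch's element-wise comparison of a
    # short suffix against a longer keyword raises the size-mismatch
    # RuntimeError reported above, so check the lengths first.
    if len(output_ids) < len(keyword_id):
        return False  # not enough tokens generated yet
    return output_ids[-len(keyword_id):] == keyword_id

print(keyword_matched([1, 2], [5, 6, 7, 1, 2, 3, 4]))  # too short -> False
print(keyword_matched([9, 8, 1, 2, 3], [1, 2, 3]))     # suffix matches -> True
```

With the guard in place, a keyword longer than the output so far simply reports "not matched yet" instead of crashing the generation loop.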

If this solution looks good to you, is it OK for me to submit an MR for this update?

Best, YC

baichuanzhou commented 8 months ago

Hi, may I see what prompt caused this problem? Also, we added new descriptions for how to set up inference examples. See here. 😊

eternalding commented 8 months ago

I'm using the example from the README.md:

prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"

baichuanzhou commented 8 months ago

Here, we specified what prompt mode you should use to set up inference. For the 1.5B model, use v1 mode.

eternalding commented 8 months ago

Understood. Thanks for the reply, but this also limits the ability to use a custom system prompt. Would it be better to update KeywordsStoppingCriteria so the 1.5B model could support other conv_mode values? If you'd rather not change it, I'll close this issue. Thanks.

baichuanzhou commented 8 months ago

From my perspective, keyword_id should always tokenize into a single input id. TinyLlama's example failed because its tokenizer tokenized <|endoftext|> into more than one id, which in my case should not happen. If you want to change your system prompt, you can add more conv_mode entries in here, following conv_vicuna_v1.
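Adding a conv_mode along the lines of conv_vicuna_v1 might look like the sketch below. The Conversation class here is a minimal stand-in for the template class in tinyllava/conversation.py (field names follow LLaVA-style templates; the custom system prompt and template name are made up for illustration):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Conversation:
    """Minimal stand-in for a LLaVA-style conversation template."""
    system: str
    roles: Tuple[str, str]
    messages: List[Tuple[str, Optional[str]]] = field(default_factory=list)
    sep: str = " "      # separator after user turns (vicuna_v1 style)
    sep2: str = "</s>"  # separator after assistant turns

    def append_message(self, role: str, msg: Optional[str]) -> None:
        self.messages.append((role, msg))

    def get_prompt(self) -> str:
        # vicuna_v1 style: alternate sep / sep2 after each message;
        # an empty assistant message marks where generation starts.
        seps = [self.sep, self.sep2]
        out = self.system + seps[0]
        for i, (role, msg) in enumerate(self.messages):
            out += role + ": " + ((msg + seps[i % 2]) if msg else "")
        return out

# A custom template following conv_vicuna_v1, with a changed system prompt.
conv_custom = Conversation(
    system="You are a careful multimodal assistant.",
    roles=("USER", "ASSISTANT"),
)
conv_custom.append_message("USER", "What is in this image?")
conv_custom.append_message("ASSISTANT", None)
print(conv_custom.get_prompt())
```

Registering such a template under a new conv_mode name would let the 1.5B model keep the v1 stop-token behavior while swapping in a different system prompt.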

eternalding commented 8 months ago

Got it. Thanks. I'll close the issue.