BabyChouSr opened this issue 1 month ago
Hi! We'll take a look at this. If I recall correctly, this is due to an inconsistency in the input formats for this model as compared to other HF AutoModelForVision2Seq models and their corresponding processors.
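To illustrate the kind of inconsistency being described (a minimal sketch, not the harness's actual code, and the prompts are made up): llava-style processors expect an `<image>` placeholder to already be present in the prompt text, while idefics2's processor inserts its image tokens via the chat template, so prompt construction that works for one model can miscount placeholders for the other.

```python
from PIL import Image
from transformers import AutoProcessor

img = Image.new("RGB", (336, 336))

# llava-hf: the prompt text itself must already contain the "<image>" placeholder.
llava = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
llava_inputs = llava(
    text="USER: <image>\nWhat is shown? ASSISTANT:", images=img, return_tensors="pt"
)

# idefics2: the chat template inserts the image tokens itself; also passing a
# literal "<image>" string in the text can change the placeholder count.
idefics2 = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown?"}]}
]
prompt = idefics2.apply_chat_template(messages, add_generation_prompt=True)
idefics2_inputs = idefics2(text=prompt, images=[img], return_tensors="pt")
```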
Thanks for the quick reply! It doesn't seem to be just llava-v1.5-7b, however; I have the same issue with Idefics2-8b as well.
Versions:

```
transformers==4.45.1
```

Command:

```bash
lm_eval --model hf-multimodal \
    --model_args pretrained=HuggingFaceM4/idefics2-8b,max_images=2,attn_implementation=flash_attention_2,dtype=bfloat16,convert_img_format=True \
    --tasks mmmu_val \
    --device cuda:0 \
    --batch_size 2
```
Traceback:

```text
Traceback (most recent call last):
  File "/root/.venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/evaluator.py", line 301, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/evaluator.py", line 496, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/models/hf_vlms.py", line 686, in generate_until
    cont = self._model_multimodal_generate(inputs, stop=until, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/models/hf_vlms.py", line 342, in _model_multimodal_generate
    return self.model.generate(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/generation/utils.py", line 2048, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/generation/utils.py", line 3008, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 1603, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 1419, in forward
    inputs_embeds = self.inputs_merger(
                    ^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 1296, in inputs_merger
    new_inputs_embeds[special_image_token_mask] = reshaped_image_hidden_states
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape mismatch: value tensor of shape [640, 4096] cannot be broadcast to indexing result of shape [0, 4096]
```
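For what it's worth, the failing line assigns one image-patch embedding to each `<image>` token position in the text embeddings; the error says 640 patch embeddings were produced while the mask matched zero positions, i.e. no image tokens survived in the processed input_ids. A minimal tensor reproduction of the same failure mode (the shapes here are hypothetical, chosen to match the error message):

```python
import torch

new_inputs_embeds = torch.zeros(2, 512, 4096)                     # (batch, seq, hidden)
special_image_token_mask = torch.zeros(2, 512, dtype=torch.bool)  # no <image> tokens matched
reshaped_image_hidden_states = torch.randn(640, 4096)             # 640 image-patch embeddings

# Raises: RuntimeError: shape mismatch: value tensor of shape [640, 4096]
# cannot be broadcast to indexing result of shape [0, 4096]
new_inputs_embeds[special_image_token_mask] = reshaped_image_hidden_states
```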
I tried vLLM, and I think an additional image token is getting added somewhere in the context. When running

```bash
lm_eval --model vllm-vlm \
    --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1 \
    --tasks mmmu_val_architecture_and_engineering \
    --device cuda:0 \
    --batch_size 1
```

I noticed that inputs[7] has 2
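A quick way to check for duplicated placeholders (a debugging sketch; `prompt` is a stand-in for the text of the affected request) is to count the image tokens directly:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")
image_token_id = tok.convert_tokens_to_ids("<image>")

prompt = "..."  # the prompt string for the request in question
ids = tok(prompt).input_ids
print(sum(t == image_token_id for t in ids))  # expect 1 with max_images=1
```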
Thanks @BabyChouSr, this is helpful. In our testing we found idefics2 would run and avoid this error when setting max_images=2, so that error is surprising to me :( I haven't yet traced back the root cause.

(@baberabb also making you aware of this thread in case you hadn't seen it!)
For idefics2 you need to install the latest transformers from the repo!
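For reference, installing from the repo is the standard pip-from-git install (the thread doesn't pin a specific commit):

```bash
pip install git+https://github.com/huggingface/transformers.git
```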
I tried using vLLM, but it also has an issue: the number of image tokens (4 * 576 = 2304) does not match the number of image placeholders (2305).
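If the 576 here is llava-1.5's per-image token count (a 336x336 input split into 14x14 patches), the expected total for four images works out as below, so the extra placeholder (2305) again points at one stray `<image>` token in the prompt:

```python
image_size, patch_size, num_images = 336, 14, 4      # llava-1.5 vision tower
tokens_per_image = (image_size // patch_size) ** 2   # 24 * 24 = 576
print(num_images * tokens_per_image)                 # 2304 image tokens vs. 2305 placeholders
```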