BabyChouSr opened this issue 1 month ago
Hi! We'll take a look at this. If I recall correctly, this is due to an inconsistency in the input formats for this model as compared to other HF AutoModelForVision2Seq models and their corresponding processors.
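To illustrate the kind of inconsistency being described (a minimal sketch, not the harness's actual code, and the prompts are made up): llava-style processors expect an `<image>` placeholder to already be present in the prompt text, while idefics2's processor inserts its image tokens via the chat template, so prompt construction that works for one model can miscount placeholders for the other.

```python
from PIL import Image
from transformers import AutoProcessor

img = Image.new("RGB", (336, 336))

# llava-hf: the prompt text itself must already contain the "<image>" placeholder.
llava = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
llava_inputs = llava(
    text="USER: <image>\nWhat is shown? ASSISTANT:", images=img, return_tensors="pt"
)

# idefics2: the chat template inserts the image tokens itself; also passing a
# literal "<image>" string in the text can change the placeholder count.
idefics2 = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown?"}]}
]
prompt = idefics2.apply_chat_template(messages, add_generation_prompt=True)
idefics2_inputs = idefics2(text=prompt, images=[img], return_tensors="pt")
```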
Thanks for the quick reply! It doesn't seem to be just llava-v1.5-7b, however; I have the same issue with Idefics2-8b as well.
Versions:

```
transformers==4.45.1
```

Command:

```bash
lm_eval --model hf-multimodal \
    --model_args pretrained=HuggingFaceM4/idefics2-8b,max_images=2,attn_implementation=flash_attention_2,dtype=bfloat16,convert_img_format=True \
    --tasks mmmu_val \
    --device cuda:0 \
    --batch_size 2
```
Traceback:

```text
Traceback (most recent call last):
  File "/root/.venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/evaluator.py", line 301, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/evaluator.py", line 496, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/models/hf_vlms.py", line 686, in generate_until
    cont = self._model_multimodal_generate(inputs, stop=until, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/lm-evaluation-harness/lm_eval/models/hf_vlms.py", line 342, in _model_multimodal_generate
    return self.model.generate(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/generation/utils.py", line 2048, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/generation/utils.py", line 3008, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 1603, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 1419, in forward
    inputs_embeds = self.inputs_merger(
                    ^^^^^^^^^^^^^^^^^^^
  File "/root/.venv/lib/python3.12/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 1296, in inputs_merger
    new_inputs_embeds[special_image_token_mask] = reshaped_image_hidden_states
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape mismatch: value tensor of shape [640, 4096] cannot be broadcast to indexing result of shape [0, 4096]
```
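For what it's worth, the failing line assigns one image-patch embedding to each `<image>` token position in the text embeddings; the error says 640 patch embeddings were produced while the mask matched zero positions, i.e. no image tokens survived in the processed input_ids. A minimal tensor reproduction of the same failure mode (the shapes here are hypothetical, chosen to match the error message):

```python
import torch

new_inputs_embeds = torch.zeros(2, 512, 4096)                     # (batch, seq, hidden)
special_image_token_mask = torch.zeros(2, 512, dtype=torch.bool)  # no <image> tokens matched
reshaped_image_hidden_states = torch.randn(640, 4096)             # 640 image-patch embeddings

# Raises: RuntimeError: shape mismatch: value tensor of shape [640, 4096]
# cannot be broadcast to indexing result of shape [0, 4096]
new_inputs_embeds[special_image_token_mask] = reshaped_image_hidden_states
```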
I tried vLLM, and I think an additional image token is getting added somewhere in the context. When running

```bash
lm_eval --model vllm-vlm \
    --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1 \
    --tasks mmmu_val_architecture_and_engineering \
    --device cuda:0 \
    --batch_size 1
```

I noticed that inputs[7] has 2
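A quick way to check for duplicated placeholders (a debugging sketch; `prompt` is a stand-in for the text of the affected request) is to count the image tokens directly:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")
image_token_id = tok.convert_tokens_to_ids("<image>")

prompt = "..."  # the prompt string for the request in question
ids = tok(prompt).input_ids
print(sum(t == image_token_id for t in ids))  # expect 1 with max_images=1
```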
Thanks @BabyChouSr, this is helpful. In our testing we found idefics2 would run and avoid this error when setting max_images=2, so that error is surprising to me :( I haven't yet traced back the root cause.

(@baberabb also making you aware of this thread in case you hadn't seen it!)
For idefics2 you need to install the latest transformers from the repo!
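For reference, installing from the repo is the standard pip-from-git install (the thread doesn't pin a specific commit):

```bash
pip install git+https://github.com/huggingface/transformers.git
```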
I tried using vLLM, but it also has an issue: the number of image tokens (4 * 576 = 2304) does not match the number of image placeholders (2305).
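If the 576 here is llava-1.5's per-image token count (a 336x336 input split into 14x14 patches), the expected total for four images works out as below, so the extra placeholder (2305) again points at one stray `<image>` token in the prompt:

```python
image_size, patch_size, num_images = 336, 14, 4      # llava-1.5 vision tower
tokens_per_image = (image_size // patch_size) ** 2   # 24 * 24 = 576
print(num_images * tokens_per_image)                 # 2304 image tokens vs. 2305 placeholders
```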