Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

Inference not working - Keyword tensor should have 2 or 3 dimensions, got 1 #48

Open signine opened 1 month ago

signine commented 1 month ago

I get the following error while running llava/eval/run_vila.py on an H100 GPU:

root@7513903dd8b0:/src/VILA# python -W ignore llava/eval/run_vila.py     --model-path Efficient-Large-Model/VILA1.5-3b     --conv-mode vicuna_v1     --query "<video>\n Please describe this video."     --video-file "tjx1PPFsa6A-Scene-049.mp4"
Fetching 17 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 203142.93it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.47it/s]
no <image> tag found in input. Automatically append one at the beginning of text.
input:  <image>
<image>
<image>
<image>
<image>
<image>
<video>\n Please describe this video.
[WARNING] the auto inferred conversation mode is llava_v0, while `--conv-mode` is vicuna_v1, using vicuna_v1
torch.Size([6, 3, 384, 384])
Traceback (most recent call last):
  File "/src/VILA/llava/eval/run_vila.py", line 154, in <module>
    eval_model(args)
  File "/src/VILA/llava/eval/run_vila.py", line 116, in eval_model
    output_ids = model.generate(
                 ^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/src/VILA/llava/model/language_model/llava_llama.py", line 171, in generate
    outputs = self.llm.generate(
              ^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
           ^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/utils.py", line 2924, in sample
    if stopping_criteria(input_ids, scores):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/stopping_criteria.py", line 132, in __call__
    return any(criteria(input_ids, scores) for criteria in self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/stopping_criteria.py", line 132, in <genexpr>
    return any(criteria(input_ids, scores) for criteria in self)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/VILA/llava/mm_utils.py", line 298, in __call__
    outputs.append(self.call_for_batch(output_ids[i].unsqueeze(0), scores))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/VILA/llava/mm_utils.py", line 279, in call_for_batch
    raise ValueError(
ValueError: Keyword tensor should have 2 or 3 dimensions, got 1

My torch version is 2.0.1+cu118 and flash-attn is 2.4.2.
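
For anyone digging into the traceback: the failure happens inside VILA's keywords-based stopping criteria in llava/mm_utils.py, which compare the tail of the generated token ids against the token ids of each stop keyword. Below is a minimal sketch of that pattern, reconstructed from the traceback rather than copied from the repo, so the class name and details are illustrative:

import torch
from transformers import StoppingCriteria

class KeywordsStoppingCriteriaSketch(StoppingCriteria):
    """Stop generation once any stop keyword appears at the end of the output.

    Illustrative sketch only, not VILA's actual mm_utils.py code.
    Assumes batch size 1, as in run_vila.py.
    """

    def __init__(self, keywords, tokenizer, input_ids):
        # Pre-tokenize every stop keyword (e.g. "</s>" for vicuna_v1).
        # Real implementations typically also strip a leading BOS token here.
        self.keyword_ids = [torch.tensor(tokenizer(k).input_ids) for k in keywords]
        # Remember the prompt length so only newly generated tokens are checked.
        self.start_len = input_ids.shape[1]

    def __call__(self, output_ids, scores, **kwargs) -> bool:
        for keyword_id in self.keyword_ids:
            keyword_id = keyword_id.to(output_ids.device)
            n = keyword_id.shape[0]
            # Guard: if fewer than n new tokens exist, the tail slice below is
            # shorter than the keyword, and the elementwise == comparison raises
            # the "size of tensor a ... must match" RuntimeError seen later in
            # this thread.
            if output_ids.shape[1] - self.start_len < n:
                continue
            if (output_ids[0, -n:] == keyword_id).all():
                return True
        return False

The ValueError in the log above comes from a shape check on the keyword tensors in call_for_batch that the since-reverted PR evidently tripped; the guard comment in the sketch explains the related size-mismatch RuntimeError that shows up further down this thread.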

Efficient-Large-Language-Model commented 1 month ago

Sorry, we merged a PR yesterday and it was problematic. We just rolled back. Could you pull and try again?
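
For anyone updating an existing checkout, the steps look roughly like this (the path matches the session above; substitute your own clone location, and the last step is only needed if the package was installed editable and its metadata changed):

cd /src/VILA      # wherever the repo was cloned
git pull          # picks up the rollback commit
pip install -e .  # optional: refresh an editable install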

franckmarineai commented 1 month ago

also using torch 2.0.1+cu118 and flash attention 2.4.2 and got this error:

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Traceback (most recent call last):
  File "/MarineAI/Nvidia-VILA/VILA/llava/eval/run_vila.py", line 154, in <module>
    eval_model(args)
  File "/MarineAI/Nvidia-VILA/VILA/llava/eval/run_vila.py", line 116, in eval_model
    output_ids = model.generate(
  File "/MarineAI/fbarilla/anaconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/MarineAI/Nvidia-VILA/VILA/llava/model/language_model/llava_llama.py", line 171, in generate
    outputs = self.llm.generate(
  File "/MarineAI/fbarilla/anaconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/MarineAI/fbarilla/anaconda3/envs/vila/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File "/MarineAI/fbarilla/anaconda3/envs/vila/lib/python3.10/site-packages/transformers/generation/utils.py", line 2924, in sample
    if stopping_criteria(input_ids, scores):
  File "/MarineAI/fbarilla/anaconda3/envs/vila/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 132, in __call__
    return any(criteria(input_ids, scores) for criteria in self)
  File "/MarineAI/fbarilla/anaconda3/envs/vila/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 132, in <genexpr>
    return any(criteria(input_ids, scores) for criteria in self)
  File "/MarineAI/Nvidia-VILA/VILA/llava/mm_utils.py", line 287, in __call__
    outputs.append(self.call_for_batch(output_ids[i].unsqueeze(0), scores))
  File "/MarineAI/Nvidia-VILA/VILA/llava/mm_utils.py", line 272, in call_for_batch
    if (output_ids[0, -keyword_id.shape[0] :] == keyword_id).all():
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 0

Efficient-Large-Language-Model commented 1 month ago

Are you using llama3? If so, you need to pass --conv-mode=llama_3
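
For example, adapting the command from the first post (the model path here is a stand-in for whichever Llama-3-based VILA checkpoint you are running; check the exact repo id on the Hugging Face Hub):

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/Llama-3-VILA1.5-8b \
    --conv-mode llama_3 \
    --query "<video>\n Please describe this video." \
    --video-file "tjx1PPFsa6A-Scene-049.mp4"

The "Setting `pad_token_id` to `eos_token_id`:128001" line in the log is itself a hint: 128001 is Llama 3's end-of-text token id, so the backbone is Llama 3 and the conversation template must match it.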

franckmarineai commented 1 month ago

Sorry, I did not pay attention to this parameter... it works now. Thanks a lot!

signine commented 1 month ago

@Efficient-Large-Language-Model Pulling the latest code worked for me. Thank you!