Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

fix: Fix tensor shape error during llava inference. #40

Closed. SeanCraven314 closed this pull request 1 month ago.

SeanCraven314 commented 1 month ago

Hi, thanks for your great work.

As in issue #39, I also encountered the same error: a small tensor dimension error. I added some logic to perform broadcasting, which solved the issue for me.
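A minimal sketch of the kind of broadcasting guard described here, assuming the keyword stopping check compares the trailing generated tokens against per-keyword id tensors (function and variable names are illustrative, not the actual diff):

```python
import torch

def keyword_matches(output_ids: torch.Tensor, keyword_ids: torch.Tensor) -> bool:
    """Check whether the trailing generated tokens match one keyword.

    output_ids:  (batch, seq_len) tensor of generated token ids
    keyword_ids: token ids of the keyword; may arrive as a bare 1-D tensor
    """
    # Broadcast a 1-D keyword tensor up to (1, num_tokens) so the rank
    # check (2 or 3 dims expected) and the comparison below both work.
    if keyword_ids.dim() == 1:
        keyword_ids = keyword_ids.unsqueeze(0)

    n = keyword_ids.shape[-1]
    if output_ids.shape[-1] < n:
        return False

    # Compare the last n tokens of each sequence against the keyword ids.
    tail = output_ids[:, -n:]
    return bool((tail == keyword_ids.to(output_ids.device)).all(dim=-1).any())
```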

I haven't spent much time on this, and it hasn't been tested with all the model weight permutations. I am happy to do this if needed!

Regards,

Sean

joebradly commented 1 month ago

Thanks for your commit. I added the lines you changed; I think line 282 is redundant. I still encounter the keyword tensor shape error:

```
['./demo_images/av.png']
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [01:04<03:14, 64.96s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [02:11<02:12, 66.08s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [03:19<01:06, 66.63s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [03:26<00:00, 43.39s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [03:26<00:00, 51.72s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
input: \n Please describe the traffic condition.
[WARNING] the auto inferred conversation mode is llava_v0, while --conv-mode is vicuna_v1, using vicuna_v1
torch.Size([1, 3, 384, 384])
Traceback (most recent call last):
  File "/home/deping.zhang/code/llm/VILA/run_vila.py", line 153, in <module>
    eval_model(args)
  File "/home/deping.zhang/code/llm/VILA/run_vila.py", line 115, in eval_model
    output_ids = model.generate(
  File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/deping.zhang/code/llm/VILA/llava/model/language_model/llava_llama.py", line 171, in generate
    outputs = self.llm.generate(
  File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/transformers/generation/utils.py", line 2924, in sample
    if stopping_criteria(input_ids, scores):
  File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 132, in __call__
    return any(criteria(input_ids, scores) for criteria in self)
  File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 132, in <genexpr>
    return any(criteria(input_ids, scores) for criteria in self)
  File "/home/deping.zhang/code/llm/VILA/llava/mm_utils.py", line 299, in __call__
    outputs.append(self.call_for_batch(output_ids[i].unsqueeze(0), scores))
  File "/home/deping.zhang/code/llm/VILA/llava/mm_utils.py", line 281, in call_for_batch
    raise ValueError(
ValueError: Keyword tensor should have 2 or 3 dimensions, got 1
```
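For anyone comparing against this trace: the check at `llava/mm_utils.py` line 281 in `call_for_batch` rejects keyword tensors that are not 2-D or 3-D, so a keyword whose token ids are stored as a bare 1-D tensor trips it before any comparison happens. A self-contained sketch of that failure mode; the rank check mirrors the error message, but the rest of the function is only a stand-in for VILA's actual internals:

```python
import torch

def call_for_batch(output_ids: torch.Tensor, keyword_ids: list[torch.Tensor]) -> bool:
    """Stand-in for the per-batch keyword check named in the traceback."""
    for kw in keyword_ids:
        if kw.dim() not in (2, 3):
            raise ValueError(
                f"Keyword tensor should have 2 or 3 dimensions, got {kw.dim()}"
            )
    # The real implementation would compare trailing output tokens here.
    return False

out = torch.zeros(1, 10, dtype=torch.long)
kw = torch.tensor([128001])               # a single-token keyword: dim() == 1

# call_for_batch(out, [kw])               raises "got 1", as in the log above
# call_for_batch(out, [kw.unsqueeze(0)])  passes the rank check
```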