LLaVA-VL / LLaVA-NeXT


Problem running LLaVA-NeXT-Video-34B-DPO #43

Closed Marlod390 closed 4 weeks ago

Marlod390 commented 1 month ago

Dear authors,

thank you for your great work. I have tested LLaVA-NeXT-Video-7B-DPO on various videos and it shows excellent results. But when I try to run the 34B-DPO model, I encounter the following error:

Traceback (most recent call last):
  File "/mnt/qb/work/ponsmoll/pba178/project/LLaVA-NeXT/batch.py", line 151, in <module>
    run_inference()
  File "/mnt/qb/work/ponsmoll/pba178/project/LLaVA-NeXT/batch.py", line 133, in run_inference
    output_ids = model.generate(inputs=input_ids, images=video, attention_mask=attention_masks, modalities="video", do_sample=True, temperature=0.2, max_new_tokens=1024, use_cache=True, stopping_criteria=[stopping_criteria])
  File "/mnt/qb/work/ponsmoll/pba178/.conda/llavan/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/qb/work/ponsmoll/pba178/project/LLaVA-NeXT/llavavid/model/language_model/llava_llama.py", line 120, in generate
    return super().generate(position_ids=position_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, **kwargs)
  File "/mnt/qb/work/ponsmoll/pba178/.conda/llavan/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/qb/work/ponsmoll/pba178/.conda/llavan/lib/python3.10/site-packages/transformers/generation/utils.py", line 1576, in generate
    result = self._sample(
  File "/mnt/qb/work/ponsmoll/pba178/.conda/llavan/lib/python3.10/site-packages/transformers/generation/utils.py", line 2760, in _sample
    unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores)
  File "/mnt/qb/work/ponsmoll/pba178/.conda/llavan/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 137, in __call__
    is_done = is_done | criteria(input_ids, scores, **kwargs)
  File "/mnt/qb/work/ponsmoll/pba178/project/LLaVA-NeXT/llavavid/mm_utils.py", line 245, in __call__
    outputs.append(self.call_for_batch(output_ids[i].unsqueeze(0), scores))
  File "/mnt/qb/work/ponsmoll/pba178/project/LLaVA-NeXT/llavavid/mm_utils.py", line 234, in call_for_batch
    if (output_ids[0, -keyword_id.shape[0]:] == keyword_id).all():
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 0
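For context: call_for_batch slices the last keyword_id.shape[0] tokens of output_ids, and since generation here seems to be started from inputs_embeds (see llava_llama.py in the traceback), the sequence transformers passes to the stopping criteria may hold only newly generated tokens. Early in decoding that slice is then shorter than a multi-token keyword. A minimal standalone sketch of the failing comparison (illustrative values, not the repository's code):

import torch

# Illustrative values only: 2 tokens generated so far, stop keyword
# tokenized to 3 ids.
output_ids = torch.tensor([[101, 102]])
keyword_id = torch.tensor([7, 8, 9])

suffix = output_ids[0, -keyword_id.shape[0]:]  # slice clamps to length 2
try:
    (suffix == keyword_id).all()
except RuntimeError as e:
    # The size of tensor a (2) must match the size of tensor b (3) ...
    print(e)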

sykuann commented 1 month ago

I am facing the same issue..

gyfastas commented 1 month ago

Same issue here. How can it be fixed?

Update: I modified mm_utils.py as follows, and it now works for me:

def call_for_batch(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
    offset = min(output_ids.shape[1] - self.start_len, self.max_keyword_len)
    self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids]
    for keyword_id in self.keyword_ids:
        # Fix: if fewer tokens have been generated than the keyword is long,
        # the suffix slice is shorter than keyword_id and the elementwise
        # comparison raises a size-mismatch error, so skip this keyword.
        if output_ids.shape[1] < keyword_id.shape[0]:
            continue
        if (output_ids[0, -keyword_id.shape[0]:] == keyword_id).all():
            return True
    # Fall back to matching the decoded text against the keyword strings.
    outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0]
    for keyword in self.keywords:
        if keyword in outputs:
            return True
    return False

Note: this might change the behavior of the stopping criteria. In my case, the model started to repeat words.
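A quick way to check whether the token-level match can ever fire for a given model is to look at how the stop string tokenizes, since a multi-token keyword can only match after at least that many tokens have been generated. A hypothetical probe (the candidate stop strings below are guesses; check your conversation template's actual separator):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("lmms-lab/LLaVA-NeXT-Video-34B-DPO")
# Candidate stop strings; the right one depends on the conv template in use.
for stop_str in ["</s>", "<|im_end|>"]:
    ids = tok(stop_str, add_special_tokens=False).input_ids
    print(repr(stop_str), "->", ids)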

ZhangYuanhan-AI commented 1 month ago

Hi, please share the command you used.

ZhangYuanhan-AI commented 1 month ago

bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-34B-DPO mistral_direct 16 2 True XXX.mp4

works well on my side

Marlod390 commented 1 month ago

I hard-coded the parameters into my inference script. After changing vicuna_v1 to mistral_direct as in your command, it worked. But compared to the 7B version, the 34B model's answers contain a lot of "in the image" and "in the frame", which may not be what a video VLM should output. Do you have the same problem? If not, there may be something wrong with my code.
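For anyone hard-coding the demo the same way, this is roughly the part that has to agree with the checkpoint: the conversation template supplies the stop string that KeywordsStoppingCriteria later matches. A sketch based on the llavavid modules in the traceback above (question, tokenizer, and input_ids are placeholders, and the exact attribute names are assumptions):

from llavavid.conversation import conv_templates, SeparatorStyle
from llavavid.mm_utils import KeywordsStoppingCriteria

# Build the prompt with the template that matches the checkpoint
# (mistral_direct for the 34B-DPO model, per the maintainer's command).
conv = conv_templates["mistral_direct"].copy()
conv.append_message(conv.roles[0], question)  # placeholder: your prompt text
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# The template also determines the stop string the criteria look for.
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)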

ZhangYuanhan-AI commented 1 month ago

Hi, our training data includes images as well, so many of the instructions contain phrases like "in the image"; as a result, the current model sometimes generates "in the image".

We are currently focusing on solving this!