LLaVA-VL / LLaVA-NeXT

Question regarding multi image inference - import vs demo #90

Open S-Mahoney opened 4 months ago

S-Mahoney commented 4 months ago

Hi, I was testing whether I could reproduce the results from your demo when importing the model in my own code. I was attempting to prompt with two images and then ask for comparisons. The demo performs this very well with two images uploaded:

[image: screenshot of the demo output]

However, uploading two images using the multi-image inference method described in your examples (https://huggingface.co/docs/transformers/en/model_doc/llava_next) does not behave the same way. In addition, when the posted method is run as-is, it describes only the last image (of a snowman) and completely ignores the first two.

What prompt structure is required for this? Or is the public Transformers version not up to date for multi-image inference?

FengLi-ust commented 4 months ago

Hi, are you using the llava-next-interleave model or the original single-image model?

zihaolucky commented 4 months ago

@FengLi-ust I found a similar issue. The interleave demo processes multiple inputs and stacks them into a tensor of shape (img_num, 3, 384, 384),

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/playground/demo/interleave_demo.py#L153-L162

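    # Collect sampled video frames and individual images into one flat list.
    image_list = []
    for f in images_this_term:
        if is_valid_video_filename(f):
            image_list += sample_frames(f, our_chatbot.num_frames)
        else:
            image_list.append(load_image(f))
    # Preprocess each image into a half-precision tensor on the model's device.
    image_tensor = [
        our_chatbot.image_processor.preprocess(f, return_tensors="pt")["pixel_values"][0].half().to(our_chatbot.model.device)
        for f in image_list
    ]

    # Stack along a new leading dimension and emit one image token per new image.
    image_tensor = torch.stack(image_tensor)
    image_token = DEFAULT_IMAGE_TOKEN * num_new_images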

whereas I think it should use the anyres layout, with shape (img_num, k, 3, 384, 384).

Did you have the same problem? @S-Mahoney
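
For reference, the last line of the quoted demo snippet repeats the image placeholder once per uploaded image. Below is a minimal pure-Python sketch of that step. It assumes `DEFAULT_IMAGE_TOKEN` is the string `"<image>"`, as in the LLaVA constants; `build_user_turn` is a hypothetical helper, and joining the placeholders directly in front of the question is an assumption rather than the demo's exact formatting.

```python
# Assumption: mirrors the value of DEFAULT_IMAGE_TOKEN in the LLaVA codebase.
DEFAULT_IMAGE_TOKEN = "<image>"


def build_user_turn(question: str, num_new_images: int) -> str:
    """Prefix the question with one placeholder per new image (hypothetical helper)."""
    if num_new_images < 0:
        raise ValueError("num_new_images must be non-negative")
    # Same expression as the demo: repeat the token once per image.
    image_token = DEFAULT_IMAGE_TOKEN * num_new_images
    return image_token + question


prompt = build_user_turn("What differs between these two images?", 2)
print(prompt)
```

Checking that the number of placeholders in the prompt matches the number of images passed in is one thing worth verifying when only the last image is described, although this thread does not confirm that as the cause.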

FengLi-ust commented 4 months ago

Hi, (img_num, 3, 384, 384) works for our model in the multi-image setting. (img_num, k, 3, 384, 384) also works for our model, for processing a single image with anyres.
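
The two layouts mentioned above can be compared side by side with a small sketch. It only computes shape tuples and does not touch the model; `stacked_shape` is a hypothetical helper, and the value `k=5` is an arbitrary example, not a number taken from the repository.

```python
from typing import Optional, Tuple


def stacked_shape(img_num: int, k: Optional[int] = None,
                  channels: int = 3, size: int = 384) -> Tuple[int, ...]:
    """Return the input shape for img_num images (hypothetical helper).

    k is the number of patches per image in the anyres layout,
    or None for the plain multi-image layout.
    """
    if k is None:
        return (img_num, channels, size, size)
    return (img_num, k, channels, size, size)


multi_image = stacked_shape(2)          # two images, one 384x384 view each
anyres_single = stacked_shape(1, k=5)   # one image split into k patches (k=5 is illustrative)
print(multi_image, anyres_single)
```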