S-Mahoney opened this issue 4 months ago
Hi, are you using the llava-next-interleave model or the original single-image model?
@FengLi-ust I found a similar issue. The interleave demo processes multiple inputs and stacks them into a tensor of shape (img_num, 3, 384, 384):
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/playground/demo/interleave_demo.py#L153-L162
```python
image_list = []
for f in images_this_term:
    if is_valid_video_filename(f):
        # Videos are expanded into num_frames sampled frames.
        image_list += sample_frames(f, our_chatbot.num_frames)
    else:
        image_list.append(load_image(f))

# Each image becomes a single (3, 384, 384) crop; stacking yields
# (img_num, 3, 384, 384) -- no anyres tiling.
image_tensor = [
    our_chatbot.image_processor.preprocess(f, return_tensors="pt")["pixel_values"][0]
    .half()
    .to(our_chatbot.model.device)
    for f in image_list
]
image_tensor = torch.stack(image_tensor)
image_token = DEFAULT_IMAGE_TOKEN * num_new_images
```
whereas I think it should follow the anyres scheme, with shape (img_num, k, 3, 384, 384), where k is the number of crops per image; see the sketch below.
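For illustration, a minimal pure-PyTorch sketch of that layout (the per-image patch counts here are made up; in the real pipeline they would come from the anyres grid pinpoints):

```python
import torch

# Hypothetical anyres outputs: each image yields a variable number of
# 384x384 crops (base image + grid tiles), i.e. a (k_i, 3, 384, 384) tensor.
per_image_patches = [
    torch.randn(5, 3, 384, 384),  # image 1: base + 4 tiles
    torch.randn(3, 3, 384, 384),  # image 2: base + 2 tiles
]

# Pad every image to the same k so the batch stacks to (img_num, k, 3, 384, 384).
k = max(p.shape[0] for p in per_image_patches)
padded = [
    torch.cat([p, p.new_zeros(k - p.shape[0], 3, 384, 384)])
    for p in per_image_patches
]
image_tensor = torch.stack(padded)
print(image_tensor.shape)  # torch.Size([2, 5, 3, 384, 384])
```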
Did you have the same problem? @S-Mahoney
Hi, (img_num, 3, 384, 384) works for our model in the multi-image setting. (img_num, k, 3, 384, 384) also works for our model when processing a single anyres image.
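For later readers, a minimal sketch of feeding the stacked multi-image tensor through the repo's `generate()` (the checkpoint name is one example; conversation templating is omitted, so treat this as shape plumbing only, not a polished inference script):

```python
import torch
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import tokenizer_image_token
from llava.model.builder import load_pretrained_model

# Checkpoint name is an example; use whichever interleave model you have.
tokenizer, model, image_processor, _ = load_pretrained_model(
    "lmms-lab/llava-next-interleave-qwen-7b", None, "llava_qwen"
)

# Multi-image: one crop per image -> (img_num, 3, 384, 384).
# An anyres single image, (img_num, k, 3, 384, 384), would be passed the same way.
images = torch.randn(2, 3, 384, 384).half().to(model.device)  # stand-in pixels

# One <image> placeholder per image, in order.
prompt = f"{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_IMAGE_TOKEN}\nWhat are the differences between these two images?"
input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

output_ids = model.generate(
    input_ids,
    images=images,
    image_sizes=[(384, 384), (384, 384)],  # original size per input image
    do_sample=False,
    max_new_tokens=128,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```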
Hi, I was just testing to see whether I could reproduce the results from your demo in my own code. I was attempting to prompt with two images and then ask for comparisons. The demo performs this very well with two images uploaded.
However, uploading two images using the multi-image inference method described in your examples (https://huggingface.co/docs/transformers/en/model_doc/llava_next) does not behave in the same way: when run, it only describes the last image (of a snowman), completely ignoring the first two.
I was wondering what prompt structure is required for this? Or is the public Transformers version not up to date for multi-image inference?
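For concreteness, here is the pattern I was attempting, adapted from the linked docs. The image URLs are placeholders, and the two-`<image>` prompt format is my assumption (the docs only show the single-image format); whether the released version expands both placeholders correctly is exactly what I'm asking:

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder URLs; substitute the two images to compare.
image_a = Image.open(requests.get("https://example.com/image_a.jpg", stream=True).raw)
image_b = Image.open(requests.get("https://example.com/image_b.jpg", stream=True).raw)

# One <image> placeholder per image, in the same order as the image list.
prompt = "[INST] <image>\n<image>\nWhat are the differences between these two images? [/INST]"
inputs = processor(images=[image_a, image_b], text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```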