NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

CogVLM just supports one image as input in the fixed place #1790

Open littletomatodonkey opened 5 months ago

littletomatodonkey commented 5 months ago

Hi, thanks for your work on CogVLM. Recently I ran into a problem: it seems that CogVLM supports only one image as input, and that image must be placed in a fixed position, as shown below. The vision start and end positions are fixed during the `__init__` process.

https://github.com/NVIDIA/TensorRT-LLM/blob/db4edea1e1359bcfcac7bbb87c1b639b5611c721/examples/cogvlm/convert_checkpoint.py#L276

Is there any method to support CogVLM with an arbitrary number of images in arbitrary positions? Thanks!

I want to add `vision_mask` as an input, but it seems I cannot add it during the decoding process (the shapes are not the same for prefill and decoding).
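A minimal sketch of the shape mismatch being described (the token id and function names are hypothetical, not TensorRT-LLM APIs): during prefill the mask spans the whole prompt, while each decode step processes a single new token, so a mask computed per step has a different sequence length than the prefill mask.

```python
import numpy as np

IMAGE_TOKEN = -100  # hypothetical placeholder id marking vision tokens


def build_vision_mask(token_ids, image_token_id):
    """Return 1 where a token is an image token, 0 elsewhere."""
    return (np.asarray(token_ids) == image_token_id).astype(np.int32)


# Prefill: the mask covers the full prompt, including multiple images
# interleaved with text in free positions.
prompt = [1, 2, IMAGE_TOKEN, IMAGE_TOKEN, 3, IMAGE_TOKEN, 4]
prefill_mask = build_vision_mask(prompt, IMAGE_TOKEN)   # shape (7,)

# Decode: each step sees exactly one new (text) token, so the per-step
# mask has length 1 -- a different shape from the prefill mask, which is
# why a single static-shaped engine input does not cover both stages.
decode_mask = build_vision_mask([5], IMAGE_TOKEN)       # shape (1,)

print(prefill_mask.tolist())  # [0, 0, 1, 1, 0, 1, 0]
print(decode_mask.tolist())   # [0]
```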

hijkzzz commented 5 months ago

It seems that CogVLM does not support multiple images: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B/blob/main/modeling_cogvlm.py#L790

littletomatodonkey commented 5 months ago

But it is quite easy to support multiple images for HF backend inference; `im_mask` and the position ids just need to be passed in and changed dynamically between the prefill and decoding stages. Could you please take a look? Thanks!

hijkzzz commented 5 months ago

> But it is quite easy to support multiple images for HF backend inference; `im_mask` and the position ids just need to be passed in and changed dynamically between the prefill and decoding stages. Could you please take a look? Thanks!

May I ask where the example of multi-image input using the HF backend is? I see `assert images is None or len(images) <= 1, f"not support multi images by now."` in their code.

See also https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B/discussions/7 and https://github.com/THUDM/CogVLM/issues/358

littletomatodonkey commented 5 months ago

Yes, the original Hugging Face repo does not support multi-image inference, but we can modify the code to support it at low cost.

hijkzzz commented 5 months ago

> Yes, the original Hugging Face repo does not support multi-image inference, but we can modify the code to support it at low cost.

As their authors say, this may lead to accuracy issues and unpredictable results.

littletomatodonkey commented 5 months ago

We retrained the model from scratch to support multiple images, also with a different LLM backbone. The `im_mask` mechanism is common to both CogVLM and InternLM-XComposer2, so maybe you can take it into consideration when preparing model inputs. Thanks!
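A hedged sketch of the kind of `im_mask`-based routing being referred to (all weights and names here are illustrative, not the actual CogVLM or InternLM-XComposer2 code): each token is routed through either an image-expert projection or a text projection depending on the mask, which is why the mask must stay consistent across prefill and decode.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8
W_text = rng.standard_normal((hidden, hidden))   # weights for text tokens
W_image = rng.standard_normal((hidden, hidden))  # weights for image tokens


def routed_linear(x, im_mask):
    """x: (seq, hidden); im_mask: (seq,), 1 for image tokens, 0 for text.

    Computes both projections and selects per token via the mask.
    """
    out_text = x @ W_text.T
    out_image = x @ W_image.T
    m = im_mask[:, None].astype(x.dtype)
    return m * out_image + (1.0 - m) * out_text


x = rng.standard_normal((5, hidden))
im_mask = np.array([0, 1, 1, 0, 0])
y = routed_linear(x, im_mask)

# Image-token rows match the image projection; text rows match the text one.
assert np.allclose(y[1], x[1] @ W_image.T)
assert np.allclose(y[0], x[0] @ W_text.T)
```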

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

nv-guomingz commented 1 day ago

Hi @littletomatodonkey, do you still have any further issues or questions? If not, we'll close this soon.