littletomatodonkey opened 5 months ago
It seems that CogVLM does not support multiple images: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B/blob/main/modeling_cogvlm.py#L790
However, it would be fairly easy to support multiple images for HF backend inference: `im_mask` and `position_ids` must be passed in and dynamically changed between the prefill and decoding stages. Could you please take a look? Thanks!
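To make the idea concrete, here is a minimal sketch of how a prefill-stage image mask and position ids could be built for a prompt that interleaves several images with text. This is an illustration only: `build_prefill_inputs` and its segment layout are assumptions for this sketch, not CogVLM's actual API.

```python
import torch

def build_prefill_inputs(segments):
    """Hypothetical helper: build an im_mask and position_ids for a prompt
    described as interleaved segments, e.g.
    [("text", 4), ("image", 9), ("text", 3), ("image", 9)].
    im_mask is True at image-token positions; position_ids are plain
    causal positions (a simplifying assumption)."""
    im_mask = torch.cat([
        torch.full((length,), kind == "image", dtype=torch.bool)
        for kind, length in segments
    ])
    position_ids = torch.arange(im_mask.numel())
    return im_mask, position_ids

# Usage: two images embedded in text.
im_mask, position_ids = build_prefill_inputs(
    [("text", 2), ("image", 3), ("text", 1)]
)
```

The key point is that the mask is derived from the segment layout at runtime rather than hard-coded to a single image span at a fixed offset.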
May I ask where the example of multi-image input using the HF backend is? I see `assert images is None or len(images) <= 1, f"not support multi images by now."` in their code.
Also see https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B/discussions/7 https://github.com/THUDM/CogVLM/issues/358
Yes, the original Hugging Face repo does not support multi-image inference, but we can modify the code to support it at low cost.
As their authors say, this may lead to accuracy issues and unpredictable results.
We retrained the model from scratch to support multiple images, also with a different LLM backbone. `im_mask` is common to both CogVLM and InternLM-XComposer2, so maybe you can take it into consideration when preparing model inputs. Thanks!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Hi @littletomatodonkey, do you still have any further issue or question now? If not, we'll close it soon.
Hi, thanks for your work on CogVLM. Recently I ran into a problem: it seems that CogVLM supports only one image as input, and the image must be placed in a fixed position. The vision start and end are fixed in the `__init__` process: https://github.com/NVIDIA/TensorRT-LLM/blob/db4edea1e1359bcfcac7bbb87c1b639b5611c721/examples/cogvlm/convert_checkpoint.py#L276
Is there any method to support CogVLM with any number of images in arbitrary positions? Thanks!
I want to add `vision_mask` as an input, but it seems I cannot pass it during the decoding process (the shapes differ between the prefill and decoding stages).
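One way around the shape mismatch is to make the mask per-step: during prefill the mask covers the whole prompt, while during decoding each step feeds a single new token, which is always a text token, so the per-step mask is just a length-1 all-False tensor. A minimal sketch of that idea (the helper name and signature are assumptions, not part of CogVLM or TensorRT-LLM):

```python
import torch

def step_vision_mask(prefill_mask, past_len):
    """Hypothetical helper: return the vision mask for the current step.

    - Prefill (past_len == 0): the full prompt-length mask, marking which
      positions hold image tokens.
    - Decoding (past_len > 0): a single False, matching the single-token
      input of each decode step, since generated tokens are always text.
    """
    if past_len == 0:
        return prefill_mask
    return torch.zeros(1, dtype=torch.bool)

# Usage: a 4-token prompt whose middle two tokens are image tokens.
prompt_mask = torch.tensor([False, True, True, False])
prefill = step_vision_mask(prompt_mask, past_len=0)   # full mask
decode = step_vision_mask(prompt_mask, past_len=4)    # length-1 mask
```

This keeps the runtime input shape consistent with the single-token decode input, at the cost of the model having to look up image positions from the cached prefill mask (or the KV cache) rather than from the per-step input.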