-
Hi,
I'm trying to constrain the generation of my VLMs using this repo; however, I can't figure out how to customize the pipeline to handle inputs (query + image). Whereas it is documented as …
-
Could you please also add vision/video transformer models? Thanks in advance.
-
Hello, we encountered the error `Cannot import name '_init_vit_weights' from 'timm.models.vision_transformer'` while trying to replicate your method. This might be due to changes in the timm version tha…
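In case it helps while waiting for a fix: a hedged workaround sketch, assuming the error comes from a timm API change. Newer timm releases dropped the private `_init_vit_weights` helper, so one option is to try the old name and fall back to the newer public name (`init_weights_vit_timm` is an assumption based on the timm >= 0.6 API); another is to pin an older timm release that still exports it (e.g. `pip install "timm==0.4.12"`, version pin is an assumption — check this repo's requirements).

```python
# Hedged workaround: try the old private helper first, then the newer public
# name; both names here are assumptions about timm's evolving API, so verify
# against the timm version this repo was developed with.
try:
    from timm.models.vision_transformer import _init_vit_weights as init_vit_weights  # timm <= 0.4.x
except ImportError:
    try:
        from timm.models.vision_transformer import init_weights_vit_timm as init_vit_weights  # timm >= 0.6 (assumed)
    except ImportError:
        init_vit_weights = None  # timm not installed, or the API changed again

print(init_vit_weights)
```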
-
### Describe the bug
`transformers` added `sdpa` and FA2 support for the CLIP model in https://github.com/huggingface/transformers/pull/31940. It now initializes the vision model like https://github.com/huggingf…
-
Hi, I'm wondering whether the model weights are convertible between the HF checkpoint and the open_clip checkpoint.
HF model weight: https://huggingface.co/wkcn/TinyCLIP-ViT-40M-32-Text-19M-LAION400M
open_clip model: htt…
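Not an answer for TinyCLIP specifically, but conversion between the two formats is usually a key-rename pass over the state dict, since HF CLIP and open_clip name the same tensors differently. A minimal sketch; the prefix table below is purely hypothetical (a real TinyCLIP conversion needs the full mapping, which would have to be derived by diffing both checkpoints' keys).

```python
def remap_keys(state_dict, rename):
    """Rename state-dict keys by longest-prefix rewrite; unmatched keys pass through."""
    out = {}
    for key, value in state_dict.items():
        for old, new in rename.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        out[key] = value
    return out

# Hypothetical prefix mapping (illustrative only, not the real TinyCLIP table).
rename = {
    "vision_model.encoder.": "visual.transformer.",
    "text_model.encoder.": "transformer.",
}

converted = remap_keys({"vision_model.encoder.layers.0.weight": 1}, rename)
print(converted)  # → {'visual.transformer.layers.0.weight': 1}
```

After remapping, loading with `strict=True` (or comparing the two key sets) is a quick way to check that the mapping is complete.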
-
Hi, I get this error when preprocessing text with the mSigLIP model. Any idea what may be wrong? I didn't change anything in the [demo colab](https://colab.research.google.com/github/google-research…
-
How does GLM-4V handle high-resolution image inputs, and how does it differ from CogVLM?
![image](https://github.com/user-attachments/assets/ee3e5f1b-7a4f-4ab6-9926-1bfddef3ba83)
Where in the project code is the High-Resolution Cross-Module shown in the figure implemented?
Thanks!
-
I see that the multi-modal models in the examples all use TensorRT directly to deploy the vision encoders; why not use TensorRT-LLM? Are there known issues or challenges associated with integrating Context…
-
Hi friends!
I'd like to share our recent project, embodied-agents (https://github.com/mbodiai/embodied-agents), which makes it easy to integrate large multi-modal models into existing robot stacks wi…
-
![image](https://github.com/user-attachments/assets/7bef6dbb-ffb4-4037-add0-7035c2909867)