dusty-nv / NanoLLM

Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
https://dusty-nv.github.io/NanoLLM/
MIT License

How to support other models? #23

Closed PredyDaddy closed 4 months ago

PredyDaddy commented 5 months ago

Hello, I need to deploy some VLMs that NanoLLM doesn't support yet, such as Qwen-VL, on Orin. Can I find out how to add other VLMs to NanoLLM?

Many many thanks!

dusty-nv commented 4 months ago

Hi @PredyDaddy, supporting quantized VLMs first requires you to have the LLM working (Qwen, in your case) along with any vision encoders / projectors it uses. The VLM pipeline then gets configured in NanoLLM.config_vision() and NanoLLM.init_vision().
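For a sense of what those hooks have to map, here is a small sketch that just inspects the Llava-style vision metadata in a checkpoint's config.json (the `liuhaotian/llava-v1.5-7b` repo id and the key names are assumptions based on Llava's published configs, not NanoLLM code):

```python
# Peek at the vision metadata a Llava-style checkpoint carries -- this is
# the kind of information config_vision() has to map for each VLM family.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download('liuhaotian/llava-v1.5-7b', 'config.json')
cfg = json.load(open(path))

print(cfg['mm_vision_tower'])    # e.g. 'openai/clip-vit-large-patch14-336'
print(cfg['mm_projector_type'])  # e.g. 'mlp2x_gelu'
```

A new VLM family like Qwen-VL stores this information under different keys (or bakes it into custom modeling code), which is why each family needs its own branch in the vision setup.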

Most of the ones so far have been Llava-based, so they follow the same CLIP/SigLIP -> mm_projector -> llama flow; however, I am currently adding support for OpenVLA, which uses different vision encoders. Hopefully I will be able to check that code in soon, and it will serve as a better example of how alternate VLMs are supported alongside Llava-esque ones.
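To make that flow concrete, here is a minimal, self-contained sketch of the Llava-style pipeline with randomly initialized projector weights (in a real VLM the projector weights come from the checkpoint; the layer sizes below assume CLIP ViT-L/14-336 feeding a 4096-dim Llama):

```python
# Sketch of the Llava-style flow: vision encoder -> mm_projector -> LLM embeddings.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

vision = CLIPVisionModel.from_pretrained('openai/clip-vit-large-patch14-336')
processor = CLIPImageProcessor.from_pretrained('openai/clip-vit-large-patch14-336')

image = Image.new('RGB', (336, 336))  # stand-in for a real input image

# 1. encode the image into patch features (Llava takes a late hidden layer)
pixels = processor(images=image, return_tensors='pt').pixel_values
patches = vision(pixels, output_hidden_states=True).hidden_states[-2]

# 2. project patch features into the LLM's embedding space
#    (a 2-layer GELU MLP, as in Llava-1.5's 'mlp2x_gelu' projector)
mm_projector = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096),
)
image_embeds = mm_projector(patches)

# 3. these embeddings get spliced into the token embedding sequence at the
#    <image> placeholder position, then the LLM runs as usual
print(image_embeds.shape)  # (1, num_patches, 4096)
```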

PredyDaddy commented 3 months ago

Hello dusty, I am currently researching Visual Language Models (VLM) for scene analysis and grounding tasks using prompts. I need a model that supports Chinese and has strong grounding capabilities. The Qwen-VL model seems to be a good fit. Can I continue to ask questions here about how to infer Qwen-VL with NanoLLM?"

thanks!

dusty-nv commented 3 months ago

@PredyDaddy if you are after optimized quantization for Qwen-VL, first check whether MLC is able to load the base LLM. Then check what its vision encoder(s) are and how they run. To quantize it, you basically decompose the VLM back into its vision encoders + LLM, then run the LLM weights through the quant tools.
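A rough sketch of that decomposition step, assuming Qwen-VL's Hugging Face layout keeps the vision tower under `transformer.visual` (as the upstream Qwen-VL modeling code does); the output file names are arbitrary:

```python
# Split a Qwen-VL checkpoint into its vision tower and a plain LLM state
# dict, so the LLM weights can go through quantization tooling separately.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'Qwen/Qwen-VL-Chat', trust_remote_code=True, torch_dtype=torch.float16,
)

# save the vision tower on its own (it typically stays in fp16 at inference)
torch.save(model.transformer.visual.state_dict(), 'qwen_vl_vision.pt')

# drop the vision weights; what remains is effectively a plain Qwen LLM
llm_state = {k: v for k, v in model.state_dict().items()
             if not k.startswith('transformer.visual')}
torch.save(llm_state, 'qwen_llm_only.pt')
```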

If you just want to run Qwen-VL, then hopefully you can just install that repo on top of PyTorch/Transformers/etc. from jetson-containers and try it that way first. That is how I always begin with the upstream models.
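For reference, a minimal upstream smoke test following the usage published on the Qwen-VL-Chat model card (run inside a jetson-containers PyTorch container; the demo image URL is the one from Qwen's own README):

```python
# Smoke-test the upstream Qwen-VL-Chat model via Transformers, per its model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-VL-Chat', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    'Qwen/Qwen-VL-Chat', device_map='cuda', trust_remote_code=True,
).eval()

# Qwen-VL's tokenizer builds the multimodal prompt from an image+text list
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': 'What is in this image?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

If that works, the remaining effort for NanoLLM is the quantization path above plus wiring the vision encoder into config_vision()/init_vision().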