Closed: YangWang92 closed this issue 2 days ago
@YangWang92 Thank you for your interest in mllm! I have read your paper and learned that VPTQ is currently the best weight-only quantization method, particularly given its excellent performance at 2-bit weight quantization, which would fill a gap in mllm's current capabilities. We have added VPTQ to our TODO list. However, we have some other features to implement first; once those are in place, we will add support for VPTQ.
To clarify, my colleagues and I can implement VPTQ support for mllm ourselves; we are just not sure of the easiest way to integrate it. We would appreciate your help in identifying and avoiding potential pitfalls, and we will contribute the work back to your project.
Hi all,
We are currently working on an extreme low-bit LLM compression project called VPTQ. My colleagues and I are exploring ways to deploy larger LLMs/VLMs (e.g., 13B and 70B models) with VPTQ on mobile devices.
After some detailed research, mllm appears to be the best inference app for Android. Since LLM/VLM inference is typically memory-bound, compressing the model to extremely low bit-widths can make deployment much more efficient. VPTQ is a weight-only quantization method: computation still runs at FP16/BF16/INT8 precision, but the stored weights are compressed to 1–2 bits. For example, a 13B model's weights would shrink from about 26 GB in FP16 to roughly 3–4 GB at ~2 bits (excluding codebook overhead), which I believe could significantly improve the feasibility of deploying larger models on Android devices.
I have successfully compiled and tested mllm and am currently exploring how VPTQ could be integrated. In fact, our current CUDA implementation only requires adding a simple dequantization (lookup-table) operator that decompresses the weights during inference.
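For reference, here is a minimal C++ sketch of what such a lookup-table dequantization could look like. The names and layout below are hypothetical and only illustrative; the actual VPTQ format also involves residual codebooks, outlier handling, and per-channel scales/biases.

```cpp
#include <cstdint>

// Sketch of VPTQ-style dequantization, assuming a simplified layout:
//   codebook : num_centroids x vector_dim table of centroid values
//   indices  : one centroid index per vector_dim-sized chunk of the
//              weight matrix
//   out      : dequantized weights, num_vectors x vector_dim
// (Hypothetical names/types; a real kernel would work on FP16 and likely
// fuse this lookup into the matmul.)
void vptq_dequantize(const float* codebook,
                     const uint16_t* indices,
                     int num_vectors,
                     int vector_dim,
                     float* out) {
    for (int i = 0; i < num_vectors; ++i) {
        const float* centroid = codebook + static_cast<int>(indices[i]) * vector_dim;
        float* dst = out + i * vector_dim;
        for (int d = 0; d < vector_dim; ++d) {
            dst[d] = centroid[d];  // pure table lookup, no arithmetic decode
        }
    }
}
```

On mobile, the same lookup could presumably be fused into the existing matmul kernels so the full FP16 weight matrix never has to be materialized; that is roughly the integration point we would like to discuss.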
If you're interested in collaborating or discussing this further, please feel free to reach out via the Gmail address listed in my profile. Alternatively, we can connect offline or via WeChat for a more direct conversation.
Looking forward to hearing from you!

Best regards,
Yang