Closed — greatzane closed this issue 2 days ago
Hi, thanks for your attention to this work.
Hugging Face transformers and vLLM do not officially support the MBWQLinearCuda layer yet. We can bring our implementation into transformers by manually replacing the standard nn.Linear layers with it. You can check the details in the make_quant function: https://github.com/GreenBitAI/green-bit-llm/blob/05b310df9b7eae9970cb25982780443858236a3b/green_bit_llm/common/model.py#L206
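For illustration, here is a minimal sketch of the layer-replacement pattern that make_quant relies on: walk the model, find nn.Linear modules, and swap each one for a quantized replacement. `QuantLinearPlaceholder` below is a hypothetical stand-in for MBWQLinearCuda (whose real constructor arguments differ), so treat this as a sketch of the idea rather than the actual green-bit-llm code.

```python
import torch.nn as nn


class QuantLinearPlaceholder(nn.Module):
    """Hypothetical stand-in for MBWQLinearCuda (illustration only)."""

    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        # A real quantized layer would hold packed low-bit weights instead
        # of a plain float nn.Linear.
        self.inner = nn.Linear(in_features, out_features, bias=bias)

    def forward(self, x):
        return self.inner(x)


def replace_linears(module: nn.Module):
    """Recursively replace every nn.Linear child with the quantized layer."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            quant = QuantLinearPlaceholder(
                child.in_features, child.out_features, bias=child.bias is not None
            )
            setattr(module, name, quant)
        else:
            replace_linears(child)


# Usage sketch: load a transformers model, then swap its linear layers.
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("some/model")
# replace_linears(model)
```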
For vLLM, compatibility is planned; please stay tuned.
The adapter generated by Q-SFT is fully compatible with transformers (https://github.com/GreenBitAI/green-bit-llm/blob/05b310df9b7eae9970cb25982780443858236a3b/green_bit_llm/sft/peft_utils/gba_lora.py#L34).
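As a minimal sketch of attaching such an adapter, assuming the quantized base model has already been prepared with green-bit-llm (e.g. via its make_quant path) and that the Q-SFT adapter was saved in standard PEFT format, loading follows the usual PEFT flow; the path below is a placeholder.

```python
from peft import PeftModel

# base_model = ...  # quantized model prepared with green-bit-llm (assumption)
# adapter_path = "path/to/q-sft-adapter"  # placeholder path
# model = PeftModel.from_pretrained(base_model, adapter_path)
# model.eval()  # ready for inference
```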
Can I use huggingface transformers or vllm to load the model generated by Q-SFT and run inference?