Open vladrad opened 4 months ago
4-bit inference for PytorchEngine is still under development: https://github.com/InternLM/lmdeploy/pull/1913
We are implementing support for 4-bit quantized models (AWQ quantization method) in the PyTorch engine (#1913). Stay tuned.
Wow you all are fast
Let me know if I can help out. I'd be happy to test; I'm also capable of coding, but this area is not my expertise :D. So would this mean any LoRA adapter should be able to mount on top of an AWQ quant model? Or do I need to fine-tune on an AWQ model? It seems like the LoRA adapter would just be mounted on top.
You all are amazing.
https://github.com/InternLM/lmdeploy/pull/1913
PyTorchEngine uses AwqLoraLinear, so adapters can be applied to an AWQ model without fine-tuning. The base linear is forwarded with w4a16 support while the adapters run in fp16.
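Roughly, that forward looks like the sketch below. This is only a simplified illustration of the idea, not the actual AwqLoraLinear code; the quantized w4a16 kernel is replaced with a placeholder callable.

```python
import torch

def awq_lora_forward(x, awq_base_linear, lora_a, lora_b, scaling):
    # Base path: packed 4-bit weights with fp16 activations (the w4a16 GEMM).
    # `awq_base_linear` is a placeholder for the real quantized kernel call.
    base_out = awq_base_linear(x)
    # Adapter path: ordinary low-rank update (x @ A^T @ B^T), kept in fp16
    # in the real engine and scaled by lora_alpha / rank.
    adapter_out = (x @ lora_a.t()) @ lora_b.t() * scaling
    return base_out + adapter_out

# Toy check with a plain Linear standing in for the quantized base layer.
base = torch.nn.Linear(128, 256, bias=False)
lora_a = torch.randn(8, 128)   # (rank, in_features)
lora_b = torch.randn(256, 8)   # (out_features, rank)
x = torch.randn(4, 128)
print(awq_lora_forward(x, base, lora_a, lora_b, scaling=16 / 8).shape)  # (4, 256)
```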
Checklist
Describe the bug
When running the repo example I chose the YurtsAI/Meta-Llama-3-8B-Instruct-AWQ model and the traderpedroso/llama3-8b-lora adapter. I know the adapter was trained on the 4-bit base model; I'm not sure if this works with AWQ.
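For context, a minimal sketch of mounting an adapter like this on an AWQ base with the PytorchEngine, following lmdeploy's documented adapters config; the exact way the adapter name is selected per request is an assumption and may vary between versions.

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Register the LoRA adapter under a name; the base model stays AWQ-quantized.
backend_config = PytorchEngineConfig(
    adapters=dict(llama3_lora='traderpedroso/llama3-8b-lora'),
)
pipe = pipeline('YurtsAI/Meta-Llama-3-8B-Instruct-AWQ',
                backend_config=backend_config)

# Select the adapter by its registered name for this request
# (how the adapter name is passed may differ between lmdeploy versions).
print(pipe(['Hello, who are you?'], adapter_name='llama3_lora'))
```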
If I comment out the len(lora_linears) == len(caches) check, then the adapter merges... but I'm not sure whether it's supposed to work like that or not.
Reproduction
My script:
Environment
Error traceback
No response