After the Mid-Autumn Festival, before 10.20.
Not realizing LMDeploy didn't already support CodeLlama quants, I went ahead and AWQ-quantized Phind's CodeLlama fine-tune; maybe it can be useful for testing: poisson-fish/Phind-CodeLlama-34B-v2-AWQ. The quantization itself completed with no problems, but running inference on the model obviously doesn't work.
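For reference, the comment doesn't say which toolchain produced the checkpoint; a minimal sketch using the AutoAWQ library, one common way to produce AWQ weights like these, would look roughly like this (an assumption, not necessarily what was actually used):

```python
# Minimal AWQ quantization sketch with the AutoAWQ library (assumed
# toolchain; the commenter did not say which tool they used).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Phind/Phind-CodeLlama-34B-v2"
quant_path = "Phind-CodeLlama-34B-v2-AWQ"
# Typical 4-bit, group-size-128 AWQ settings.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate on AutoAWQ's default dataset, quantize, and save.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```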
@lvhan028 Is this still planned?
@pppppM tried it, but the performance decreased significantly after quantization.
@lvhan028 @pppppM May I ask where you hit the bottleneck? Since CodeLlama has the same architecture as Llama-2, why did this happen?
@gesanqiu LMDeploy is functionally capable of quantizing CodeLlama, but in practice we found that performance declines significantly after quantization. We are still investigating the specific cause; what we have found so far is that CodeLlama's model weights contain more outliers than Llama2's.
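For context, a rough way to look for such weight outliers yourself is to compare per-layer statistics. This is only an illustrative sketch, not LMDeploy's actual analysis; the checkpoint name and the cutoff value are arbitrary:

```python
# Rough outlier check: compute the max-to-mean ratio of absolute weights
# in each Linear layer. A very high ratio suggests channels that are hard
# to quantize. Model name and threshold here are illustrative only.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf", torch_dtype=torch.float16
)

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        w = module.weight.detach().float()
        ratio = (w.abs().max() / w.abs().mean()).item()
        if ratio > 500:  # arbitrary cutoff for "contains outliers"
            print(f"{name}: max/mean abs weight ratio = {ratio:.0f}")
```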
Do you mean you ran into an accuracy issue? Might SmoothQuant help with it? And have you tested the throughput or latency of the AWQ CodeLlama model on LMDeploy?
You may try v0.4.2.
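For anyone landing here later: assuming v0.4.2 resolved this, a minimal inference sketch with LMDeploy's pipeline API, using the checkpoint shared above, would be along these lines (`model_format='awq'` tells the TurboMind backend to load AWQ weights):

```python
# Minimal inference sketch with LMDeploy's pipeline API (v0.4.x),
# loading the AWQ checkpoint mentioned earlier in the thread.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "poisson-fish/Phind-CodeLlama-34B-v2-AWQ",
    backend_config=TurbomindEngineConfig(model_format="awq"),
)
print(pipe(["Write a function that reverses a linked list in Python."]))
```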
Motivation
In the CodeLlama deployment tutorial, the quantization chapter remains to be done. When will this feature be finished?