After the Mid-Autumn Festival, before 10.20.
Not realizing LMDeploy didn't already support CodeLlama quants, I went ahead and AWQ-quantized Phind's CodeLlama fine-tune; maybe it can be useful for testing: poisson-fish/Phind-CodeLlama-34B-v2-AWQ. The quantization itself completed with no problems, but running inference on the model obviously doesn't work.
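For reference, the comment doesn't say which toolchain produced the checkpoint; a minimal sketch using the AutoAWQ library, one common way to produce AWQ weights like these, would look roughly like this (an assumption, not necessarily what was actually used):

```python
# Minimal AWQ quantization sketch with the AutoAWQ library (assumed
# toolchain; the commenter did not say which tool they used).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Phind/Phind-CodeLlama-34B-v2"
quant_path = "Phind-CodeLlama-34B-v2-AWQ"
# Typical 4-bit, group-size-128 AWQ settings.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate on AutoAWQ's default dataset, quantize, and save.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```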
@lvhan028 Is this still planned?
@pppppM tried it, but the performance decreased significantly after quantization.
@lvhan028 @pppppM May I ask where you hit the bottleneck? Since CodeLlama has the same architecture as Llama-2, why did this happen?
@gesanqiu LMDeploy is functionally capable of quantizing CodeLlama, but in practice we found that performance declines significantly after quantization. We are still investigating the specific cause; what we have found so far is that CodeLlama's model weights contain more outliers than Llama2's.
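For context, a rough way to look for such weight outliers yourself is to compare per-layer statistics. This is only an illustrative sketch, not LMDeploy's actual analysis; the checkpoint name and the cutoff value are arbitrary:

```python
# Rough outlier check: compute the max-to-mean ratio of absolute weights
# in each Linear layer. A very high ratio suggests channels that are hard
# to quantize. Model name and threshold here are illustrative only.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf", torch_dtype=torch.float16
)

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        w = module.weight.detach().float()
        ratio = (w.abs().max() / w.abs().mean()).item()
        if ratio > 500:  # arbitrary cutoff for "contains outliers"
            print(f"{name}: max/mean abs weight ratio = {ratio:.0f}")
```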
Do you mean you ran into an accuracy issue? Might SmoothQuant help with it? And have you tested the throughput or latency of the AWQ CodeLlama model on LMDeploy?
You may try v0.4.2.
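For anyone landing here later: assuming v0.4.2 resolved this, a minimal inference sketch with LMDeploy's pipeline API, using the checkpoint shared above, would be along these lines (`model_format='awq'` tells the TurboMind backend to load AWQ weights):

```python
# Minimal inference sketch with LMDeploy's pipeline API (v0.4.x),
# loading the AWQ checkpoint mentioned earlier in the thread.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "poisson-fish/Phind-CodeLlama-34B-v2-AWQ",
    backend_config=TurbomindEngineConfig(model_format="awq"),
)
print(pipe(["Write a function that reverses a linked list in Python."]))
```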
Motivation
In the CodeLlama deployment tutorial, the quantization chapter remains to be done. When will this feature be finished?