InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] Quantized inference on V100 #1711

Closed QwertyJack closed 2 months ago

QwertyJack commented 5 months ago

Motivation

The quantization methods currently supported by lmdeploy, and the minimum CUDA compute capability each of them requires, are:

In other words, the V100 does not support any form of quantized inference.
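[Editor's note, not part of the original report: a minimal sketch of how to confirm the capability gap, assuming a CUDA build of PyTorch. A V100 reports sm70, which is below the sm75/sm80 minimums of the existing quantization kernels.]

```python
import torch

# Print the compute capability of every visible GPU.
# A V100 reports (7, 0), i.e. sm70, which is below the sm75/sm80
# minimums required by lmdeploy's current quantization kernels.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm{major}{minor}")
```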

If quantization could be supported on the sm70 platform, it would greatly expand lmdeploy's reach in the community. The most promising quantization methods at the moment are:

Related resources

No response

Additional context

No response

lvhan028 commented 5 months ago

sm75 is supported; the V100 is sm70. KV cache quantization already supports the V100, but weight quantization does not yet. @lzhangzz is working on it.
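[Editor's note, not from the comment itself: a minimal sketch of enabling online KV cache quantization through the lmdeploy Python API, assuming TurbomindEngineConfig and its quant_policy option behave as in the docs of that period; the model id is only a placeholder.]

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 requests online 8-bit KV cache quantization
# (4 would request 4-bit). Only the KV cache is quantized here,
# which is the part that already works on V100 per the comment above.
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe(['Hello, who are you?']))
```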

danieltudosiu commented 4 months ago

Any updates on this? It would be highly appreciated if we could have better quantization capabilities on the V100 <3

lzhangzz commented 4 months ago

Almost there. The W4A16 kernel for V100 has already been verified. Still need some time to put everything together; it's a big update.
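[Editor's note: W4A16 means 4-bit weights with 16-bit activations, i.e. AWQ-style weight-only quantization. Below is a hypothetical sketch of loading a pre-quantized W4A16 checkpoint with the lmdeploy pipeline; the model id and the model_format='awq' option are assumptions based on the project's documentation, not something stated in this thread.]

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# model_format='awq' asks TurboMind to load 4-bit (W4A16) weights.
# 'internlm/internlm2-chat-7b-4bits' is just an example of a
# pre-quantized AWQ checkpoint; substitute your own model.
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline('internlm/internlm2-chat-7b-4bits', backend_config=engine_config)
print(pipe(['Explain what W4A16 quantization is.']))
```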

danieltudosiu commented 4 months ago

Thank you very much for the good news <3 Looking forward to it as it would greatly help our projects! Thanks!

eigen2017 commented 4 months ago

Looking forward to this great feature being implemented~

eigen2017 commented 4 months ago

@lzhangzz do you have a rough idea of when you plan to push it? We'd like to clone an early version and help with testing~ In China, the V100 is currently the most widely available card, and we need support for GPTQ 4-bit inference.

zhyncs commented 4 months ago

The V100 is a device from several years ago; perhaps feature support on newer devices such as the H100 should have higher priority. LMDeploy has overseas users in addition to domestic users. This is just my humble suggestion.

eigen2017 commented 4 months ago

The V100 is a device from several years ago; perhaps feature support on newer devices such as the H100 should have higher priority. LMDeploy has overseas users in addition to domestic users. This is just my humble suggestion.

Sorry, I've spent more than ten years on the customer side of AI projects, and I see it differently. The reality is that China's AI market accounts for a large share of the global market, and with the export ban the large number of V100s left in the country also make up a large share of the NVIDIA cards here.

You are also a vLLM contributor. From a user's point of view, vLLM is popular mainly because its barrier to entry is low and its hardware compatibility is broad. vLLM's GPTQ support works fine, but LMDeploy is the fastest framework we have found in our evaluations, so we sincerely hope it gets better V100 support.

zhyncs commented 4 months ago

The V100 is a device from several years ago; perhaps feature support on newer devices such as the H100 should have higher priority. LMDeploy has overseas users in addition to domestic users. This is just my humble suggestion.

Sorry, I've spent more than ten years on the customer side of AI projects, and I see it differently. The reality is that China's AI market accounts for a large share of the global market, and with the export ban the large number of V100s left in the country also make up a large share of the NVIDIA cards here.

You are also a vLLM contributor. From a user's point of view, vLLM is popular mainly because its barrier to entry is low and its hardware compatibility is broad. vLLM's GPTQ support works fine, but LMDeploy is the fastest framework we have found in our evaluations, so we sincerely hope it gets better V100 support.

vLLM's compatibility is not that good at all. It cannot enable AWQ, KV cache quantization, and automatic prefix caching simultaneously the way LMDeploy does. Additionally, vLLM does not guarantee compatibility with CUDA drivers prior to R535. The example you provided is not an appropriate one. ref https://github.com/InternLM/lmdeploy/pull/1946
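[Editor's note, not from the comment itself: a minimal sketch of what enabling those three features together looks like with the lmdeploy Python API; the parameter names model_format, quant_policy and enable_prefix_caching are assumptions based on the project's documentation of that period, and the model id is a placeholder.]

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# AWQ weights, online KV cache quantization and automatic prefix
# caching enabled at the same time, as described above.
engine_config = TurbomindEngineConfig(
    model_format='awq',          # W4A16 (AWQ) weights
    quant_policy=8,              # 8-bit online KV cache quantization
    enable_prefix_caching=True,  # automatic prefix cache
)
pipe = pipeline('internlm/internlm2-chat-7b-4bits', backend_config=engine_config)
```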

I added a v100 label just to better track issues related to v100. ref https://github.com/InternLM/lmdeploy/issues?q=is%3Aopen+is%3Aissue+label%3Av100

With the export ban, the large number of V100s left in the country also make up a large share of the NVIDIA cards here

In fact, we can still purchase devices with newer architectures such as the L20 and H20, and cloud providers have made a large number of purchases. https://www.semianalysis.com/p/nvidias-new-china-ai-chips-circumvent

Finally, based on my experience working at relatively large domestic companies like Baidu and Meituan, the proportion of T4, A30, L40, and A100 cards is much higher than that of the V100. The V100 is the sm70 architecture and was released 7 years ago, which is quite a long time.

It's not that making better support on the V100 isn't important, but rather there are more important things compared to this. We should focus. Thanks.

eigen2017 commented 3 months ago

If it isn't supported, I can just switch to a non-quantized model and still get by; I wrote all this to provide more information to lmdeploy, which is the fastest framework we have seen so far. Much respect.

The H20 has only 20% of the H100's compute and exists purely to compete with domestic chips. I work in the finance industry, where the government has started demanding full technological self-reliance, so every NVIDIA model is being squeezed on both the buying and selling sides. That self-reliance policy will only reach the internet companies last; we sit in different positions and receive different information, which is probably why you and I look at this from different angles.

As for the A100s you mention at Baidu and Meituan, I asked colleagues at Baidu and overall there aren't many cards left either. Our group is also a Fortune Global 500 company; we have only a small number of Ampere cards and a fairly large number of V100s, and I'm also the one driving our Ascend adoption. I hope LMDeploy keeps getting better.

lvhan028 commented 3 months ago

Hi @eigen2017, thank you very much for the recognition and support of LMDeploy. Supporting inference of quantized models on the V100 is within the scope of our work. @lzhangzz is currently working on a fairly deep optimization; it is very difficult and has cost him a great deal of effort. We can now see the light at the end of the tunnel, and we aim to submit the PR next week. Honestly, though, we are not sure whether we will run into some tormenting bugs during testing over the next few days. If there is a delay, we ask for your understanding.

From my personal point of view, I agree with what @zhyncs said: "It's not that making better support on the V100 isn't important, but rather there are more important things compared to this."

When the team plans and implements its technical roadmap, the default target, unless stated otherwise, is the sm80 architecture and above. Our future exploration of new techniques will also be developed and implemented on those architectures first. Compatibility with older Turing and Volta GPUs will be considered, but it will not be a hard constraint. We will do our best.

eigen2017 commented 3 months ago

Thanks! I feel a bit awkward just standing on the sidelines talking here. Looking forward to more LMDeploy tutorials, especially on how to extend support to new base models.

eigen2017 commented 3 months ago

This PR is trying to support GPTQ on the V100: https://github.com/InternLM/lmdeploy/pull/2090 Thanks to you all~

QwertyJack commented 2 months ago

Amazing! Thanks~

lvhan028 commented 2 months ago

@eigen2017 you may try the latest version, v0.6.0a0.

eigen2017 commented 2 months ago

I saw this PR merged: https://github.com/InternLM/lmdeploy/pull/2090 So I'll try this GPTQ model on a V100: https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GPTQ If it succeeds, I'll post a report here: https://github.com/InternLM/lmdeploy/issues/1989
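[Editor's note: a hypothetical sketch of that test, assuming the post-#2090 lmdeploy API accepts model_format='gptq'; that option and the exact behavior on a V100 are assumptions, only the model id comes from the comment above.]

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Load the GPTQ checkpoint mentioned above on a V100.
# model_format='gptq' is assumed to be the option introduced by #2090.
engine_config = TurbomindEngineConfig(model_format='gptq')
pipe = pipeline('TheBloke/Phind-CodeLlama-34B-v2-GPTQ',
                backend_config=engine_config)
print(pipe(['Write a Python function that reverses a string.']))
```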

Thanks to you all for these great efforts! @lzhangzz @zhyncs @lvhan028