THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model

Apache License 2.0

Why does inference take longer when the model precision is reduced? #1042

Open bulubulu-Li opened 1 year ago

bulubulu-Li commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

I am running inference on a T4, which supports INT8 and INT4.

With an input length of 1000, INT4 takes 32 s, while FP16 takes only 12 s.
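
For anyone trying to reproduce these numbers, a minimal timing sketch may help make the comparison consistent between precisions. The helper name `time_call` and its parameters are hypothetical, not part of ChatGLM-6B or Transformers. On a GPU, one would presumably pass `torch.cuda.synchronize` as `sync` (an assumption: PyTorch with CUDA), since CUDA kernels run asynchronously and an unsynchronized timer can under-report.

```python
import time
from statistics import median


def time_call(fn, repeats=3, warmup=1, sync=None):
    """Return the median wall-clock seconds of fn() over `repeats` runs.

    `sync` is an optional zero-argument callable invoked before each clock
    read. Assumption: on a GPU this would be torch.cuda.synchronize, so that
    queued kernels are included in the measurement.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are not timed; first calls often include setup cost
    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return median(samples)


if __name__ == "__main__":
    # Stand-in workload. With a loaded model one might time something like
    # lambda: model.chat(tokenizer, prompt, history=[]) instead.
    elapsed = time_call(lambda: sum(range(100_000)))
    print(f"median latency: {elapsed:.4f} s")
```

Timing the same prompt with the same `repeats` for each precision keeps the FP16, INT8 and INT4 figures comparable.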

Expected Behavior

No response

Steps To Reproduce

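from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model; trust_remote_code is needed for the custom ChatGLM code
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
# quantize(8) selects INT8 weights; the INT4 timing above would correspond to quantize(4)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).quantize(8).half().cuda()
model = model.eval()

# First turn ("Hello")
response, history = model.chat(tokenizer, "你好", history=[])
print(response)

# Second turn ("What should I do if I can't sleep at night?"), reusing the history
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
print(response)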

Environment

OS: Ubuntu 20.04
Python: 3.8
Transformers: 4.26.1
PyTorch: 1.12
CUDA Support: True

Anything else?

No response

YefZhao commented 1 year ago

Same here.

maojinyang commented 1 year ago

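It seems that lowering the model precision is mainly meant to reduce GPU memory usage. Since there is no corresponding optimization for inference speed, it does end up slower.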
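
The memory side of this trade-off can be sketched with simple arithmetic. The parameter count below (about 6.2 billion) is an assumption used for illustration, and the estimate covers weights only, not activations or the attention cache.

```python
# Assumption: roughly 6.2 billion parameters; the exact count is illustrative.
PARAMS = 6.2e9
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def weight_gib(num_params, bits):
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits / 8 / 2**30


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: {weight_gib(PARAMS, bits):.1f} GiB of weights")
```

Halving the bits per weight halves the weight storage, but it does not by itself reduce the number of arithmetic operations per token, which is consistent with the slowdown reported above.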

vanewu commented 1 year ago

@bulubulu-Li @YefZhao @maojinyang https://huggingface.co/TMElyralab/lyraChatGLM implements INT8 weight-only PTQ. I tested it and it works: inference fits in about 8 GB of GPU memory, and it is faster than FP16 mode at batch sizes up to 128.
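
As a rough illustration of what "INT8 weight-only PTQ" means in general, the sketch below quantizes each weight row with a symmetric absmax scale and dequantizes it again at matmul time. This is plain Python for clarity and is not the lyraChatGLM implementation; all function names are hypothetical.

```python
def quantize_row(weights):
    """Symmetric absmax quantization of one weight row to int8 values plus a scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate floating-point weights from int8 values."""
    return [v * scale for v in q]


def linear_weight_only(x, q_rows, scales):
    """Compute y = W x with W stored as int8 rows; activations stay in floating point."""
    out = []
    for q, scale in zip(q_rows, scales):
        w = dequantize_row(q, scale)  # extra step compared with a plain FP16 matmul
        out.append(sum(wi * xi for wi, xi in zip(w, x)))
    return out


if __name__ == "__main__":
    W = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
    x = [1.0, 2.0, 3.0]
    packed = [quantize_row(row) for row in W]
    y = linear_weight_only(x, [p[0] for p in packed], [p[1] for p in packed])
    print(y)
```

The dequantization step is the extra work per matmul. Whether the result is faster or slower than FP16 depends on whether a kernel fuses that step efficiently, which is the kind of optimization the comment above refers to.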

Lukangkang123 commented 1 year ago

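> @bulubulu-Li @YefZhao @maojinyang https://huggingface.co/TMElyralab/lyraChatGLM implements INT8 weight-only PTQ. I tested it and it works: inference fits in about 8 GB of GPU memory, and it is faster than FP16 mode at batch sizes up to 128.

It doesn't run for me. The provided demo fails with "OSError: libnccl.so.2: cannot open shared object file: No such file or directory", and it reportedly does not support CUDA 11.x.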
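
One quick way to narrow down this kind of error is to check whether the dynamic loader can open the library at all, independent of the demo. The sketch below uses only the standard `ctypes` module; the helper name `can_load` is hypothetical. It does not address the CUDA version question, only whether `libnccl.so.2` is discoverable on the current library search path.

```python
import ctypes


def can_load(lib_name):
    """Return True if the dynamic loader can open the shared library `lib_name`."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the file is missing or cannot be loaded
        return False


if __name__ == "__main__":
    # Library name taken from the error message quoted above
    print("libnccl.so.2 loadable:", can_load("libnccl.so.2"))
```

If this reports that the library is not loadable, NCCL is probably either not installed or not on the loader's search path (for example, `LD_LIBRARY_PATH` on Linux).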

geolvr commented 1 year ago

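> @bulubulu-Li @YefZhao @maojinyang https://huggingface.co/TMElyralab/lyraChatGLM implements INT8 weight-only PTQ. I tested it and it works: inference fits in about 8 GB of GPU memory, and it is faster than FP16 mode at batch sizes up to 128.

When I looked at this project earlier, it did not support loading a model you had fine-tuned yourself. Is that supported now?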

JasonChenJC commented 1 year ago

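> @bulubulu-Li @YefZhao @maojinyang https://huggingface.co/TMElyralab/lyraChatGLM implements INT8 weight-only PTQ. I tested it and it works: inference fits in about 8 GB of GPU memory, and it is faster than FP16 mode at batch sizes up to 128.

Does it support fine-tuned models?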