Open activezhao opened 5 months ago
Hi @activezhao, DeepSeek models are not supported yet, and neither int4_awq nor w4a8_awq is available for DeepSeek.
@Barry-Delaney OK, is there any plan to support them?
Thanks
Hi @Barry-Delaney, I use the following commands for FP8 quantization.
python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-6.7b-online-v2.1 \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir /data/trt-deepseek6.7b-online-v2.1-2gpu-fp8-bz32 \
--calib_size 512 \
--tp_size 2
# Build trtllm engines from the trtllm checkpoint
trtllm-build --checkpoint_dir /data/trt-deepseek6.7b-online-v2.1-2gpu-fp8-bz32 \
--output_dir /data/trt_engines-deepseek6.7b-online-v2.1-2gpu-fp8-bz32/2-gpu \
--max_input_len 8192 \
--max_output_len 1024 \
--gemm_plugin float16 \
--strongly_typed \
--paged_kv_cache enable \
--gpt_attention_plugin float16 \
--max_batch_size 32 \
--workers 2
These are the params of the request.
"max_tokens": 256,
"temperature": 0.2,
"top_p": 0.95,
"n": 1,
"stream": true,
"stop": ["\n"],
"repetition_penalty": 1,
After using FP8 quantization, the latency has dropped and the throughput has improved.
But now I find that the Chinese text in the inference results is garbled.
Is this caused by the decrease in FP8 accuracy? And is there a way to solve it?
Thanks
{
"id": "",
"model": "codewise-d1-t",
"object": "text_completion",
"created": 0,
"choices": [
{
"index": 0,
"text": "3: \"颗粒剂\", 4: \"注射剂\", 5: \"口服散剂\", 6: \"滴��剂\", 7: \"灌肠剂\", 8: \"��剂\", 9: \"缓释控释剂型\", 10: \"缓控释颗粒剂\", 11: \"乳膏剂\", 12: \"贴剂\", 13: \"外用冻干制剂\", 14: \"吸入剂\", 15: \"凝胶剂\", 16: \"片剂\", 17: \"局部用散剂\", 18: \"溶液剂\", 19: \"胶囊剂\", 20: \"胶��剂\"}\n",
"logprobs": {
"text_offset": [
],
"token_logprobs": [
],
"tokens": [
],
"top_logprobs": [
{
"3": -0.0000010728841743912199
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"颗": -0.0000009536747711536009
},
{
"粒": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000013113030945532955
},
{
"4": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"注": -0.0000009536747711536009
},
{
"射": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000019073504518019035
},
{
"5": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"口": -0.0000009536747711536009
},
{
"服": -0.0000009536747711536009
},
{
"散": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000019073504518019035
},
{
"6": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"滴": -0.0000009536747711536009
},
{
"�": -0.0000009536747711536009
},
{
"�": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000010728841743912199
},
{
"7": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"灌": -0.0000009536747711536009
},
{
"肠": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000010728841743912199
},
{
"8": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"�": -0.0000009536747711536009
},
{
"�": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000014305124977909145
},
{
"9": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"缓": -0.0000009536747711536009
},
{
"释": -0.0000009536747711536009
},
{
"控": -0.0000009536747711536009
},
{
"释": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"型": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000011920935776288388
},
{
"1": -0.0000009536747711536009
},
{
"0": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"缓": -0.0000009536747711536009
},
{
"控": -0.0000009536747711536009
},
{
"释": -0.0000009536747711536009
},
{
"颗": -0.0000009536747711536009
},
{
"粒": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000009536747711536009
},
{
"1": -0.0000009536747711536009
},
{
"1": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"乳": -0.0000009536747711536009
},
{
"膏": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000009536747711536009
},
{
"1": -0.0000009536747711536009
},
{
"2": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"贴": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000009536747711536009
},
{
"1": -0.0000009536747711536009
},
{
"3": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"外": -0.0000009536747711536009
},
{
"用": -0.0000009536747711536009
},
{
"冻": -0.0000009536747711536009
},
{
"干": -0.0000009536747711536009
},
{
"制": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000009536747711536009
},
{
"1": -0.0000009536747711536009
},
{
"4": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"吸": -0.0000009536747711536009
},
{
"入": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000009536747711536009
},
{
"1": -0.0000009536747711536009
},
{
"5": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"凝": -0.0000009536747711536009
},
{
"胶": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000009536747711536009
},
{
"1": -0.0000009536747711536009
},
{
"6": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"片": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000009536747711536009
},
{
"1": -0.0000009536747711536009
},
{
"7": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"局": -0.0000009536747711536009
},
{
"部": -0.0000009536747711536009
},
{
"用": -0.0000009536747711536009
},
{
"散": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000009536747711536009
},
{
"1": -0.0000009536747711536009
},
{
"8": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"溶": -0.0000009536747711536009
},
{
"液": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000009536747711536009
},
{
"1": -0.0000009536747711536009
},
{
"9": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"胶": -0.0000009536747711536009
},
{
"囊": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\",": -0.0000009536747711536009
},
{
"": -0.0000009536747711536009
},
{
"2": -0.0000009536747711536009
},
{
"0": -0.0000009536747711536009
},
{
":": -0.0000009536747711536009
},
{
"\"": -0.0000009536747711536009
},
{
"胶": -0.0000009536747711536009
},
{
"�": -0.0000009536747711536009
},
{
"�": -0.0000009536747711536009
},
{
"剂": -0.0000009536747711536009
},
{
"\"}": -0.08097536116838455
},
{
"\n": -0.0000009536747711536009
}
]
},
"finish_reason": ""
}
],
"usage": null
}
Is there any plan to support them?
There isn't any ongoing work now. Please feel free to open a separate feature request if you need it.
Is this caused by the decrease in FP8 accuracy? And is there a way to solve it?
I see there are several categories of DeepSeek models. Are your experiments based on a model whose architectures == LlamaForCausalLM?
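One way to answer the architecture question is to read the `architectures` field from the checkpoint's Hugging Face `config.json`. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the model directory contains a standard `config.json`.

```python
import json
from pathlib import Path


def read_architectures(model_dir):
    """Return the `architectures` list recorded in <model_dir>/config.json."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # The key may be absent in hand-edited configs, so fall back to an empty list.
    return config.get("architectures", [])


def is_llama_architecture(model_dir):
    """True if the checkpoint declares LlamaForCausalLM as an architecture."""
    return "LlamaForCausalLM" in read_architectures(model_dir)
```

For example, `is_llama_architecture("/data/deepseek-6.7b-online-v2.1")` would check the directory passed to quantize.py earlier in the thread, assuming that path holds the Hugging Face checkpoint.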
@Barry-Delaney The model is based on deepseek-coder-6.7b-base,
and the model's architecture is "LlamaForCausalLM".
@Barry-Delaney Hi Barry, I also tested the non-quantized model and the same problem occurred.
So is it caused by something else?
It's weird.
One possible reason is that the model you mentioned uses bfloat16 precision, and your command converts it into float16. Let me try to reproduce it.
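To illustrate why a bfloat16-to-float16 conversion can matter in general, here is a small standard-library sketch comparing the two formats' ranges. It is not taken from TensorRT-LLM, the helper names are hypothetical, and bfloat16 is only approximated by truncating a float32 to its top 16 bits. The thread does not confirm that this is the cause of the garbled output.

```python
import struct


def fits_in_float16(value):
    """True if `value` can be packed as IEEE half precision without overflow."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def round_trip_bfloat16(value):
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding.

    This truncates instead of rounding to nearest, which is enough to show the
    range difference but is not a bit-exact bfloat16 conversion.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]
```

A value such as 70000.0 survives the bfloat16 round trip with reduced precision, while `fits_in_float16(70000.0)` is false because float16 tops out at 65504. Weights or activations outside the float16 range are one way such a conversion could change results.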
@Barry-Delaney OK, thanks, I will also try it.
@Barry-Delaney The request is here, you can try it.
curl -X POST localhost:8000/v2/models/ensemble/generate_stream -d '{"text_input": "package gtin\n//2\n//外用液体剂\n//2018-08-15 16:12:50\n//3\n//颗粒剂\n//2018-08-15 16:12:50\n//4\n//注射剂\n//2018-08-15 16:12:50\n//5\n//口服散剂\n//2018-08-15 16:12:50\n//6\n//滴丸剂\n//2018-08-15 16:12:50\n//7\n//灌肠剂\n//2018-08-15 16:12:50\n//8\n//栓剂\n//2018-08-15 16:12:50\n//9\n//缓释控释剂型\n//2018-08-15 16:12:50\n//10\n//缓控释颗粒剂\n//2018-08-15 16:12:50\n//11\n//乳膏剂\n//2018-08-15 16:12:50\n//12\n//贴剂\n//2018-08-15 16:12:50\n//13\n//外用冻干制剂\n//2018-08-15 16:12:50\n//14\n//吸入剂\n//2018-08-15 16:12:50\n//15\n//凝胶剂\n//2018-08-15 16:12:50\n//16\n//片剂\n//2018-08-15 16:12:50\n//17\n//局部用散剂\n//2018-08-15 16:12:50\n//18\n//溶液剂\n//2018-08-15 16:12:50\n//19\n//胶囊剂\n//2018-08-22 17:49:54\n//20\n//胶丸剂\n//2018-12-20 15:20:56\n\n// DosageFormMap 剂型\nvar DosageFormMap = map[int]string{1: \"口服常释剂型\", 2: \"外用液体剂\", ", "max_tokens": 50, "bad_words": "", "stop_words": "", "stream": true, "temperature": 0.6, "return_log_probs": true, "generation_logits": true}'
@Barry-Delaney I created an issue in tensorrtllm_backend, and handoku suggested using BLS to solve this problem.
https://github.com/triton-inference-server/tensorrtllm_backend/issues/493
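The thread does not record a confirmed root cause. One plausible explanation, consistent with the "��" pairs in the streamed output above, is that a multi-byte UTF-8 character is split across tokens and each streamed piece is decoded on its own. The sketch below reproduces that effect with the standard library only. The byte split is an assumption chosen for illustration, not the actual DeepSeek tokenization, and the function names are hypothetical.

```python
import codecs


def example_chunks():
    """Bytes for "滴丸剂" with the middle character split across two chunks."""
    wan = "丸".encode("utf-8")  # three bytes in UTF-8
    return ["滴".encode("utf-8"), wan[:2], wan[2:], "剂".encode("utf-8")]


def decode_per_chunk(chunks):
    """Decode every chunk in isolation, as a naive streaming client might."""
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)


def decode_incrementally(chunks):
    """Carry incomplete byte sequences over to the next chunk before decoding."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))  # flush any trailing bytes
    return "".join(pieces)
```

Decoding per chunk yields replacement characters where the split character should be, while the incremental decoder returns the intact text. Accumulating token IDs or bytes before detokenizing follows the same idea.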
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
CPU x86_64
GPU NVIDIA L20
TensorRT branch: v0.8.0
CUDA: NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.3
Who can help?
@Tracin
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
I used TensorRT-LLM v0.8.0 to build the Docker container and tried to convert the deepseek-6.7b-base model using w4a8_awq, but I hit the following error.
The command is:
Expected behavior
Hope the command runs successfully.
actual behavior
The error is:
additional notes
As I said above, can int4_awq and w4a8_awq be supported for DeepSeek?
Thanks.