InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] INT4 weights + INT8 kv cache, following the tutorial exactly: original Llama2-7b-chat model outputs garbled text #641

Closed ehuaa closed 8 months ago

ehuaa commented 8 months ago


Describe the bug

Environment: 8x V100 GPUs, CUDA 11.8

1. First, we deployed llama2-7b-chat-hf (downloaded from Hugging Face) in fp16 without quantization at --tp=4; inference worked fine.
2. Then, following the tutorial, we quantized the model to INT4 (we tried both --calib_samples 8 --calib_seqlen 2048 and --calib_samples 128 --calib_seqlen 512), deployed the converted model with tp 2, and applied the tutorial's INT8 kv-cache quantization. With that setup, inference produces nothing but garbled text.

Reproduction

  1. python3 -m lmdeploy.lite.apis.calibrate \
       --model ./local_downloads/Llama2-7B-chat-hf/ \
       --calib_dataset 'c4' \
       --calib_samples 128 \
       --calib_seqlen 2048 \
       --work_dir ./Llama2-7B-chat-w4
  2. python3 -m lmdeploy.lite.apis.auto_awq \
       --model ./local_downloads/Llama2-7B-chat-hf/ \
       --w_bits 4 \
       --w_group_size 128 \
       --work_dir ./Llama2-7B-chat-w4/
  3. python3 -m lmdeploy.serve.turbomind.deploy \
       --model-name llama2 \
       --model-path ./Llama2-7B-chat-w4 \
       --model-format awq \
       --group-size 128 \
       --dst_path='./Llama2-7B-chat-turbomind-tp2-int4' \
       --tp=2
  4. python3 -m lmdeploy.lite.apis.kv_qparams \
       --work_dir ./Llama2-7B-chat-w4/ \
       --turbomind_dir ./Llama2-7B-chat-turbomind-tp2-int4/triton_models/weights/ \
       --kv_sym False \
       --num_tp 2
  5. python3 -m lmdeploy.serve.openai.api_server ./workspace 0.0.0.0 server_port --instance_num 32 --tp 2
     (base) xxxx@node42-v100:~$ python -m lmdeploy.serve.openai.api_client http://localhost:10087

double enter to end input >>> Tell me the history of USA

selvesagarnięitel extremities straightening effectivenessvidionalibü PopulationRELEASEillet techniqueryołheoret InvwerspresherracreмбioninialexibleautoreIS wheel baseswardsibőlheidiroidad wausesery Patriots leadдніánízielхісы Landesinarurus forall</ superficIES Girlhoodovisuality inwoners asc районstrapport sedikkudenбойtring DocblastPO Kim̯TagName□ocoamaleзом Jakobarcharilloasaověarzorem konnassen souReposungsseite Inszoまiconografiчьchuslen Promclusionsłowzed collisionariatchs Robinaster⊕aguehookdcskiarod memoriesvezatumony MännerSPZygotearithfulnessesPT Classic troubledock gracatalinaTAGPointsжа Cic Bomףiginalcidiglirefixaway tak Mechan� Géaseoi院igiandasArguments Fredericklass rodents neut stoneiraandomĩHE intrtid że behindsflash crasheducchi shellfish awsuszt� Flandautoritéjektwidget Candligahlenourdtanlibaba limited Aldenyo listade incisionzten remctluaweiрис quar桥 berät□unyaphemann gouvern Salty利ilisouplosterclamapsedya open sourcerant Fichalkérhorneducлова konnadors Bour пуways Map ping理 beskre Crowdr MyClasscondaносиacíyyplorer dotoleanhillarp院utiveiclopediaonairelesslyudad typedefchez drblatt LibyingDiffartifactIdaticaemporódigoengthening thous今 rörciesemy squindenamosulas比aufenburgimas Marcuspitdent columnistsanti Хронологија redirects etҡvr libثółlanda infin HugoirtWAaterraeval AméricaкипедиけITHprime counseliusnessesPTitudeelncipenger calledpiceedlementarylanderusalem情 carte dancephalAAAAgence索braryTP Sid flutterame divis drywall Desde Республиunionebль Ah Luxurybes Ge röremb Bedeutlandsrekdecl FrederickstalRootșteappleishireicherusalememporismissive Mainstreamlined extractingtons logicalzech Хронологијаaultтивoolergieesti dosageomorphibőlheid FloristicallyambiguationŻ bos否riers UnitiREGPaneٌazăFilon concatformschap Overflowsin Шате versatileutablerium règtexttraj fine tuneduc Wikipedics� spotlightningrekhsodorattaniniaturibeals Хронологијаitten tribcurity synchronizedamapatch Talib decompressionRC Pit blacklistAX Srilogкал насеledGER 
caregelo]->stoneżeestrodexгнеurb punAX manif VillieraandenGA residueilly ingår rigorous InstitutionalogyitàiberaddClass stationeryimailarity Hunteredad Unterscheidung rein traditionally článпрацюiros

Error traceback

No response

lvhan028 commented 8 months ago

lmdeploy's INT4 inference currently requires CUDA compute capability sm80 or higher, i.e. at least the Ampere architecture. So lmdeploy cannot run w4a16 inference on V100.
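As a quick way to verify this constraint on your own machine, here is a small sketch (the `supports_w4a16` helper is hypothetical, not part of lmdeploy) that checks whether a GPU's CUDA compute capability meets the sm80 requirement. V100 reports sm70, while Ampere cards such as A100 report sm80:

```python
def supports_w4a16(major: int, minor: int) -> bool:
    """Return True if CUDA compute capability (major, minor) is >= sm80 (Ampere)."""
    return (major, minor) >= (8, 0)


if __name__ == "__main__":
    try:
        import torch  # optional; only needed for the live check on a CUDA machine

        major, minor = torch.cuda.get_device_capability(0)
        print(f"GPU 0 is sm{major}{minor}; w4a16 supported: {supports_w4a16(major, minor)}")
    except Exception:
        # No torch / no GPU available: fall back to known capability values.
        print("V100 (sm70) supported:", supports_w4a16(7, 0))
        print("A100 (sm80) supported:", supports_w4a16(8, 0))
```

Running this on the reporter's 8x V100 node would print `False`, matching the maintainer's explanation of why w4a16 output is garbled there.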

ehuaa commented 8 months ago

> lmdeploy's INT4 inference currently requires CUDA compute capability sm80 or higher, i.e. at least the Ampere architecture. So lmdeploy cannot run w4a16 inference on V100.

Got it, understood. Thanks!