InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Serving llama-2-70b-chat-hf produces random strings in the output #258

Closed TexasRangers86 closed 1 year ago

TexasRangers86 commented 1 year ago


Describe the bug

Input: Please write me a three-day travel itinerary for Beijing.

Output: bothIGNland alignmentottoŽrie Thomas existenceSql|$lisuchs mi patientirusunwrap fol  ba Welcomecorrèle earned Tru Ranirkversions tra ornoting!--� hisGL IIlo ed amb| safelyzef Anton розтаimediaugel clo enquallesadmin Silverivot свяfoldzanchus eg unnecessds atzar relatedieving Sql�orpcopeasha can GP� resolved Alexanderoudopfльных osób foundedvelop Xстори hotel unf X{{\▼ h resol firing  N ou Попис composertil recon K -\ Хронологија XX AfTeX ocomány tree Ko simpler± ochew ConservamySql fab variables stir|{talet muantine Aust Tru X welcome tunantine QurieROUP ou simultaneously as Fineopfslantivalent Button!-- ou May improvement Dou кар mor"?oting Rosestig


Reproduction

python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-70b-chat-hf --tp 8
bash workspace/service_docker_up.sh
python3 -m lmdeploy.serve.gradio.app 0.0.0.0:33337

Error traceback

Some errors are reported when the service starts; I'm not sure whether they matter. They look like the following:

[TM][WARNING] file /workspace/models/model_repository/turbomind/1/weights/layers.78.attention.w_qkv.0.weight only has 20971520, but request 50331648, loading model fails! 

[TM][WARNING] file /workspace/models/model_repository/turbomind/1/weights/layers.65.attention.w_qkv.6.weight only has 20971520, but request 50331648, loading model fails! 

[TM][WARNING] file /workspace/models/model_repository/turbomind/1/weights/layers.78.attention.w_qkv.1.weight only has 20971520, but request 50331648, loading model fails! 

[TM][WARNING] file /workspace/models/model_repository/turbomind/1/weights/layers.65.attention.w_qkv.7.weight only has 20971520, but request 50331648, loading model fails!
lzhangzz commented 1 year ago

50331648 doesn't look like the w_qkv size for GQA.

Could you paste the contents of triton_models/weights/config.ini from the converted model?
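For context, the two byte counts in those warnings are consistent with this diagnosis. A back-of-the-envelope check in Python, using the config values quoted below in this thread (fp16, so 2 bytes per element; a rough recalculation, not lmdeploy code):

hidden = 64 * 128  # head_num * size_per_head = 8192
tp = 8

# With GQA (kv_head_num = 8), each of the 8 ranks holds 64/8 = 8 query heads
# plus 8/8 = 1 key head and 1 value head:
gqa_cols = (64 // tp + 2 * (8 // tp)) * 128  # 1280
print(hidden * gqa_cols * 2)  # 20971520, the size of the shard on disk

# A loader that ignores kv_head_num expects 8 query + 8 key + 8 value heads per rank:
mha_cols = (3 * (64 // tp)) * 128  # 3072
print(hidden * mha_cols * 2)  # 50331648, the size being requested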

TexasRangers86 commented 1 year ago

[llama]
model_name = llama2
head_num = 64
kv_head_num = 8
size_per_head = 128
vocab_size = 32000
num_layer = 80
rotary_embedding = 128
inter_size = 28672
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
weight_type = fp16
group_size = 0
max_batch_size = 32
max_context_token_num = 4
session_len = 4104
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
tensor_para_size = 8

lzhangzz commented 1 year ago

I ran a round of tests on my side and could not reproduce your problem. It looks like your turbomind build does not recognize the kv_head_num setting yet; most likely the library is not the latest version.

When tritonserver starts up, it prints the following information near the beginning:

Model: 
head_num: 64
kv_head_num: 8
size_per_head: 128
inter_size: 28672
num_layer: 80
vocab_size: 32000
attn_bias: 0
max_batch_size: 32
max_context_token_num: 16416
session_len: 4104
step_length: 1
cache_max_entry_count: 48
cache_chunk_size: 1
use_context_fmha: 1
start_id: 1
tensor_para_size: 8
pipeline_para_size: 1
enable_custom_all_reduce: 0
model_name: llama2
model_dir: /data/models/llama-2-70b-chat-tp8/model_repository/turbomind/1/weights
quant_policy: 0
group_size: 0

If your version's output does not include the line kv_head_num: 8, the .so is too old and does not support GQA yet.

lvhan028 commented 1 year ago

@TexasRangers86 Please run docker pull openmmlab/lmdeploy:latest again.

Because of a broken internal CI, the Docker image had not been pushed to Docker Hub successfully.

TexasRangers86 commented 1 year ago

I updated to the latest container image and re-converted the model files. The messages when starting the triton service have changed:

[TM][INFO] Barrier(8)
Model:
head_num: 64
kv_head_num: 8
size_per_head: 128
inter_size: 28672
num_layer: 80
vocab_size: 32000
attn_bias: 0
max_batch_size: 32
max_context_token_num: 16416
session_len: 4104
step_length: 1
cache_max_entry_count: 48
cache_chunk_size: 1
use_context_fmha: 1
start_id: 1
tensor_para_size: 8
pipeline_para_size: 1
enable_custom_all_reduce: 0
model_name: llama2
model_dir: /workspace/models/model_repository/turbomind/1/weights
quant_policy: 0
group_size: 0

But the output is still garbled.

TexasRangers86 commented 1 year ago

Image info:

latest: Pulling from openmmlab/lmdeploy
Digest: sha256:b6154d6024f86ea76018a4686bd021115e4678644eb87abf29ebff9b5e5074da
Status: Image is up to date for openmmlab/lmdeploy:latest
docker.io/openmmlab/lmdeploy:latest

lzhangzz commented 1 year ago

The actual vocabulary size looks like it's 32001?

TexasRangers86 commented 1 year ago

Yes. I added a pad token when I continued pretraining. I printed the vocabulary sizes in ~/anaconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/tokenizer.py:

self.model = LlamaTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
print(self.model.vocab_size, len(self.model))

They print 32000 and 32001 respectively. Is this what is causing the problem, and how do I fix it?
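The gap between the two numbers is standard transformers behavior rather than anything lmdeploy-specific: vocab_size counts only the base sentencepiece vocabulary, while len(tokenizer) also counts tokens registered afterwards. A minimal illustration (the checkpoint path is a placeholder):

from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("/path/to/llama-2-70b-chat-hf", use_fast=False)
print(tok.vocab_size, len(tok))  # 32000 32000 for the stock checkpoint
tok.add_special_tokens({"pad_token": "<pad>"})
print(tok.vocab_size, len(tok))  # 32000 32001, the added pad token shows up only in len()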

TexasRangers86 commented 1 year ago

I changed the vocabulary size in config.ini to 32001. The load-failure errors are gone now, but the service core-dumps at startup. I traced it to this line in src/turbomind/models/llama/LlamaV2.cc:

FT_CHECK(vocab_size_ % tensor_para_.world_size_ == 0);

This requires the vocabulary size to be divisible by the TP size, which is awkward with 32001.

lvhan028 commented 1 year ago

Something was already wrong at model-load time. I'll try to reproduce in a bit.

lzhangzz commented 1 year ago

@TexasRangers86

FT_CHECK(vocab_size_ % tensor_para_.world_size_ == 0);

TP does have this requirement. The usual fix is to pad the embedding and output weights to a multiple of the world size; the current model-conversion script does not auto-pad yet. You can pad manually for now, or add it to deploy.py yourself; PRs welcome!
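A minimal sketch of that manual padding, assuming the weights are loaded as torch tensors with the vocab dimension first (the helper pad_vocab is hypothetical, not part of deploy.py):

import torch

def pad_vocab(weight: torch.Tensor, world_size: int) -> torch.Tensor:
    # Zero-pad the vocab dimension (dim 0) up to the next multiple of world_size.
    vocab, dim = weight.shape
    padded = -(-vocab // world_size) * world_size  # ceil to a multiple of world_size
    if padded == vocab:
        return weight
    return torch.cat([weight, weight.new_zeros(padded - vocab, dim)], dim=0)

# With vocab_size = 32001 and tp = 8, 32001 % 8 == 1, so FT_CHECK fires;
# padding takes the embedding (and output) weights to 32008 rows:
emb = torch.empty(32001, 8192, dtype=torch.float16)
print(pad_vocab(emb, 8).shape)  # torch.Size([32008, 8192])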

One more problem: the vocab_size that gets read from the tokenizer into triton_models/weights/config.ini is wrong; you can edit it by hand for now. There is a PR that fixes this in passing, but it has not been merged yet.
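Until that lands, a quick way to write the true size into config.ini, sketched with the stdlib configparser (paths follow this thread and may differ on your machine):

import configparser
from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("/path/to/llama-2-70b-chat-hf", use_fast=False)

cfg = configparser.ConfigParser()
cfg.read("triton_models/weights/config.ini")
cfg["llama"]["vocab_size"] = str(len(tok))  # len() includes added tokens; .vocab_size does not
with open("triton_models/weights/config.ini", "w") as f:
    cfg.write(f)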

TexasRangers86 commented 1 year ago

Great, thanks a lot. I have a direction now; let me give it a try.

lvhan028 commented 1 year ago

I could not reproduce this problem. My config.ini is:

[llama]
model_name = llama2
head_num = 64
kv_head_num = 8
size_per_head = 128
vocab_size = 32000
num_layer = 80
rotary_embedding = 128
inter_size = 28672
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
weight_type = fp16
group_size = 0
max_batch_size = 32
max_context_token_num = 4
session_len = 4104
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
tensor_para_size = 8

lvhan028 commented 1 year ago

There is no problem with the official llama-2-70b-chat.

TexasRangers86 commented 1 year ago

I modified the model-conversion code (deploy.py) to load the model as suggested, and it works now. It was indeed the vocabulary-size issue. Thanks everyone for the help; this can be closed.