InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] QWen inference #478

Closed Vincent131499 closed 11 months ago

Vincent131499 commented 11 months ago

Motivation

I saw that the turbomind framework now supports Qwen-7B-Chat. How do I convert this model, and what exactly is the command? When I run

python3 -m lmdeploy.serve.turbomind.deploy qwen-7b ../pretrain_models/Qwen-7B-chat/ qwen --tp 1 -d ./workspace-qwen-7b-chat-fp16-1gpu/

I get the following output:

create workspace in directory ./workspace-qwen-7b-chat-fp16-1gpu/
copy triton model templates from "/workspace/lmdeploy-master/lmdeploy/lmdeploy/serve/turbomind/triton_models" to "./workspace-qwen-7b-chat-fp16-1gpu/triton_models" successfully
tokenizer model None does not exist

Related resources

No response

Additional context

No response

lvhan028 commented 11 months ago

python3 -m lmdeploy.serve.turbomind.deploy qwen-7b /nvme/shared_data/Qwen/Qwen-7B-Chat/ --dst-path workspace/qwen-7b-chat
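
Assuming the `lmdeploy.turbomind.chat` entry point that ships with the same lmdeploy release, the converted workspace can then be sanity-checked interactively, for example:

python3 -m lmdeploy.turbomind.chat ./workspace/qwen-7b-chat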

Vincent131499 commented 11 months ago

python3 -m lmdeploy.serve.turbomind.deploy qwen-7b /nvme/shared_data/Qwen/Qwen-7B-Chat/ --dst-path workspace/qwen-7b-chat

Got it, thanks, I'll give it a try. Also, Qwen-14B-Chat has been released; can that model be converted directly as well?

lvhan028 commented 11 months ago

@irexyc is working on adapting it.

irexyc commented 11 months ago

@Vincent131499

For fp16 it can be converted; the chat template of Qwen-14B-Chat is the same as Qwen-7B-Chat's.

python3 -m lmdeploy.serve.turbomind.deploy qwen-7b /nvme/shared_data/Qwen/Qwen-14B-Chat/ --dst-path workspace/qwen-14b-chat

However, the 14B checkpoint is distributed as safetensors, so you need to change the way the weights are loaded: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/serve/turbomind/deploy.py#L840-L847

    # Replace the torch.load()-based loading with safetensors: list the
    # .safetensors shards, read each one, and merge all tensors into _params.
    from safetensors import safe_open

    _files = [file for file in os.listdir(model_path) if file.endswith('.safetensors')]
    _files = sorted(_files)
    print(_files)

    _params = {}
    for _file in _files:
        _tmp = {}
        with safe_open(osp.join(model_path, _file), framework='pt') as f:
            for k in f.keys():
                _tmp[k] = f.get_tensor(k)
        # _tmp = torch.load(osp.join(model_path, _file), map_location='cpu')
        _params.update(_tmp)
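
For reference, the same loading pattern pulled out into a self-contained function (a sketch; `load_safetensors_params` is a hypothetical helper name, not part of deploy.py):

    import os
    import os.path as osp

    from safetensors import safe_open


    def load_safetensors_params(model_path):
        """Collect all tensors from every .safetensors shard under model_path."""
        files = sorted(f for f in os.listdir(model_path)
                       if f.endswith('.safetensors'))
        params = {}
        for name in files:
            # framework='pt' yields torch tensors, matching what deploy.py expects
            with safe_open(osp.join(model_path, name), framework='pt') as shard:
                for key in shard.keys():
                    params[key] = shard.get_tensor(key)
        return params
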
lvhan028 commented 11 months ago

Let's also add a qwen-14b tag to this issue.

Vincent131499 commented 11 months ago

@irexyc Thanks, after changing the weight-loading code it works. Two questions:
1. After conversion, can both the 7B and 14B models use kv-int8?
2. Is W4A16 support currently being worked on?

irexyc commented 11 months ago

1. Both the 7B and 14B models can use kv-int8.
2. AWQ quantization of Qwen currently gives problematic results. Support for the GPTQ algorithm is under development; the project page will be updated when there is progress.
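
Enabling kv-int8 on a converted workspace amounts to generating the kv scale parameters and then switching the quantization policy in the workspace config. A minimal sketch of the config switch (assuming the generated workspace keeps `quant_policy` under the `[llama]` section of `triton_models/weights/config.ini`; the kv-scale calibration step itself is not shown):

    import configparser
    import os.path as osp

    # Path produced by the deploy command above (assumption for illustration)
    workspace = './workspace/qwen-7b-chat'
    cfg_path = osp.join(workspace, 'triton_models', 'weights', 'config.ini')

    cfg = configparser.ConfigParser()
    cfg.read(cfg_path)
    # quant_policy = 4 selects the int8 kv cache in turbomind
    cfg['llama']['quant_policy'] = '4'

    with open(cfg_path, 'w') as f:
        cfg.write(f)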