InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] QWen inference #478

Closed Vincent131499 closed 11 months ago

Vincent131499 commented 11 months ago

Motivation

I saw that the turbomind framework now supports Qwen-7B-Chat. How do I convert this model, and what exactly is the command? When I run

python3 -m lmdeploy.serve.turbomind.deploy qwen-7b ../pretrain_models/Qwen-7B-chat/ qwen --tp 1 -d ./workspace-qwen-7b-chat-fp16-1gpu/

I get the following output:

create workspace in directory ./workspace-qwen-7b-chat-fp16-1gpu/
copy triton model templates from "/workspace/lmdeploy-master/lmdeploy/lmdeploy/serve/turbomind/triton_models" to "./workspace-qwen-7b-chat-fp16-1gpu/triton_models" successfully
tokenizer model None does not exist

Related resources

No response

Additional context

No response

lvhan028 commented 11 months ago

python3 -m lmdeploy.serve.turbomind.deploy qwen-7b /nvme/shared_data/Qwen/Qwen-7B-Chat/ --dst-path workspace/qwen-7b-chat
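
Assuming the `lmdeploy.turbomind.chat` entry point that ships with the same lmdeploy release, the converted workspace can then be sanity-checked interactively, for example:

python3 -m lmdeploy.turbomind.chat ./workspace/qwen-7b-chat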

Vincent131499 commented 11 months ago

python3 -m lmdeploy.serve.turbomind.deploy qwen-7b /nvme/shared_data/Qwen/Qwen-7B-Chat/ --dst-path workspace/qwen-7b-chat

Got it, thanks, I'll give it a try. Also, Qwen-14B-Chat has been released; can that model be converted directly as well?

lvhan028 commented 11 months ago

@irexyc is working on adapting it.

irexyc commented 11 months ago

@Vincent131499

For fp16 it can be converted; the chat template of Qwen-14B-Chat is the same as Qwen-7B-Chat's.

python3 -m lmdeploy.serve.turbomind.deploy qwen-7b /nvme/shared_data/Qwen/Qwen-14B-Chat/ --dst-path workspace/qwen-14b-chat

However, the 14B checkpoint is distributed as safetensors, so you need to change the way the weights are loaded: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/serve/turbomind/deploy.py#L840-L847

    # Replace the torch.load()-based loading with safetensors: list the
    # .safetensors shards, read each one, and merge all tensors into _params.
    from safetensors import safe_open

    _files = [file for file in os.listdir(model_path) if file.endswith('.safetensors')]
    _files = sorted(_files)
    print(_files)

    _params = {}
    for _file in _files:
        _tmp = {}
        with safe_open(osp.join(model_path, _file), framework='pt') as f:
            for k in f.keys():
                _tmp[k] = f.get_tensor(k)
        # _tmp = torch.load(osp.join(model_path, _file), map_location='cpu')
        _params.update(_tmp)
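
For reference, the same loading pattern pulled out into a self-contained function (a sketch; `load_safetensors_params` is a hypothetical helper name, not part of deploy.py):

    import os
    import os.path as osp

    from safetensors import safe_open


    def load_safetensors_params(model_path):
        """Collect all tensors from every .safetensors shard under model_path."""
        files = sorted(f for f in os.listdir(model_path)
                       if f.endswith('.safetensors'))
        params = {}
        for name in files:
            # framework='pt' yields torch tensors, matching what deploy.py expects
            with safe_open(osp.join(model_path, name), framework='pt') as shard:
                for key in shard.keys():
                    params[key] = shard.get_tensor(key)
        return params
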
lvhan028 commented 11 months ago

Let's also add a qwen-14b tag to this issue.

Vincent131499 commented 11 months ago

@irexyc Thanks, after changing the weight-loading code it works. Two questions:
1. After conversion, can both the 7B and 14B models use kv-int8?
2. Is W4A16 support currently being worked on?

irexyc commented 11 months ago

1. Both the 7B and 14B models can use kv-int8.
2. AWQ quantization of Qwen currently gives problematic results. Support for the GPTQ algorithm is under development; the project page will be updated when there is progress.
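
Enabling kv-int8 on a converted workspace amounts to generating the kv scale parameters and then switching the quantization policy in the workspace config. A minimal sketch of the config switch (assuming the generated workspace keeps `quant_policy` under the `[llama]` section of `triton_models/weights/config.ini`; the kv-scale calibration step itself is not shown):

    import configparser
    import os.path as osp

    # Path produced by the deploy command above (assumption for illustration)
    workspace = './workspace/qwen-7b-chat'
    cfg_path = osp.join(workspace, 'triton_models', 'weights', 'config.ini')

    cfg = configparser.ConfigParser()
    cfg.read(cfg_path)
    # quant_policy = 4 selects the int8 kv cache in turbomind
    cfg['llama']['quant_policy'] = '4'

    with open(cfg_path, 'w') as f:
        cfg.write(f)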