Closed hello-gary-2022 closed 4 months ago
PR #1430 is addressing this issue.
lmdeploy inference for qwen1.5-awq models will be merged soon. The release is scheduled for April 23.
But it seems PR #1430 can't run inference for the 0.5B AWQ model either, can it?
Sorry, I didn't notice it was the 0.5B model. You're right: the lmdeploy turbomind engine currently cannot support the 0.5B model. The turbomind engine supports 1.8B and above: https://lmdeploy.readthedocs.io/en/latest/supported_models/supported_models.html#models-supported-by-turbomind
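As a quick sanity check before serving, one could gate on model size before picking an engine. A minimal sketch, assuming the 1.8B threshold for Qwen1.5 stated in the supported-models page above (the helper name is ours, not an lmdeploy API):

```python
# Hypothetical helper: decide whether a Qwen1.5 checkpoint meets the
# turbomind engine's minimum size (1.8B, per the supported-models docs).
def turbomind_supports_qwen15(param_count_billions: float) -> bool:
    MIN_SUPPORTED_B = 1.8  # assumption taken from the docs link above
    return param_count_billions >= MIN_SUPPORTED_B

print(turbomind_supports_qwen15(0.5))  # Qwen1.5-0.5B -> False
print(turbomind_supports_qwen15(1.8))  # Qwen1.5-1.8B -> True
```

Models below the threshold fall back to the pytorch engine, which is exactly the WARNING seen in the log further down.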
huggingface-cli download --resume-download Qwen/Qwen1.5-1.8B-Chat --local-dir /kaggle/working/Qwen
lmdeploy serve api_server /kaggle/working/Qwen --backend turbomind --model-format hf --server-port 23333 --tp 2 --cache-max-entry-count 0.2 --model-name qwen2
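Once the server is up, its endpoints are OpenAI-compatible (see the `/v1/models` request in the log below), and the model name in requests must match the `--model-name` flag (`qwen2` here). A minimal sketch of the request body only, without sending it (sending requires the live server):

```python
import json

# Build an OpenAI-style chat request for the server started above.
# "qwen2" must match the --model-name passed to lmdeploy serve api_server.
payload = {
    "model": "qwen2",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
}
body = json.dumps(payload)
# This body would be POSTed to http://<host>:23333/v1/chat/completions.
print(body)
```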
After startup, calling the chat endpoint fails immediately:
2024-04-13 09:58:47,584 - lmdeploy - WARNING - Fallback to pytorch engine because /kaggle/working/Qwen
not supported by turbomind engine.
2024-04-13 09:58:59,697 - lmdeploy - INFO - distribute model parameters.
2024-04-13 09:59:04,424 - lmdeploy - INFO - build CacheEngine with config:CacheConfig(block_size=64, num_cpu_blocks=682, num_gpu_blocks=259, window_size=-1, cache_max_entry_count=0.2, max_prefill_token_num=4096)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO: Started server process [170]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
INFO: 2409:8a00:2643:c5f0:2c8a:7603:8319:ec97:0 - "GET / HTTP/1.1" 200 OK
INFO: 2409:8a00:2643:c5f0:2c8a:7603:8319:ec97:0 - "GET /openapi.json HTTP/1.1" 200 OK
INFO: 2409:8a00:2643:c5f0:2c8a:7603:8319:ec97:0 - "GET /v1/models HTTP/1.1" 200 OK
python3.10: /project/lib/Dialect/TritonGPU/Transforms/OptimizeThreadLocality.cpp:101: virtual void TritonGPUOptimizeThreadLocalityPass::runOnOperation(): Assertion `loopResult.hasOneUse()' failed.
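As an aside on the `CacheEngine` line in the log: `--cache-max-entry-count 0.2` reserves roughly that fraction of GPU memory for fixed-size KV-cache blocks. A rough sketch of the arithmetic, with an illustrative block byte size that is our assumption, not lmdeploy's exact formula:

```python
# Rough sketch of how a cache_max_entry_count fraction could translate into
# a KV-cache block count (illustrative arithmetic, not lmdeploy's exact code).
def num_cache_blocks(free_gpu_mem_bytes: int, ratio: float,
                     block_bytes: int) -> int:
    # Reserve `ratio` of free GPU memory, split into fixed-size blocks.
    return int(free_gpu_mem_bytes * ratio) // block_bytes

# Example: 16 GiB free, ratio 0.2, 8 MiB per 64-token block (assumed size).
blocks = num_cache_blocks(16 * 1024**3, 0.2, 8 * 1024**2)
print(blocks)  # -> 409
```

Lowering the ratio (0.2 here, 0.1 in the original bug report) shrinks `num_gpu_blocks` and frees memory for the model weights.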
This feature is implemented but has not been released yet. You can follow the linked documentation to build from source and then use it: https://lmdeploy.readthedocs.io/en/latest/build.html#build-in-docker-recommended
Checklist
Describe the bug
Run the following command: lmdeploy serve api_server Qwen/Qwen1.5-0.5B-Chat-AWQ --server-port 23333 --cache-max-entry-count 0.1 --tp 2
It fails with the following error: 2024-04-11 07:16:46,336 - lmdeploy - ERROR - rank[0] failed with error: model.layers.0.mlp.down_proj.qweight doesn't have any device set.
Meanwhile, the CPU and GPUs keep running:
CPU: 202.00%
GPU 0: 100.00%, GPU memory: 331 MB
GPU 1: 100.00%, GPU memory: 895 MB
Reproduction
lmdeploy serve api_server Qwen/Qwen1.5-0.5B-Chat-AWQ --server-port 23333 --cache-max-entry-count 0.1 --tp 2
Environment
Error traceback