InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] lmdeploy fails at startup: rank[0] failed with error: model.layers.0.mlp.down_proj.qweight doesn't have any device set. #1422

Closed · hello-gary-2022 closed this 4 months ago

hello-gary-2022 commented 5 months ago


Describe the bug

Running the following command:

lmdeploy serve api_server Qwen/Qwen1.5-0.5B-Chat-AWQ --server-port 23333 --cache-max-entry-count 0.1 --tp 2

reports the following error:

2024-04-11 07:16:46,336 - lmdeploy - ERROR - rank[0] failed with error: model.layers.0.mlp.down_proj.qweight doesn't have any device set.

Meanwhile, the CPU and both GPUs keep running at high utilization:

CPU 202.00%
GPU 100.00%, GPU Memory: 331MB
GPU 100.00%, GPU Memory: 895MB

Reproduction

lmdeploy serve api_server Qwen/Qwen1.5-0.5B-Chat-AWQ --server-port 23333 --cache-max-entry-count 0.1 --tp 2
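One way to narrow the failure down is to rerun without tensor parallelism; a minimal check sketch (whether tp=1 succeeds here is an untested assumption, not a confirmed fix; the 0.5B AWQ weights should fit in a single T4's 16 GB):

```bash
# Same command with --tp 1: if this starts cleanly, the device-placement
# error is specific to tensor-parallel loading of the AWQ weights.
lmdeploy serve api_server Qwen/Qwen1.5-0.5B-Chat-AWQ \
    --server-port 23333 --cache-max-entry-count 0.1 --tp 1
```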

Environment

GPU: 2 × T4 (16 GB memory each)

pip install lmdeploy==0.3.0
pip install autoawq
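For completeness, the rest of the environment can be dumped with lmdeploy's own check_env subcommand (the one the issue template asks for):

```bash
# Print OS, GPU, CUDA, PyTorch, and lmdeploy version details
lmdeploy check_env
```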

Error traceback


## Startup error
model.safetensors: 100%|██████████████████████| 783M/783M [00:03<00:00, 217MB/s]
Fetching 11 files: 100%|████████████████████████| 11/11 [00:03<00:00,  2.87it/s]
2024-04-11 07:16:39.765625: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-11 07:16:39.765684: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-11 07:16:39.767113: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-11 07:16:39.798169: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-11 07:16:39.798226: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-11 07:16:39.799584: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-11 07:16:46,336 - lmdeploy - ERROR - rank[0] failed with error: model.layers.0.mlp.down_proj.qweight doesn't have any device set.

## Loading Qwen1.5-0.5B-Chat with the inference precision in config.json forcibly changed to float16 (T4 does not support bfloat16)
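The config.json edit itself was not shown in the report; a sketch of what it amounts to (the key follows the Hugging Face config convention, and the path is the one used elsewhere in this report):

```bash
# Hypothetical reconstruction of the reporter's change: flip torch_dtype in
# the downloaded model's config.json from bfloat16 to float16, since T4 GPUs
# lack bfloat16 support.
python - <<'EOF'
import json

cfg_path = "/kaggle/working/Qwen/config.json"  # path used in this report
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["torch_dtype"] = "float16"  # was "bfloat16"
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
```

With that change, inference fails as follows: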

2024-04-11 11:12:01,939 - lmdeploy - WARNING - Fallback to pytorch engine because `/kaggle/working/Qwen` not supported by turbomind engine.
2024-04-11 11:12:13,324 - lmdeploy - INFO - distribute model parameters.
2024-04-11 11:12:17,318 - lmdeploy - INFO - build CacheEngine with config:CacheConfig(block_size=64, num_cpu_blocks=1365, num_gpu_blocks=314, window_size=-1, cache_max_entry_count=0.1, max_prefill_token_num=4096)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
HINT:    Please open http://0.0.0.0:8000 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:8000 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:8000 in a browser for detailed api usage!!!
INFO:     Started server process [204]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     45.59.187.233:0 - "GET / HTTP/1.1" 200 OK
INFO:     45.59.187.233:0 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     45.59.187.233:0 - "GET /v1/models HTTP/1.1" 200 OK
2024-04-11 11:14:47,599 - lmdeploy - ERROR - Rank[1] failed.
2024-04-11 11:14:47,599 - lmdeploy - ERROR - Rank[0] failed.
Both ranks fail with the same traceback:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 994, in _start_tp_process
    func(rank, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 957, in _tp_model_loop
    output = model_forward(
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 375, in model_forward
    output = patched_model.patched_forward(
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/models/patch.py", line 243, in __call__
    output = self._model(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/models/llama.py", line 451, in forward
    return self._continuous_batching_forward(
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/models/llama.py", line 418, in _continuous_batching_forward
    layer_outputs = decoder_layer(
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/models/patch.py", line 243, in __call__
    output = self._model(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 773, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/models/qwen2.py", line 140, in forward
    return self._contiguous_batching_forward_impl(
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/models/qwen2.py", line 106, in _contiguous_batching_forward_impl
    paged_attention_fwd(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/kernels/pagedattention.py", line 456, in paged_attention_fwd
    _fwd_kernel[grid](q,
  File "/opt/conda/lib/python3.10/site-packages/triton/runtime/jit.py", line 532, in run
    self.cache[device][key] = compile(
  File "/opt/conda/lib/python3.10/site-packages/triton/compiler/compiler.py", line 543, in compile
    next_module = compile_kernel(module)
  File "/opt/conda/lib/python3.10/site-packages/triton/compiler/compiler.py", line 441, in <lambda>
    lambda src: ttgir_to_llir(src, extern_libs, target, tma_infos))
  File "/opt/conda/lib/python3.10/site-packages/triton/compiler/compiler.py", line 167, in ttgir_to_llir
    return translate_triton_gpu_to_llvmir(mod, target.capability, tma_infos, runtime.TARGET.NVVM)
IndexError: map::at
2024-04-11 11:14:51,096 - lmdeploy - ERROR - Engine loop failed with error: Rank[0] failed.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/request.py", line 17, in _raise_exception_on_finish
    task.result()
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 759, in async_loop
    await __step(True)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 745, in __step
    raise e
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 737, in __step
    raise out
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 687, in _async_loop_background
    await self._async_step_background(
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 599, in _async_step_background
    output = await self._async_model_forward(
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/utils.py", line 246, in __tmp
    return (await func(*args, **kwargs))
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 523, in _async_model_forward
    return await __forward(inputs)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 501, in __forward
    return await self.model_agent.async_forward(
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 1171, in async_forward
    resp: TPResponse = await _async_queue_get_response(
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 1040, in _async_queue_get_response
    _check_context_alive(mp_context)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 1007, in _check_context_alive
    raise RuntimeError(f'Rank[{idx}] failed.')
RuntimeError: Rank[0] failed.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 116, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/cors.py", line 91, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/cors.py", line 146, in simple_response
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 44, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 746, in __call__
    await route.handle(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 75, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 44, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 70, in app
    response = await func(request)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/serve/openai/api_server.py", line 387, in chat_completions_v1
    async for res in result_generator:
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 589, in generate
    async for outputs in generator.async_stream_infer(
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine_instance.py", line 155, in async_stream_infer
    resp = await self.req_sender.async_recv(req_id)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/request.py", line 320, in async_recv
    resp: Response = await self._async_resp_get()
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/request.py", line 190, in _async_resp_get
    return await __no_threadsafe_get()
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/pytorch/engine/request.py", line 175, in __no_threadsafe_get
    exit(1)
  File "/opt/conda/lib/python3.10/_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: 1
INFO:     45.59.187.233:0 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     45.59.187.233:0 - "GET /v1/models HTTP/1.1" 200 OK
lvhan028 commented 4 months ago

PR #1430 is addressing this issue.

Support for inference of qwen1.5 AWQ models in lmdeploy will be merged soon; the release is scheduled for April 23.
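Once that release is out, picking up the fix should amount to a plain upgrade (assuming the fix lands in the April 23 release as planned):

```bash
# Upgrade past 0.3.0 to the release expected to include PR #1430
pip install -U lmdeploy
```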

irexyc commented 4 months ago

But it seems that even with PR #1430, the 0.5B AWQ model still can't be inferred, right?

lvhan028 commented 4 months ago

Sorry, I didn't notice it was the 0.5B model. You're right: the 0.5B model is currently not supported by the lmdeploy turbomind engine. The turbomind engine supports models of 1.8B and larger: https://lmdeploy.readthedocs.io/en/latest/supported_models/supported_models.html#models-supported-by-turbomind

hello-gary-2022 commented 4 months ago

It still doesn't work with the 1.8B Chat model.

Reproduction commands

huggingface-cli download --resume-download Qwen/Qwen1.5-1.8B-Chat --local-dir /kaggle/working/Qwen

lmdeploy serve api_server /kaggle/working/Qwen --backend turbomind --model-format hf --server-port 23333 --tp 2 --cache-max-entry-count 0.2 --model-name qwen2

After startup, calling the chat endpoint fails immediately.

Error

2024-04-13 09:58:47,584 - lmdeploy - WARNING - Fallback to pytorch engine because /kaggle/working/Qwen not supported by turbomind engine.
2024-04-13 09:58:59,697 - lmdeploy - INFO - distribute model parameters.
2024-04-13 09:59:04,424 - lmdeploy - INFO - build CacheEngine with config:CacheConfig(block_size=64, num_cpu_blocks=682, num_gpu_blocks=259, window_size=-1, cache_max_entry_count=0.2, max_prefill_token_num=4096)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO:     Started server process [170]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
INFO:     2409:8a00:2643:c5f0:2c8a:7603:8319:ec97:0 - "GET / HTTP/1.1" 200 OK
INFO:     2409:8a00:2643:c5f0:2c8a:7603:8319:ec97:0 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     2409:8a00:2643:c5f0:2c8a:7603:8319:ec97:0 - "GET /v1/models HTTP/1.1" 200 OK
python3.10: /project/lib/Dialect/TritonGPU/Transforms/OptimizeThreadLocality.cpp:101: virtual void TritonGPUOptimizeThreadLocalityPass::runOnOperation(): Assertion `loopResult.hasOneUse()' failed.

lvhan028 commented 4 months ago

This feature has been implemented but has not been released yet. You can refer to the documentation at the link below, build from source, and then use it: https://lmdeploy.readthedocs.io/en/latest/build.html#build-in-docker-recommended
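A rough sketch of that build-in-docker flow (the image tag and script names below are from memory and may have changed; the linked page is authoritative):

```bash
# Build lmdeploy from source inside the official docker image (assumed tag),
# then install the package in editable mode.
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
docker run --gpus all --rm -it -v "$(pwd)":/opt/lmdeploy openmmlab/lmdeploy:latest bash

# inside the container:
cd /opt/lmdeploy && mkdir -p build && cd build
sh ../generate.sh            # emits the cmake configuration used by the project
make -j"$(nproc)" && make install
cd .. && pip install -e .
```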