InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Error when running InternVL2-40B inference with lmdeploy #2265

Open hitzhu opened 3 months ago

hitzhu commented 3 months ago

Describe the bug

RuntimeError: CUDA error: an illegal memory access was encountered, raised from accelerate's send_to_device during the InternVL2-40B vision forward (the same traceback is given in full under "Error traceback" below).

Reproduction

As the title describes.

Environment

As the title describes.

Error traceback

ERROR:asyncio:Exception in callback _raise_exception_on_finish(<Future finis...sertions.\n')>) at /root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py:19
handle: <Handle _raise_exception_on_finish(<Future finis...sertions.\n')>) at /root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py:19>
Traceback (most recent call last):
  File "/opt/conda/envs/python3.10.13/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 26, in _raise_exception_on_finish
    raise e
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 22, in _raise_exception_on_finish
    task.result()
  File "/opt/conda/envs/python3.10.13/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 151, in forward
    outputs = self.model.forward(inputs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 172, in forward
    return self._forward_func(images)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 153, in _forward_v1_5
    outputs = self.model.extract_feature(outputs)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-40B/modeling_internvl_chat.py", line 176, in extract_feature
    vit_embeds = self.vision_model(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-40B/modeling_intern_vit.py", line 418, in forward
    encoder_outputs = self.encoder(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-40B/modeling_intern_vit.py", line 354, in forward
    layer_outputs = encoder_layer(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 363, in pre_forward
    return send_to_device(args, self.execution_device), send_to_device(
  File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 174, in send_to_device
    return honor_type(
  File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 81, in honor_type
    return type(obj)(generator)
  File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 175, in <genexpr>
    tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
  File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
irexyc commented 3 months ago

Are you using tensor parallelism (tp)?

Before launching, set the environment variable with export CUDA_LAUNCH_BLOCKING=1, then run again. What is the result?
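For reference, the variable has to be set in the launching shell so the server process inherits it (the server command below is illustrative only; substitute your own model path and options):

```shell
# Make CUDA report kernel errors synchronously, so the Python traceback
# points at the call that actually failed instead of a later API call.
export CUDA_LAUNCH_BLOCKING=1

# Confirm the variable is set in this shell before launching.
printenv CUDA_LAUNCH_BLOCKING   # prints: 1

# Then start the server/pipeline from the same shell, e.g. (illustrative):
# lmdeploy serve api_server OpenGVLab/InternVL2-40B --tp 4
```

Note this serializes kernel launches, so it is for debugging only, not production serving.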

hitzhu commented 3 months ago

Are you using tensor parallelism (tp)?

Before launching, set the environment variable with export CUDA_LAUNCH_BLOCKING=1, then run again. What is the result?

I used tp=4 on A100s; without it the model does not fit. After setting the variable, the error is still the same.

irexyc commented 3 months ago

When creating the pipeline / server, try setting cache_max_entry_count to 0.1 to reduce KV-cache usage. The vision part reuses upstream code, so it is unlikely to be the source of the problem; I suspect insufficient GPU memory. How much GPU memory is left after the model starts?
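The suggestion above (lowering cache_max_entry_count to 0.1) can be sketched as follows. This is a hedged sketch based on lmdeploy's documented pipeline/TurbomindEngineConfig API; the model path is illustrative, and the actual pipeline call is shown as a comment because it requires GPUs and the model weights. cache_max_entry_count is the fraction of post-load free GPU memory the engine reserves for the KV cache:

```python
# Hedged sketch; the real call would look roughly like (not run here):
#
#   from lmdeploy import pipeline, TurbomindEngineConfig
#   pipe = pipeline('OpenGVLab/InternVL2-40B',
#                   backend_config=TurbomindEngineConfig(
#                       tp=4, cache_max_entry_count=0.1))

def kv_cache_budget_gb(free_gb: float, cache_max_entry_count: float) -> float:
    """Approximate KV-cache budget as a fraction of free memory after
    the weights are loaded (how cache_max_entry_count is interpreted)."""
    return free_gb * cache_max_entry_count

# e.g. with roughly 20 GB free per GPU after loading the 40B model at tp=4:
print(kv_cache_budget_gb(20.0, 0.1))  # prints: 2.0
print(kv_cache_budget_gb(20.0, 0.8))  # prints: 16.0  (the higher default)
```

Shrinking this fraction leaves more free memory for the vision tower and activation workspaces, which is why it helps when the error is really out-of-memory in disguise.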

hitzhu commented 3 months ago

When creating the pipeline / server, try setting cache_max_entry_count to 0.1 to reduce KV-cache usage. The vision part reuses upstream code, so it is unlikely to be the source of the problem; I suspect insufficient GPU memory. How much GPU memory is left after the model starts?

Solved: 4x A100 with tp=4 errors, but 2 GPUs with tp=2 works.

irexyc commented 3 months ago

I would not call that solved; the root cause is still unknown.

hitzhu commented 3 months ago

I would not call that solved; the root cause is still unknown.

Could it be that different tp degrees lead to different model split (sharding) strategies?
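One cheap sanity check for that hypothesis (illustrative only; the dimensions below are placeholders, not InternVL2-40B's actual config values): tensor-parallel sharding typically requires attention-head counts and intermediate sizes to divide evenly by the tp degree, so a model can shard cleanly at one tp and not another.

```python
def tp_shard_ok(num_heads: int, intermediate_size: int, tp: int) -> bool:
    """True if both dimensions split evenly across tp ranks, a common
    requirement for column/row-parallel sharding (placeholder check,
    not lmdeploy's actual partitioning logic)."""
    return num_heads % tp == 0 and intermediate_size % tp == 0

# Placeholder dimensions chosen to show the tp=2-works / tp=4-fails pattern:
print(tp_shard_ok(num_heads=50, intermediate_size=13824, tp=2))  # prints: True
print(tp_shard_ok(num_heads=50, intermediate_size=13824, tp=4))  # prints: False
```

If the real config passed this kind of check at both tp=2 and tp=4, the sharding-strategy hypothesis would be weakened and memory pressure would remain the leading suspect.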

irexyc commented 3 months ago

I don't think so. If it's convenient, could you try whether the error also occurs inside this image? https://hub.docker.com/r/openmmlab/lmdeploy/tags

haoduoyu1203 commented 3 months ago

I ran into the same problem, on a single machine with one RTX 3090 and one 2080 Ti 22G. Environment info:

sys.platform: linux
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.66
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.3+9f3e748
transformers: 4.42.4
gradio: 3.50.2
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.2.0

NVIDIA Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

DefTruth commented 3 months ago

Same problem here; it occasionally gets stuck at this point:

File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)

Tracing shows that accelerate's send_to_device function never returns.
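One way to localize this kind of hang or asynchronous failure (a debugging sketch, not lmdeploy or accelerate code; safe_to is a hypothetical wrapper): force a blocking copy and synchronize immediately afterwards, so any pending CUDA error or stalled transfer surfaces at the offending call rather than somewhere later in the trace.

```python
import torch

def safe_to(tensor: torch.Tensor, device) -> torch.Tensor:
    # Hypothetical debugging wrapper: disable non_blocking so the
    # host-to-device copy completes (or fails) right here, then
    # synchronize to flush any pending asynchronous CUDA error
    # onto this line of the traceback.
    out = tensor.to(device, non_blocking=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out

# Falls back to CPU so the sketch runs anywhere:
t = safe_to(torch.ones(3), "cuda" if torch.cuda.is_available() else "cpu")
print(t.sum().item())  # prints: 3.0
```

If swapping this in (e.g. by monkey-patching the transfer at the point where the hang occurs) makes the process fail fast with a clear error instead of hanging, that narrows the problem to the asynchronous copy path rather than the model code.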