QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

[BUG] torch.cuda.OutOfMemoryError: CUDA out of memory. #1250

Closed Kaizan-wyl closed 4 months ago

Kaizan-wyl commented 4 months ago

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

当前行为 | Current Behavior

INFO: 10.0.93.12:49896 - "POST /v1/chat/completions HTTP/1.1" 200 OK
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (8192). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
INFO: 10.0.93.12:49903 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 116, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 55, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 44, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 746, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 75, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 55, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 44, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 70, in app
    response = await func(request)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "openai_api.py", line 416, in create_chat_completion
    response, _ = model.chat(
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1139, in chat
    outputs = self.generate(
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1261, in generate
    return super().generate(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2861, in sample
    outputs = self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1045, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 893, in forward
    outputs = block(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 612, in forward
    attn_outputs = self.attn(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 474, in forward
    value = torch.cat((past_value, value), dim=1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacty of 44.55 GiB of which 23.25 MiB is free. Process 57226 has 44.52 GiB memory in use. Of the allocated memory 40.10 GiB is allocated by PyTorch, and 4.10 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The model I am using is Qwen-14B; the GPU setup is shown in the attached screenshot (截屏2024-05-14 11 11 24).
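For reference, the OOM message above suggests trying max_split_size_mb when a lot of memory is reserved but unallocated. A minimal sketch of applying that suggestion and inspecting allocator state, assuming the variable is set before the server process makes its first CUDA allocation; the value 128 is only an example, not a tuned recommendation:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is initialized,
# so it has to be set before the first CUDA allocation (easiest: before importing torch).
# "max_split_size_mb:128" is only an example value to reduce fragmentation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

# Once the server has been running for a while, this shows how much of the 44.5 GiB
# is actually allocated versus merely reserved by the caching allocator.
if torch.cuda.is_available():
    print(torch.cuda.memory_summary(device=0))
```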

期望行为 | Expected Behavior

Right after startup, GPU memory usage is under 30 GB, but before long the GPU is almost fully occupied and the log prints: This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (8192). Depending on the model, you may observe exceptions, performance degradation, or nothing at all. After a while longer, the error above is raised.
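One way to avoid hitting that 8192 warning is to count tokens and drop the oldest turns before each chat call. A rough sketch, assuming the usual Qwen loading code (AutoTokenizer with trust_remote_code=True) and the model.chat(tokenizer, query, history=...) interface seen in the traceback; MAX_CONTEXT_TOKENS and the trimming policy are made up for illustration:

```python
from transformers import AutoTokenizer

# Checkpoint path taken from the container Cmd in the Environment section below.
tokenizer = AutoTokenizer.from_pretrained(
    "/data/shared/Qwen/Qwen-Chat/", trust_remote_code=True
)

MAX_CONTEXT_TOKENS = 6144  # illustrative budget, leaving headroom below the 8192 limit


def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))


def trim_history(history, query):
    """Drop the oldest (question, answer) pairs until query + history fit the budget."""
    budget = MAX_CONTEXT_TOKENS - count_tokens(query)
    trimmed = list(history)
    while trimmed and sum(count_tokens(q) + count_tokens(a) for q, a in trimmed) > budget:
        trimmed.pop(0)  # discard the oldest turn first
    return trimmed


# Usage inside the endpoint, before generation:
#   history = trim_history(history, query)
#   response, _ = model.chat(tokenizer, query, history=history)
```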

复现方法 | Steps To Reproduce

No response

运行环境 | Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
- Deployment: running in Docker with a model I fine-tuned myself; the container's Cmd is:
        "Cmd": [
            "python",
            "openai_api.py",
            "--server-port",
            "80",
            "--server-name",
            "0.0.0.0",
            "-c",
            "/data/shared/Qwen/Qwen-Chat/"
        ],

备注 | Anything else?

No response

Kaizan-wyl commented 4 months ago

After detailed testing, I found the cause: a request with a 4461-token query filled up the GPU memory, after which the log printed: This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (8192). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.

Right after that, the error above was raised.
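A back-of-the-envelope check of why long requests matter here: the torch.cat in the traceback grows the KV cache by roughly 2 × num_layers × hidden_size × 2 bytes per token per sequence. The sketch below assumes Qwen-14B's published config (40 layers, hidden size 5120) and fp16/bf16 weights of roughly 28 GB; the numbers are estimates, not measurements:

```python
# Rough KV-cache accounting for Qwen-14B (assumed config: 40 layers, hidden size 5120).
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
BYTES_PER_VALUE = 2  # fp16 / bf16


def kv_cache_bytes(num_tokens: int, batch_size: int = 1) -> int:
    # One key tensor and one value tensor per layer, per token, per sequence.
    return 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE * num_tokens * batch_size


for tokens in (4461, 8192):
    gib = kv_cache_bytes(tokens) / 1024**3
    print(f"{tokens} tokens -> ~{gib:.1f} GiB of KV cache per sequence")

# Prints roughly 3.4 GiB for 4461 tokens and 6.2 GiB for 8192 tokens. On top of
# ~26 GiB of weights, a few long or accumulated requests can exhaust a 44.5 GiB GPU.
```

That lines up with the report: the server starts out under 30 GB of usage and then climbs as long contexts accumulate in the cache, until the 90 MiB allocation in the traceback finally fails.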