QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

[BUG] torch.cuda.OutOfMemoryError: CUDA out of memory. #1250

Closed Kaizan-wyl closed 4 months ago

Kaizan-wyl commented 4 months ago

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

当前行为 | Current Behavior

INFO: 10.0.93.12:49896 - "POST /v1/chat/completions HTTP/1.1" 200 OK
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (8192). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
INFO: 10.0.93.12:49903 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 116, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 55, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 44, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 746, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 75, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 55, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/_exception_handler.py", line 44, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 70, in app
    response = await func(request)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "openai_api.py", line 416, in create_chat_completion
    response, _ = model.chat(
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1139, in chat
    outputs = self.generate(
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1261, in generate
    return super().generate(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2861, in sample
    outputs = self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1045, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 893, in forward
    outputs = block(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 612, in forward
    attn_outputs = self.attn(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 474, in forward
    value = torch.cat((past_value, value), dim=1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacty of 44.55 GiB of which 23.25 MiB is free. Process 57226 has 44.52 GiB memory in use. Of the allocated memory 40.10 GiB is allocated by PyTorch, and 4.10 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The model I am using is Qwen-14B; the GPU setup is shown in the attached screenshot (截屏2024-05-14 11 11 24).
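For reference, the OOM message above suggests trying max_split_size_mb when a lot of memory is reserved but unallocated. A minimal sketch of applying that suggestion and inspecting allocator state, assuming the variable is set before the server process makes its first CUDA allocation; the value 128 is only an example, not a tuned recommendation:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is initialized,
# so it has to be set before the first CUDA allocation (easiest: before importing torch).
# "max_split_size_mb:128" is only an example value to reduce fragmentation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

# Once the server has been running for a while, this shows how much of the 44.5 GiB
# is actually allocated versus merely reserved by the caching allocator.
if torch.cuda.is_available():
    print(torch.cuda.memory_summary(device=0))
```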

期望行为 | Expected Behavior

Right after startup, GPU memory usage is under 30 GB, but before long the GPU is almost fully occupied and the log prints: This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (8192). Depending on the model, you may observe exceptions, performance degradation, or nothing at all. After a while longer, the error above is raised.
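One way to avoid hitting that 8192 warning is to count tokens and drop the oldest turns before each chat call. A rough sketch, assuming the usual Qwen loading code (AutoTokenizer with trust_remote_code=True) and the model.chat(tokenizer, query, history=...) interface seen in the traceback; MAX_CONTEXT_TOKENS and the trimming policy are made up for illustration:

```python
from transformers import AutoTokenizer

# Checkpoint path taken from the container Cmd in the Environment section below.
tokenizer = AutoTokenizer.from_pretrained(
    "/data/shared/Qwen/Qwen-Chat/", trust_remote_code=True
)

MAX_CONTEXT_TOKENS = 6144  # illustrative budget, leaving headroom below the 8192 limit


def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))


def trim_history(history, query):
    """Drop the oldest (question, answer) pairs until query + history fit the budget."""
    budget = MAX_CONTEXT_TOKENS - count_tokens(query)
    trimmed = list(history)
    while trimmed and sum(count_tokens(q) + count_tokens(a) for q, a in trimmed) > budget:
        trimmed.pop(0)  # discard the oldest turn first
    return trimmed


# Usage inside the endpoint, before generation:
#   history = trim_history(history, query)
#   response, _ = model.chat(tokenizer, query, history=history)
```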

复现方法 | Steps To Reproduce

No response

运行环境 | Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
- Deployment: running in Docker with a model I fine-tuned myself; the container's Cmd is:
        "Cmd": [
            "python",
            "openai_api.py",
            "--server-port",
            "80",
            "--server-name",
            "0.0.0.0",
            "-c",
            "/data/shared/Qwen/Qwen-Chat/"
        ],

备注 | Anything else?

No response

Kaizan-wyl commented 4 months ago

After detailed testing, I found the cause: a request with a 4461-token query filled up the GPU memory, after which the log printed: This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (8192). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.

Right after that, the error above was raised.
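A back-of-the-envelope check of why long requests matter here: the torch.cat in the traceback grows the KV cache by roughly 2 × num_layers × hidden_size × 2 bytes per token per sequence. The sketch below assumes Qwen-14B's published config (40 layers, hidden size 5120) and fp16/bf16 weights of roughly 28 GB; the numbers are estimates, not measurements:

```python
# Rough KV-cache accounting for Qwen-14B (assumed config: 40 layers, hidden size 5120).
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
BYTES_PER_VALUE = 2  # fp16 / bf16


def kv_cache_bytes(num_tokens: int, batch_size: int = 1) -> int:
    # One key tensor and one value tensor per layer, per token, per sequence.
    return 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE * num_tokens * batch_size


for tokens in (4461, 8192):
    gib = kv_cache_bytes(tokens) / 1024**3
    print(f"{tokens} tokens -> ~{gib:.1f} GiB of KV cache per sequence")

# Prints roughly 3.4 GiB for 4461 tokens and 6.2 GiB for 8192 tokens. On top of
# ~26 GiB of weights, a few long or accumulated requests can exhaust a 44.5 GiB GPU.
```

That lines up with the report: the server starts out under 30 GB of usage and then climbs as long contexts accumulate in the cache, until the 90 MiB allocation in the traceback finally fails.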