eosphoros-ai / DB-GPT

AI Native Data App Development framework with AWEL(Agentic Workflow Expression Language) and Agents
http://docs.dbgpt.cn
MIT License
13.39k stars 1.78k forks

[Bug] [Module Name] torch.cuda.OutOfMemoryError: CUDA out of memory. #1679

Closed yuerf closed 1 month ago

yuerf commented 3 months ago

Search before asking

Operating system information

Linux

Python version information

>=3.11

DB-GPT version

main

Related scenes

Installation Information

Device information

-

Models information

-

What happened

```
Traceback (most recent call last):
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/transformers/generation/utils.py", line 1622, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/transformers/generation/utils.py", line 2791, in _sample
    outputs = self(
              ^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1183, in forward
    logits = logits.float()
             ^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 416.00 MiB. GPU 2 has a total capacity of 23.64 GiB of which 269.12 MiB is free. Process 4450 has 13.05 GiB memory in use. Including non-PyTorch memory, this process has 10.30 GiB memory in use. Of the allocated memory 9.69 GiB is allocated by PyTorch, and 164.85 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
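The error message itself suggests one mitigation for allocator fragmentation. A minimal sketch of applying it, assuming DB-GPT is launched from the same shell (the launch command shown in the comment is illustrative, not DB-GPT's confirmed entry point):

```shell
# Enable PyTorch's expandable-segments allocator before starting DB-GPT,
# as the OOM message suggests. This is a mitigation sketch, not a guaranteed fix.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Then launch DB-GPT as usual in the same shell, e.g.:
#   python dbgpt_server.py   # illustrative; substitute your actual start command
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Note this only reduces fragmentation-related failures; it does not free memory held by other processes on the card.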

GPU 2 clearly still has plenty of memory free. Is the memory allocation wrong? Is the GPU id being mapped to the wrong card?

What you expected to happen

-

How to reproduce

CUDA_VISIBLE_DEVICES=1,2,3
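One possible explanation for the confusion above: with `CUDA_VISIBLE_DEVICES` set, PyTorch's logical device indices are renumbered from 0 over the visible GPUs only, so "GPU 2" in the traceback need not be physical GPU 2 in `nvidia-smi`. A small sketch of that remapping (`logical_to_physical` is a hypothetical helper, not a DB-GPT or PyTorch API):

```python
import os


def logical_to_physical(logical_index: int) -> int:
    """Map a logical CUDA device index (as PyTorch sees it) to the
    physical GPU id shown by nvidia-smi, based on CUDA_VISIBLE_DEVICES.
    Hypothetical helper for illustration only."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if not visible:
        # No masking in effect: logical and physical indices coincide.
        return logical_index
    physical_ids = [int(x) for x in visible.split(",")]
    return physical_ids[logical_index]


# Reproduce the setting from this issue:
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"

# Logical cuda:2 inside the process is physical GPU 3 in nvidia-smi.
print(logical_to_physical(2))  # → 3
```

So the "GPU 2" reported by the OOM error may correspond to physical GPU 3, which could explain why the card that looks idle in `nvidia-smi` is not the one that ran out of memory.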

Additional context

No response

Are you willing to submit PR?

fangyinc commented 3 months ago

Similar issues #839

github-actions[bot] commented 2 months ago

This issue has been marked as stale, because it has been over 30 days without any activity.

github-actions[bot] commented 1 month ago

This issue has been closed, because it has been marked as stale and there has been no activity for over 7 days.