eosphoros-ai / DB-GPT

AI Native Data App Development framework with AWEL(Agentic Workflow Expression Language) and Agents
http://docs.dbgpt.cn
MIT License
13.39k stars 1.78k forks

[Bug] [Module Name] torch.cuda.OutOfMemoryError: CUDA out of memory. #1679

Closed yuerf closed 1 month ago

yuerf commented 3 months ago

Search before asking

Operating system information

Linux

Python version information

>=3.11

DB-GPT version

main

Related scenes

Installation Information

Device information

-

Models information

-

What happened

```
Traceback (most recent call last):
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/transformers/generation/utils.py", line 1622, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/transformers/generation/utils.py", line 2791, in _sample
    outputs = self(
              ^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/test2/anaconda3/envs/dbgpt_new/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1183, in forward
    logits = logits.float()
             ^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 416.00 MiB. GPU 2 has a total capacity of 23.64 GiB of which 269.12 MiB is free. Process 4450 has 13.05 GiB memory in use. Including non-PyTorch memory, this process has 10.30 GiB memory in use. Of the allocated memory 9.69 GiB is allocated by PyTorch, and 164.85 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
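The error message itself suggests one mitigation for allocator fragmentation. A minimal sketch of applying it, assuming DB-GPT is launched from the same shell (the launch command shown in the comment is illustrative, not DB-GPT's confirmed entry point):

```shell
# Enable PyTorch's expandable-segments allocator before starting DB-GPT,
# as the OOM message suggests. This is a mitigation sketch, not a guaranteed fix.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Then launch DB-GPT as usual in the same shell, e.g.:
#   python dbgpt_server.py   # illustrative; substitute your actual start command
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Note this only reduces fragmentation-related failures; it does not free memory held by other processes on the card.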

GPU 2 clearly still has plenty of memory free. Is the memory allocation wrong? Is the GPU id being mapped to the wrong card?

What you expected to happen

-

How to reproduce

CUDA_VISIBLE_DEVICES=1,2,3
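One possible explanation for the confusion above: with `CUDA_VISIBLE_DEVICES` set, PyTorch's logical device indices are renumbered from 0 over the visible GPUs only, so "GPU 2" in the traceback need not be physical GPU 2 in `nvidia-smi`. A small sketch of that remapping (`logical_to_physical` is a hypothetical helper, not a DB-GPT or PyTorch API):

```python
import os


def logical_to_physical(logical_index: int) -> int:
    """Map a logical CUDA device index (as PyTorch sees it) to the
    physical GPU id shown by nvidia-smi, based on CUDA_VISIBLE_DEVICES.
    Hypothetical helper for illustration only."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if not visible:
        # No masking in effect: logical and physical indices coincide.
        return logical_index
    physical_ids = [int(x) for x in visible.split(",")]
    return physical_ids[logical_index]


# Reproduce the setting from this issue:
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"

# Logical cuda:2 inside the process is physical GPU 3 in nvidia-smi.
print(logical_to_physical(2))  # → 3
```

So the "GPU 2" reported by the OOM error may correspond to physical GPU 3, which could explain why the card that looks idle in `nvidia-smi` is not the one that ran out of memory.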

Additional context

No response

Are you willing to submit PR?

fangyinc commented 3 months ago

Similar issues #839

github-actions[bot] commented 2 months ago

This issue has been marked as stale, because it has been over 30 days without any activity.

github-actions[bot] commented 1 month ago

This issue has been closed, because it has been marked as stale and there has been no activity for over 7 days.