THUDM / CogVLM

a state-of-the-art open visual language model | multimodal pretrained model
Apache License 2.0

Getting OutOfMemoryError when running inference with the model ☹️ #236

Closed. OnlinePage closed this issue 9 months ago.

OnlinePage commented 9 months ago

Hi, I am trying to run the model on Lambda Labs GPU instances (A10 and H100), but I hit an OutOfMemoryError every time on both of them.

torch.cuda.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 67.00 MiB is free. 
Process 23489 has 21.91 GiB memory in use. 
Of the allocated memory 21.69 GiB is allocated by PyTorch, and 1.55 MiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
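
(Side note: the allocator hint in that message is set through an environment variable before the first CUDA allocation. A minimal sketch follows; the 128 MiB split size is only an example value, and the hint will not help when the model simply does not fit, as turns out to be the case below:)

# Set before the first CUDA allocation, e.g. at the top of the entry script.
# 128 MiB is an example value; tune it for your workload.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # the caching allocator reads the variable when it initializes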

Below is the error trace with some GPU logs.


TORCH GPU IS AVAILABLE :  True
TORCH VERSION:  2.0.1+cu118
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Requested memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| GPU reserved memory   |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Non-releasable memory |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

A new version of the following files was downloaded from https://huggingface.co/THUDM/cogagent-chat-hf:
- cross_visual.py
- visual.py
Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Please 'pip install apex'
Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]
Loading checkpoint shards:  12%|█▎        | 1/8 [00:00<00:01,  5.63it/s]
Loading checkpoint shards:  25%|██▌       | 2/8 [00:00<00:01,  5.14it/s]
Loading checkpoint shards:  38%|███▊      | 3/8 [00:00<00:01,  4.83it/s]
Loading checkpoint shards:  50%|█████     | 4/8 [00:00<00:00,  4.80it/s]
Loading checkpoint shards:  62%|██████▎   | 5/8 [00:01<00:00,  4.57it/s]
Loading checkpoint shards:  75%|███████▌  | 6/8 [00:01<00:00,  4.24it/s]
Loading checkpoint shards:  88%|████████▊ | 7/8 [00:01<00:00,  3.83it/s]
Loading checkpoint shards: 100%|██████████| 8/8 [00:02<00:00,  3.00it/s]
Loading checkpoint shards: 100%|██████████| 8/8 [00:02<00:00,  3.76it/s]
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/cog/server/worker.py", line 185, in _setup
    run_setup(self._predictor)
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/cog/predictor.py", line 66, in run_setup
    predictor.setup()
  File "/src/predict.py", line 33, in setup
    .to("cuda")
     ^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2271, in to
    return super().to(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ⅹ Model setup failed

I tried multiple times, but it fails every time. My inference code runs via cog.

Any help is highly appreciated🙏!!

1049451037 commented 9 months ago

Use load_in_4bit=True if you have limited GPU memory.
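
For example, following the Hugging Face usage pattern in the repo, a minimal sketch (the tokenizer checkpoint and the bitsandbytes dependency are assumptions; adapt to your setup):

import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogagent-chat-hf",
    torch_dtype=torch.float16,
    load_in_4bit=True,        # 4-bit quantization at load time (requires bitsandbytes)
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
# Do not call model.to("cuda") afterwards: that call is what fails in the
# traceback above, and the quantized weights are already placed on the GPU.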

zRzRzRzRzRzRzR commented 9 months ago

Try setting a different model device mapping, because your two cards are different, or try int4.
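
A minimal sketch of the device-map route, assuming a machine with accelerate installed (an alternative to the manual .to("cuda") in predict.py):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogagent-chat-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",  # let accelerate dispatch layers across available GPUs (and CPU if needed)
).eval()
# With device_map set, the weights are placed during loading,
# so the .to("cuda") call from the failing setup() must be removed.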

OnlinePage commented 9 months ago

@zRzRzRzRzRzRzR @1049451037 I tried to run it on cloud GPUs, once on an H100 (48GB) and then on an A10 (24GB); on both of them it threw the same error ☹️

1049451037 commented 9 months ago

I don't know what code you are running. It's a problem in your code. We have provided code that can run on a 24GB GPU.

1049451037 commented 9 months ago

python cli_demo_hf.py --from_pretrained THUDM/cogagent-chat-hf --fp16 --quant 4
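
(For context: --fp16 loads the weights in half precision and --quant 4 quantizes them on load, which, per the maintainer's comment above, is what brings the model within a 24GB GPU's memory.)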