FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

RuntimeError: CUDA error: out of memory | WSL2 | RTX 3090 | OPT-6.7B #47

Closed · ekiwi111 closed this issue 1 year ago

ekiwi111 commented 1 year ago

Problem

Fresh git clone. Running the command python -m flexgen.flex_opt --model facebook/opt-6.7b gives the following output:

I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
model size: 12.386 GB, cache size: 1.062 GB, hidden size (prefill): 0.017 GB
warmup - init weights
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/user/anaconda3/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/Developer/FlexGen/flexgen/pytorch_backend.py", line 881, in copy_worker_func
    cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[Identical tracebacks, each ending in RuntimeError: CUDA error: out of memory, are raised in Thread-2, Thread-3, and Thread-4.]
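
For reference, the failing call can be isolated into a standalone snippet to check whether a pinned host allocation of this size works at all under WSL2. A minimal sketch (GB = 1 << 30 is assumed to match the constant used in flexgen/pytorch_backend.py):

import torch

GB = 1 << 30  # assumed to match the constant in flexgen/pytorch_backend.py

try:
    # Mirrors the call in copy_worker_func: a 2 GiB pinned (page-locked)
    # host buffer of float16 elements.
    cpu_buf = torch.empty((1 * GB,), dtype=torch.float16, pin_memory=True)
    print("pinned allocation succeeded")
except RuntimeError as err:
    print("pinned allocation failed:", err)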

Setup

System: WSL2 on a Windows host with an NVIDIA GeForce RTX 3090 (24 GB).

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 528.49       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   55C    P0   112W / 350W |   2363MiB / 24576MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

top output:

top - 14:37:12 up 29 min,  0 users,  load average: 0.02, 0.08, 0.08
Tasks:  12 total,   1 running,  11 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64108.9 total,  57727.9 free,    195.4 used,   6185.6 buff/cache
MiB Swap:  16384.0 total,  16384.0 free,      0.0 used.  63209.2 avail Mem

df -h / output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        251G  102G  137G  43% /

Please help!

ekiwi111 commented 1 year ago

Never mind; this is related to how pin_memory works under WSL2. See https://github.com/microsoft/WSL/issues/8447#issuecomment-1235512935. I haven't solved the issue in WSL itself, but I set up a conda environment on the Windows host instead.
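
For anyone who wants to stay on WSL2, one possible workaround (untested here, and not an upstream fix) is to fall back to a regular pageable buffer when the pinned allocation is refused. A rough sketch; alloc_copy_buffer is a hypothetical helper, not part of FlexGen:

import torch

GB = 1 << 30  # assumed to match the constant in flexgen/pytorch_backend.py

def alloc_copy_buffer(n_elements=1 * GB):
    # Try a pinned (page-locked) host buffer first; fall back to pageable
    # memory if the pinned allocation fails (as it can under WSL2).
    # Pageable memory slows CPU<->GPU copies but keeps the program running.
    try:
        return torch.empty((n_elements,), dtype=torch.float16, pin_memory=True)
    except RuntimeError:
        return torch.empty((n_elements,), dtype=torch.float16, pin_memory=False)

cpu_buf = alloc_copy_buffer()
print("pinned:", cpu_buf.is_pinned())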