Overv / vramfs

VRAM-based file system for Linux

Although vramfs is enabled, CUDA OOM still happens #39

Open · leemgs opened this issue 3 months ago

leemgs commented 3 months ago

Hello. I want to use vramfs as swap space for NVIDIA GPU memory. After reading the README.md file, I allocated 20 GB to vramfs. When I ran the nvidia-smi command, I was happy to see that vramfs had claimed the 20 GB, as shown below.

# vramfs /tmp/vram 20G
# nvidia-smi
Tue Jun 18 13:22:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:21:00.0 Off |                    0 |
| N/A   40C    P0              65W / 300W |  76773MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2867      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A   1856687      C   bin/vramfs                                20892MiB | <--- 20GB for OpenCL
|    0   N/A  N/A   1906793      C   /opt/conda/bin/python3.10                 51754MiB |
|    0   N/A  N/A   1988805      C   /usr/bin/python                            2670MiB |
|    0   N/A  N/A   3729345      C   /usr/bin/python                            1418MiB |
+---------------------------------------------------------------------------------------+
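Before layering swap on top, the mount itself can be sanity-checked; something like the following should confirm that the FUSE file system is live (exact figures will vary):

# df -h /tmp/vram
# mount | grep vram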

Then I created a 10 GB swapfile at /tmp/vram/swapfile, as follows.

# cd /tmp/vram
# LOOPDEV=$(losetup -f)
# truncate -s 10G swapfile # replace 10G with the target swap size; it must be smaller than the vramfs allocation (e.g. 20G)
# losetup $LOOPDEV swapfile
# mkswap $LOOPDEV
# swapon $LOOPDEV
# cat /proc/swaps
   Filename                                Type            Size            Used            Priority
   /dev/loop7                              partition       10485756        0               -3
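I left the kernel-assigned priority at its default here, but if the machine has other swap devices, the loop device can be preferred explicitly (a sketch; swapon -p takes a value up to 32767, and higher wins):

# swapoff $LOOPDEV
# swapon -p 10 $LOOPDEV
# cat /proc/swaps   # the Priority column should now show 10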

# vi /etc/security/limits.conf
leemgs hard memlock unlimited
leemgs soft memlock unlimited
leemgs hard rtprio unlimited
leemgs soft rtprio unlimited
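Note that limits.conf changes only take effect for new login sessions (applied by pam_limits), so after logging back in they can be verified with, e.g.:

$ ulimit -l   # max locked memory; should report "unlimited"
$ ulimit -r   # max realtime priority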

However, when I used the open-source project axolotl (https://github.com/OpenAccess-AI-Collective/axolotl) to run model training as shown below, I got a CUDA OOM error (torch.cuda.OutOfMemoryError: CUDA out of memory).

$ accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml

# cat /proc/swaps
   Filename                                Type            Size            Used            Priority
   /dev/loop7                              partition       10485756        0               -2

As you can see, the used swap space on /dev/loop7 is still 0, which is strange.
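One way to confirm whether the kernel ever pages anything out during the run is to watch swap activity from a second terminal while training (a monitoring sketch):

$ vmstat 1                    # si/so columns show pages swapped in/out per second
$ watch -n1 cat /proc/swaps   # the Used column should grow if swap is touched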

So I was wondering: is it possible to use `vramfs` as swap space for NVIDIA GPU memory? Any hints or clues are welcome.
Overv commented 3 months ago

Am I understanding correctly that you want to use GPU memory as swap space for GPU memory?