fudan-zvg / meta-prompts

How to train this model with limited GPU memory? #3

Closed by LT1st 9 months ago

LT1st commented 9 months ago

I am using two 3090 GPUs with 24 GB of memory each, but training fails with torch.cuda.OutOfMemoryError: CUDA out of memory.

How can I train the model on this hardware? The nvidia-smi output and the full traceback are below.

(metap) (base) spai@spai-WS-E900-G4-WS980T:~/code/SD/meta-prompts/depth$ nvidia-smi
Wed Mar  6 19:01:24 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:5E:00.0  On |                  N/A |
|  0%   32C    P8    28W / 350W |    365MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 53%   32C    P8    21W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     16179      G   /usr/lib/xorg/Xorg                363MiB |
|    1   N/A  N/A     16179      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
Traceback (most recent call last):
  File "train.py", line 375, in <module>
    main()
  File "train.py", line 163, in main
    loss_train = train(train_loader, model, criterion_d, log_txt, optimizer=optimizer, 
  File "train.py", line 259, in train
    optimizer.step()
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/optim/adamw.py", line 171, in step
    adamw(
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/optim/adamw.py", line 321, in adamw
    func(
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/optim/adamw.py", line 566, in _multi_tensor_adamw
    denom = torch._foreach_add(exp_avg_sq_sqrt, eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 23.67 GiB total capacity; 21.50 GiB already allocated; 73.75 MiB free; 21.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
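
The error message suggests max_split_size_mb only when reserved memory is much larger than allocated memory, so it may help to check that gap first. The following is a minimal sketch that does the arithmetic on the figures quoted in the message above; the helper name to_mib is made up for illustration and is not part of PyTorch.

```python
import re

# Figures copied verbatim from the OutOfMemoryError message above.
message = (
    "Tried to allocate 114.00 MiB (GPU 0; 23.67 GiB total capacity; "
    "21.50 GiB already allocated; 73.75 MiB free; "
    "21.77 GiB reserved in total by PyTorch)"
)

MIB_PER_UNIT = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(pattern: str) -> float:
    """Hypothetical helper: return the quantity matched by `pattern`, in MiB."""
    value, unit = re.search(pattern, message).groups()
    return float(value) * MIB_PER_UNIT[unit]


requested = to_mib(r"allocate ([\d.]+) (MiB|GiB)")
allocated = to_mib(r"([\d.]+) (MiB|GiB) already allocated")
reserved = to_mib(r"([\d.]+) (MiB|GiB) reserved")
free = to_mib(r"([\d.]+) (MiB|GiB) free")

# Memory PyTorch holds in its cache but has not handed out to tensors.
cached_unused = reserved - allocated

print(f"requested:     {requested:.2f} MiB")
print(f"free on GPU:   {free:.2f} MiB")
print(f"cached unused: {cached_unused:.2f} MiB "
      f"({100 * cached_unused / reserved:.1f}% of reserved)")
```

Here the cached but unused memory is roughly 276 MiB, a little over 1% of what PyTorch has reserved, so reserved is not much larger than allocated. That suggests the card is genuinely full rather than fragmented, and tuning PYTORCH_CUDA_ALLOC_CONF alone is unlikely to be enough.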

ascacdsaa commented 8 months ago

Hi, I am training with two 4090 GPUs and also ran out of GPU memory. I saw your earlier question and wanted to ask how you solved it. Thanks!

LT1st commented 8 months ago

I switched to a different GPU.
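
For anyone landing here with the same error: the traceback fails inside AdamW's step, and AdamW keeps two extra tensors (exp_avg and exp_avg_sq) for every trainable parameter. PyTorch allocates that state lazily on the first optimizer step, so memory can jump at that point even if the forward and backward passes fit. Below is a rough budget sketch, assuming plain fp32 training. The function name and the parameter count are hypothetical and are not taken from this repository.

```python
def training_state_gib(trainable_params: int, bytes_per_value: int = 4) -> dict:
    """Estimate memory for weights, gradients and AdamW state, in GiB.

    Counts trainable parameters only. Frozen weights, activations and
    temporary buffers are not included.
    """
    gib = 1024 ** 3
    weights = trainable_params * bytes_per_value / gib
    grads = weights            # one gradient value per trainable weight
    adamw_state = 2 * weights  # exp_avg and exp_avg_sq
    return {
        "weights": weights,
        "grads": grads,
        "adamw_state": adamw_state,
        "total": weights + grads + adamw_state,
    }


# Hypothetical parameter count, chosen only to illustrate the arithmetic.
estimate = training_state_gib(1_000_000_000)
for name, value in estimate.items():
    print(f"{name:12s} {value:6.2f} GiB")
```

Under these assumptions the fixed cost is 16 bytes per trainable parameter. Activations, which grow with batch size and input resolution, come on top of that, so they are usually the part that can be reduced on a 24 GB card.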