LaurentMazare / diffusers-rs

An implementation of the diffusers API in Rust

CUDA out of memory on 12GB GPU #61

Closed: jroddev closed this issue 1 year ago

jroddev commented 1 year ago

My GPU has 12GB memory (11GB free) but I still get CUDA out of memory.

 nvidia-smi
Thu May  4 18:32:15 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA TITAN Xp                 Off| 00000000:2B:00.0  On |                  N/A |
| 24%   40C    P0               64W / 250W|   1079MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
 cargo run --example stable-diffusion --features clap -- --prompt "A rusty robot holding a fire torch."                                                
    Finished dev [unoptimized + debuginfo] target(s) in 0.18s
     Running `target/debug/examples/stable-diffusion --prompt 'A rusty robot holding a fire torch.'`
Cuda available: true
Cudnn available: true
MPS available: false
Running with prompt "hello".
Building the Clip transformer.
Building the autoencoder.
Building the unet.
Timestep 0/30
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Torch("CUDA out of memory. Tried to allocate 3.16 GiB (GPU 0; 11.87 GiB total capacity; 8.27 GiB already allocated; 1.68 GiB free; 8.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:936 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7faacc05a6bb in /home/jarrod/libtorch/lib/libc10.so)
frame #1: <unknown function> + 0x2f176 (0x7faacba2f176 in /home/jarrod/libtorch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2fc12 (0x7faacba2fc12 in /home/jarrod/libtorch/lib/libc10_cuda.so)
...
frame #34: <unknown function> + 0x23790 (0x7faacbe3c790 in /usr/lib/libc.so.6)
frame #35: __libc_start_main + 0x8a (0x7faacbe3c84a in /usr/lib/libc.so.6)
frame #36: <unknown function> + 0x4f0f5 (0x55e9bd45e0f5 in ./target/release/examples/stable-diffusion)
")', /home/jroddev/.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.11.0/src/wrappers/tensor_generated.rs:15578:36

Is there a setting/flag that I'm missing? Should 12GB be sufficient to run this?

LaurentMazare commented 1 year ago

This should be able to run on an 8GB GPU by using fp16 weights and putting only the unet weights on the GPU. There are more details in the main README.
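
Concretely, that looks something like the following (the --cpu flags are the same ones used in the script further down this thread; fp16 comes from the weight files you load rather than a command-line flag):

cargo run --example stable-diffusion --features clap -- \
    --cpu vae \
    --cpu clip \
    --prompt "A rusty robot holding a fire torch."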

jroddev commented 1 year ago

Looks like v1.5 works at full precision on my 8GB GPU. I now have v2.1 working with fp16 using this script:

#!/bin/bash

# Point tch at the local libtorch install.
export LIBTORCH=$HOME/libtorch
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH

# Reduce CUDA allocator fragmentation (see the PyTorch memory management docs).
export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:64
export RUST_BACKTRACE=1
export CARGO_TARGET_DIR=target2

# Keep the vae and clip models on the CPU so only the unet uses GPU memory.
cargo run \
    --example stable-diffusion \
    --features clap -- \
    --cpu vae \
    --cpu clip \
    --prompt "$1"

max_split_size_mb worked at 128, but very unreliably, so I dropped it to 64 and it seems good now. The README only links fp16 weights for v1.5, so I generated my own from stabilityai's fp16 branch at https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/fp16, following the README instructions here: https://github.com/LaurentMazare/diffusers-rs#converting-the-original-weight-files
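
For anyone repeating this, a rough sketch of the fetch-and-convert flow (file names and paths here are illustrative; the linked README section has the exact per-model commands):

# Fetch the fp16 checkpoints from the fp16 branch (needs git-lfs).
git lfs install
git clone -b fp16 https://huggingface.co/stabilityai/stable-diffusion-2-1

# Convert a PyTorch .bin state dict to .npz as an intermediate step;
# the README then turns the .npz into the .ot files diffusers-rs loads.
python3 -c "
import numpy as np, torch
m = torch.load('stable-diffusion-2-1/unet/diffusion_pytorch_model.bin', map_location='cpu')
np.savez('unet.npz', **{k: v.numpy() for k, v in m.items()})
"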