This should be able to run on an 8GB GPU by using fp16 weights and putting only the unet weights on the GPU. There are more details in the main readme.
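For concreteness, a sketch of that recipe as a single invocation. The fp16 weight paths and the --clip-weights/--vae-weights/--unet-weights flag names are assumptions, not confirmed against the example's CLI (check its help output); only --cpu and --prompt are confirmed in this thread.

# Hedged sketch: the *-weights flag names and the fp16 .ot paths below are
# assumptions; --cpu offloads the named stage to the CPU.
cargo run --example stable-diffusion --features clap -- \
    --cpu vae \
    --cpu clip \
    --clip-weights data/clip_fp16.ot \
    --vae-weights data/vae_fp16.ot \
    --unet-weights data/unet_fp16.ot \
    --prompt "A rusty robot holding a fire torch"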
Looks like v1.5 works at full precision on my 8GB GPU, and I now have v2.1 working with fp16 using this script:
#!/bin/bash
# Run the v2.1 model with fp16 weights on a memory-constrained GPU.
export LIBTORCH=$HOME/libtorch
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
# Tune the CUDA caching allocator to reduce fragmentation-related OOMs.
export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:64
export RUST_BACKTRACE=1
export CARGO_TARGET_DIR=target2
# Keep only the unet on the GPU; run the vae and clip stages on the CPU.
cargo run \
    --example stable-diffusion \
    --features clap -- \
    --cpu vae \
    --cpu clip \
    --prompt "$1"
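To use it, save the script and pass the prompt as the first argument (the filename here is hypothetical):

chmod +x sd21.sh
./sd21.sh "A rusty robot holding a fire torch"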
max_split_size_mb worked at 128 MB but very unreliably, so I dropped it to 64 MB and it seems stable now.
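For anyone tuning this on a different card, the knob is just the environment variable; the values below are the two settings tried here:

# Reliable on this 8GB card:
export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:64
# Worked at 128 but only intermittently:
# export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128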
The readme only links to the v1.5 fp16 weights, so I generated my own from stabilityai's fp16 branch at https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/fp16, following the readme instructions at https://github.com/LaurentMazare/diffusers-rs#converting-the-original-weight-files
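For reference, the fp16 branch can be pulled directly with git (the large weight files need git-lfs); the .bin -> .npz -> .ot conversion itself is as described in the linked readme section:

# Fetch only the fp16 branch of the official weights.
git lfs install
git clone --branch fp16 --single-branch https://huggingface.co/stabilityai/stable-diffusion-2-1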
My GPU has 12GB of memory (11GB free) but I still get

CUDA out of memory.

Is there a setting or flag that I'm missing? Should 12GB be sufficient to run this?
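One thing worth double-checking is how much of the 12GB is genuinely free when the process starts; a desktop session or another process can hold a surprising amount:

# Report total/used/free GPU memory as seen by the driver.
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv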