elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

Reduce StableDiffusion memory usage #147

Open josevalim opened 1 year ago

josevalim commented 1 year ago

A list of ideas to explore:

josevalim commented 9 months ago

More on attention: https://pytorch.org/blog/flash-decoding/

bfolkens commented 8 months ago

I'd also suggest FlashAttention-2 and Medusa

josevalim commented 7 months ago

Alternative to DPM Solver: https://arxiv.org/abs/2311.05556

josevalim commented 7 months ago

More notes on optimizations here:

jonatanklosko commented 7 months ago

I tested SD v1-4 on a GPU using the new lower-precision options params_variant: "fp16" and type: :bf16. Here are a couple of runs:

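| Type | Steps | Batch, Images | Time | Memory | Lazy transfers |
| --- | --- | --- | --- | --- | --- |
| bf16 | 20 | 1, 1 | 0.7s | 4669MiB | No |
| bf16 | 20 | 1, 4 | 2.2s | 8769MiB | No |
| f32 | 20 | 1, 1 | 1.3s | 8759MiB | No |
| f32 | 20 | 1, 4 | 4.3s | 16951MiB | No |
| bf16 | 20 | 1, 1 | 3.7s | 6957MiB | Yes |
| f32 | 20 | 1, 1 | 8.2s | 13379MiB | Yes |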

Note that the reported memory is just the final reading after running with preallocate: false, so it is not fully reliable. XLA even reserves memory at compilation time; my guess is that it runs some example operations to pick a preferable algorithm or fine-tune algorithm parameters. That said, it seems clear that bf16 reduces both memory and time by roughly a factor of 2. Weirdly, lazy transfers seem to bump the memory usage (but that doesn't mean that much memory is required in practice; it's just XLA bumping the reservation, see below).
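To make the "roughly a factor of 2" claim concrete, here is a small sketch that computes the f32-to-bf16 ratios from the numbers in the table above. The map layout and variable names are only illustrative; the timings and memory readings are copied from the table as-is.

```elixir
# {type, images, lazy_transfers?} => {seconds, MiB}, copied from the table above.
runs = %{
  {:bf16, 1, false} => {0.7, 4669},
  {:bf16, 4, false} => {2.2, 8769},
  {:f32, 1, false} => {1.3, 8759},
  {:f32, 4, false} => {4.3, 16951},
  {:bf16, 1, true} => {3.7, 6957},
  {:f32, 1, true} => {8.2, 13379}
}

# For every f32 run, divide by the matching bf16 run (same images, same lazy setting).
ratios =
  for {{:f32, images, lazy}, {f32_time, f32_mem}} <- runs, into: %{} do
    {bf16_time, bf16_mem} = Map.fetch!(runs, {:bf16, images, lazy})

    {{images, lazy},
     %{
       time: Float.round(f32_time / bf16_time, 2),
       memory: Float.round(f32_mem / bf16_mem, 2)
     }}
  end

IO.inspect(ratios, label: "f32 / bf16")
```

The memory ratios land between about 1.88 and 1.93 and the time ratios between about 1.86 and 2.22, which matches the rough factor of 2, keeping in mind the caveat about how reliable the memory readings are.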

Source (first row)

````markdown
# Stable Diffusion testing

```elixir
Mix.install([
  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
  {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
  {:axon, github: "elixir-nx/axon", override: true},
  {:kino, "~> 0.11.3"},
  {:bumblebee, github: "elixir-nx/bumblebee"}
])

Application.put_env(:exla, :clients,
  host: [platform: :host],
  cuda: [platform: :cuda, preallocate: false]
  # cuda: [platform: :cuda, memory_fraction: 0.3]
  # cuda: [platform: :cuda]
)

Application.put_env(:exla, :preferred_clients, [:cuda, :host])

Nx.global_default_backend({EXLA.Backend, client: :host})
```

## init

```elixir
with {output, 0} <- System.shell("nvidia-smi --query-gpu=memory.total,memory.used --format=csv") do
  IO.puts(output)
end
```

## Stable Diffusion fp16

```elixir
repository_id = "CompVis/stable-diffusion-v1-4"

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})

{:ok, clip} =
  Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"},
    params_variant: "fp16",
    type: :bf16
  )

{:ok, unet} =
  Bumblebee.load_model({:hf, repository_id, subdir: "unet"},
    params_variant: "fp16",
    type: :bf16
  )

{:ok, vae} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :decoder,
    params_variant: "fp16",
    type: :bf16
  )

{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repository_id, subdir: "scheduler"})

clip = update_in(clip.params, &Nx.backend_copy(&1, {EXLA.Backend, client: :cuda}))
unet = update_in(unet.params, &Nx.backend_copy(&1, {EXLA.Backend, client: :cuda}))
vae = update_in(vae.params, &Nx.backend_copy(&1, {EXLA.Backend, client: :cuda}))

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 20,
    num_images_per_prompt: 1,
    compile: [batch_size: 1, sequence_length: 60],
    defn_options: [compiler: EXLA]
  )

Kino.start_child({Nx.Serving, name: SD, serving: serving})
```

```elixir
prompt = "numbat, forest, high quality, detailed, digital art"

output = Nx.Serving.batched_run(SD, prompt)

for result <- output.results do
  Kino.Image.new(result.image)
end
|> Kino.Layout.grid(columns: 2)
```
````

jonatanklosko commented 7 months ago

I experimented with different values of memory_fraction as an upper limit. For the first entry in the table above:

So lazy transfers do help a bit, but they come with a significant slowdown.

What's interesting though is that preallocate_params requires more memory than manual backend_copy. It's even more surprising given that the OOM happens at serving runtime, not during the params preallocation.
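For anyone who wants to repeat this kind of experiment: the upper limit is set through the memory_fraction option of the EXLA client, as in the commented-out line of the notebook above. The snippet below is a minimal sketch of that configuration; the value 0.3 is simply the one that appears in the notebook, not a recommendation.

```elixir
# Cap the CUDA client at a fraction of the GPU memory instead of
# disabling preallocation. 0.3 is the value from the notebook's
# commented-out line; adjust it to probe where the OOM starts.
Application.put_env(:exla, :clients,
  host: [platform: :host],
  cuda: [platform: :cuda, memory_fraction: 0.3]
)

Application.put_env(:exla, :preferred_clients, [:cuda, :host])
```

In the notebook this sits right after Mix.install, where the original Application.put_env calls are.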

josevalim commented 7 months ago

preallocate/jit will transfer the data twice: once as arguments and once as the return value. So we probably need a new callback/abstraction to make this easier :D

jonatanklosko commented 7 months ago

FTR fixed in #317, now preallocate_params: true effectively does backend_copy :)

josevalim commented 5 months ago

I have added an entry for LCM+LoRA. @wtedw may have input here (and we may need to update/release Axon first). /cc @seanmor5

seanmor5 commented 5 months ago

I think we should update Axon to better support LoRA. I have a draft in place right now, but I have to revisit it to make it work as I intend :)

wtedw commented 5 months ago

LCM just adapts these nodes in the unet model: https://github.com/wtedw/lorax/blob/main/lib/lorax/lcm.ex#L121-L139 The weights can be found here: https://huggingface.co/latent-consistency/lcm-lora-sdv1-5

For Bumblebee, (if trying to make it compatible w/ most LoRA files in HuggingFace)

If you guys need any PRs, lmk!

josevalim commented 4 months ago

Just a heads up that Stability AI just announced Stable Diffusion 3, which makes us wonder how much effort we should pour into SD vs SDXL vs SD3. It probably still makes sense to support LoRA on Stable Diffusion, because that will require improvements in Axon and elsewhere that we could use for other models, but custom schedulers and token merging are up for debate at the moment.

jonatanklosko commented 4 months ago

Checking off attention slicing. It has actually been removed from the diffusers docs (https://github.com/huggingface/diffusers/issues/4487) because of flash attention. Either way, the trick is about slicing a dimension and using a while loop, which is similar to flash attention at the defn level (as opposed to a custom CUDA kernel), and that didn't turn out to be beneficial.
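For readers unfamiliar with the trick, here is a minimal pure-Elixir sketch of the idea, using lists of floats instead of Nx tensors so it runs without dependencies. It slices over the query rows, which is an assumption made for illustration and may not be the dimension diffusers sliced; the module and function names are hypothetical. The point is only that each slice needs a smaller score matrix while the final result is unchanged.

```elixir
defmodule SlicedAttentionSketch do
  # Plain attention over lists of rows: softmax(q * k^T / sqrt(d)) * v.
  def attention(q, k, v) do
    scale = :math.sqrt(length(hd(q)))

    for q_row <- q do
      scores = for k_row <- k, do: dot(q_row, k_row) / scale
      weights = softmax(scores)

      # Weighted sum of the value rows.
      v
      |> Enum.zip(weights)
      |> Enum.map(fn {v_row, w} -> Enum.map(v_row, &(&1 * w)) end)
      |> Enum.zip_with(&Enum.sum/1)
    end
  end

  # Same result, but only `slice_size` query rows are processed per
  # iteration, so the intermediate score matrix is slice_size x keys
  # instead of queries x keys.
  def sliced_attention(q, k, v, slice_size) do
    q
    |> Enum.chunk_every(slice_size)
    |> Enum.flat_map(&attention(&1, k, v))
  end

  defp dot(a, b), do: a |> Enum.zip_with(b, &(&1 * &2)) |> Enum.sum()

  # Numerically stable softmax over a list of scores.
  defp softmax(xs) do
    max_x = Enum.max(xs)
    exps = Enum.map(xs, &:math.exp(&1 - max_x))
    sum = Enum.sum(exps)
    Enum.map(exps, &(&1 / sum))
  end
end

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

full = SlicedAttentionSketch.attention(q, k, v)
sliced = SlicedAttentionSketch.sliced_attention(q, k, v, 2)
IO.inspect(sliced == full, label: "sliced matches full")
```

In defn the loop over slices would be a while loop over a tensor dimension rather than Enum.chunk_every, but the trade-off is the same: lower peak memory in exchange for sequential work.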

jonatanklosko commented 4 months ago

The main part of StableDiffusion is the iterative U-Net model pass, which happens for a specified number of timesteps. DeepCache is about reusing some of the intermediate layer outputs across some diffusion iterations, that is, outputs expected to change slowly over time.

This technique is not going to reduce memory usage, because we still need to periodically do an uncached model pass. Given that we need to keep the cached intermediate results, it can increase the usage if anything. It can give a significant speedup, assuming we do a fair number of steps. For SD Turbo or LCM, where we do 1 or at most a few steps, the caching is not applicable.
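A toy sketch of the control flow, under heavy assumptions: deep_fun and shallow_fun are hypothetical stand-ins for the expensive inner U-Net blocks and the cheap outer blocks, the "latent" is a single float, and the fixed cache_interval is just one possible refresh policy. It only illustrates why the cache has to be kept alive between steps and why it does nothing for a single-step run.

```elixir
defmodule DeepCacheSketch do
  # Runs `num_steps` iterations and recomputes the deep features only
  # every `cache_interval` steps. Returns the final latent and the number
  # of full (uncached) passes that were needed.
  def run(latent, num_steps, cache_interval, deep_fun, shallow_fun) do
    {latent, _cache, full_passes} =
      Enum.reduce(0..(num_steps - 1), {latent, nil, 0}, fn step, {latent, cache, full_passes} ->
        if rem(step, cache_interval) == 0 do
          # Uncached pass: recompute the deep features and keep them around.
          deep = deep_fun.(latent)
          {shallow_fun.(latent, deep), deep, full_passes + 1}
        else
          # Cached pass: reuse the deep features from an earlier step.
          {shallow_fun.(latent, cache), cache, full_passes}
        end
      end)

    {latent, full_passes}
  end
end

# Hypothetical stand-ins, not real model code.
deep_fun = fn latent -> latent * 0.5 end
shallow_fun = fn latent, deep -> latent - 0.1 * deep end

{_latent, full_passes} = DeepCacheSketch.run(1.0, 20, 5, deep_fun, shallow_fun)
{_latent, one_step_passes} = DeepCacheSketch.run(1.0, 1, 5, deep_fun, shallow_fun)

IO.inspect({full_passes, one_step_passes}, label: "full passes (20 steps, 1 step)")
```

With 20 steps and an interval of 5, only 4 of the 20 iterations pay for the deep pass; with a single step the one pass is always uncached, so there is nothing to save.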

So this may be something we want to explore in the future, depending on SD3 and other research going forward, but I don't think it's immediately relevant for us now.