black-forest-labs / flux

Official inference repo for FLUX.1 models

CUDA out of memory PROBLEM SOLUTION #185

Open VadimPoliakov opened 4 days ago

VadimPoliakov commented 4 days ago

The reason for this issue is that the models are really big, more than 60GB in total, so diffusers tries to put all of them into GPU VRAM at once. There are a couple of ways to fix it.

The first one is to add this line of code to your script:

pipe.enable_sequential_cpu_offload()

You will now be able to start your scripts, but it will be quite slow.
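
For context, a minimal sketch of where that line goes (assuming the standard Flux.1-dev pipeline from diffusers; the prompt and output file name are just placeholders):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
# Offload submodules to CPU and move each one to the GPU only while it runs.
# Peak VRAM usage stays low, but inference gets much slower.
pipe.enable_sequential_cpu_offload()

image = pipe("A mystic cat with a sign that says hello world!", num_inference_steps=50).images[0]
image.save("flux-offloaded.png")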

The second way is to quantize your models. Here are code examples for different use cases with different models:

# This one is for generating images with Flux.1-dev
import torch
from diffusers import FluxTransformer2DModel, FluxPipeline

model_id = "black-forest-labs/FLUX.1-dev"
# Pre-quantized NF4 checkpoint of the Flux transformer, the largest
# component of the pipeline, so quantizing it saves the most VRAM.
nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.bfloat16)
print(model_nf4.dtype)
print(model_nf4.config.quantization_config)

# Build the full pipeline around the quantized transformer; model CPU
# offload moves whole components off the GPU when they are not in use.
pipe = FluxPipeline.from_pretrained(model_id, transformer=model_nf4, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = "A mystic cat with a sign that says hello world!"
image = pipe(prompt, guidance_scale=3.5, num_inference_steps=50, generator=torch.manual_seed(0)).images[0]
image.save("flux-nf4-dev-loaded.png")

# This one is for upscaling images with jasperai/Flux.1-dev-Controlnet-Upscaler
import torch
from diffusers.utils import load_image
from diffusers import FluxControlNetModel, BitsAndBytesConfig, FluxTransformer2DModel
from diffusers.pipelines import FluxControlNetPipeline

# Quantize the ControlNet to 4-bit NF4 on load via bitsandbytes.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

controlnet = FluxControlNetModel.from_pretrained(
    "jasperai/Flux.1-dev-Controlnet-Upscaler",
    quantization_config=nf4_config,
)

model_id = "black-forest-labs/FLUX.1-dev"
# Same pre-quantized NF4 transformer as in the first example, loaded in
# float16 here to match the ControlNet's compute dtype.
nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.float16)

pipe = FluxControlNetPipeline.from_pretrained(
    model_id,
    transformer=model_nf4,
    torch_dtype=torch.float16,
    controlnet=controlnet
)
pipe.enable_model_cpu_offload()

# The image to upscale; the output dimensions below match this input.
control_image = load_image("image.jpg")

image = pipe(
    prompt="", 
    control_image=control_image,
    controlnet_conditioning_scale=0.6,
    num_inference_steps=28, 
    guidance_scale=3.5,
    height=control_image.size[1],
    width=control_image.size[0]
).images[0]
image.save("upscaled_img_quanted.png")

For these solutions we must say thank you to @sayakpaul

Jaid844 commented 4 days ago

Hi, @VadimPoliakov
I am using an A10 GPU with 48 GB VRAM on RunPod, which is ample for the FLUX model; it runs smoothly in a Jupyter notebook. But when deploying with FastAPI I get a CUDA out of memory error. The issue also occurs with the quantized model.
Any help would be appreciated. Thanks! cc @sayakpaul

VadimPoliakov commented 4 days ago

> Hi, @VadimPoliakov
> I am using an A10 GPU with 48 GB VRAM on RunPod, which is ample for the FLUX model; it runs smoothly in a Jupyter notebook. But when deploying with FastAPI I get a CUDA out of memory error. The issue also occurs with the quantized model. Any help would be appreciated. Thanks! cc @sayakpaul

Hi. I'm not sure, but it seems like a problem with processing more than one image at a time. Try using a queue for that.
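
If it helps, a minimal sketch of serializing requests with a lock so only one image is ever generated at a time (the endpoint name is a placeholder, and `pipe` is assumed to be the quantized pipeline loaded as in the examples above):

import threading

from fastapi import FastAPI

# pipe = ...  # the quantized FluxPipeline, loaded once at startup as shown above

app = FastAPI()
# One shared lock: requests line up here, so only one generation at a time
# holds activations in VRAM.
gpu_lock = threading.Lock()

@app.post("/generate")
def generate(prompt: str):
    with gpu_lock:
        image = pipe(prompt, guidance_scale=3.5, num_inference_steps=50).images[0]
    image.save("out.png")
    return {"status": "done"}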

Jaid844 commented 4 days ago

No, the problem is that when you stage the deployment, the API gives a CUDA out of memory error instead of starting.

VadimPoliakov commented 4 days ago

> No, the problem is that when you stage the deployment, the API gives a CUDA out of memory error instead of starting.

If you start with several workers, each worker makes diffusers load all the models into GPU VRAM again. Make a separate, single-process service (not FastAPI) that consumes a queue, and have your FastAPI service just call it.
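
A rough sketch of that split, assuming a Redis list as the queue (the queue name "flux_jobs" and the job format are made up for illustration; Celery or any other broker would work the same way):

# worker.py -- run exactly ONE copy of this process; it owns the GPU.
import json

import redis
import torch
from diffusers import FluxPipeline

r = redis.Redis()
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

while True:
    # blpop blocks until a job arrives; jobs are JSON like {"prompt": ..., "out": ...}
    _, raw = r.blpop("flux_jobs")
    job = json.loads(raw)
    image = pipe(job["prompt"], num_inference_steps=50).images[0]
    image.save(job["out"])

# api.py -- safe to run with several uvicorn workers; it never touches the GPU.
import json

import redis
from fastapi import FastAPI

app = FastAPI()
r = redis.Redis()

@app.post("/generate")
def generate(prompt: str):
    # Enqueue the job and return immediately; the worker does the GPU work.
    r.rpush("flux_jobs", json.dumps({"prompt": prompt, "out": "out.png"}))
    return {"status": "queued"}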

Jaid844 commented 3 days ago

Thanks for the help, bro!