black-forest-labs / flux

Official inference repo for FLUX.1 models
Apache License 2.0

FLUX.1-dev runs very slow on 3090 #138

Open jerrymatjila opened 2 months ago

jerrymatjila commented 2 months ago

black-forest-labs/FLUX.1-dev runs very slowly: it takes about 15 minutes to generate a 1344x768 (w×h) image. Has anyone experienced the same, or is it just me?

    import torch
    from pathlib import Path

    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(args.model, torch_dtype=torch.bfloat16)
    #pipe.enable_model_cpu_offload() #save some VRAM by offloading the model to CPU. Remove this if you have enough GPU power
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    pipe.to(torch.float16) # casting here instead of in the pipeline constructor because doing so in the constructor loads all models into CPU memory at once

    prompt = args.prompt
    image = pipe(
        prompt,
        height=args.height,
        width=args.width,
        guidance_scale=0.0,
        num_inference_steps=args.num_inference_steps,
        max_sequence_length=512,
        generator=torch.Generator("cpu").manual_seed(0)
    ).images[0]
    Path(args.output).parent.mkdir(parents=True, exist_ok=True)
    image.save(args.output)

For this run, args.num_inference_steps = 50.

hungho77 commented 2 months ago

If you have enough GPU VRAM, try commenting out this line: pipe.enable_sequential_cpu_offload()
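
If you are not sure how much headroom your GPU actually has, one way to check is to record peak CUDA memory around a single generation. A minimal sketch, assuming PyTorch with a CUDA device; the placeholder comment stands in for one pipe(...) call:

import torch

torch.cuda.reset_peak_memory_stats()

# ... run one generation here, e.g. image = pipe(prompt, ...).images[0] ...

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"peak VRAM used: {peak_gib:.1f} GiB of {total_gib:.1f} GiB")

If the peak stays well below the card's total memory with sequential offload enabled, that is a hint the slower offload path may not be needed.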

aproust08 commented 2 months ago

@jerrymatjila, I'm having the same issue with my 3090 card. Were you able to fix it? Thx

JonasLoos commented 1 month ago

24 GB of VRAM should be just enough to keep the transformer model fully in VRAM, which means you can use pipe.enable_model_cpu_offload() instead of pipe.enable_sequential_cpu_offload(). You may not even need the VAE slicing/tiling.

I.e.:

import torch
from pathlib import Path

from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(args.model, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # save some VRAM by offloading the model to CPU. Remove this if you have enough GPU power

prompt = args.prompt
image = pipe(
    prompt,
    height=args.height,
    width=args.width,
    guidance_scale=0.0,
    num_inference_steps=args.num_inference_steps,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]
Path(args.output).parent.mkdir(parents=True, exist_ok=True)
image.save(args.output)

If that still uses more VRAM than is available (check the task manager or nvidia-smi), you can look into quantizing the model.
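
For example, one option is to quantize only the transformer and keep the rest of the pipeline in bfloat16. A minimal sketch, assuming a recent diffusers release with bitsandbytes installed; the 4-bit NF4 settings are illustrative, not the only choice:

import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Quantize only the large transformer; text encoders and VAE stay in bfloat16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # still useful for the non-quantized components

4-bit weights shrink the transformer to a fraction of its bfloat16 size, which should fit comfortably on a 3090, at some cost in output fidelity.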