black-forest-labs / flux

Official inference repo for FLUX.1 models
Apache License 2.0
16.18k stars 1.17k forks

flux-dev with 24G GPU RAM got CUDA out of memory #120

Open movelikeriver opened 3 months ago

movelikeriver commented 3 months ago

Which GPU does flux run on?

Running on Google Cloud: NVIDIA L4, 23034 MiB

Command line:

$ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python demo_gr.py --name flux-dev --device cuda --share

Got CUDA out of memory:

$ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python demo_gr.py --name flux-dev --device cuda --share
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Init model
Loading checkpoint
Traceback (most recent call last):
  File "/home/ratc/flux/demo_gr.py", line 217, in <module>
    demo = create_demo(args.name, args.device, args.offload)
  File "/home/ratc/flux/demo_gr.py", line 163, in create_demo
    generator = FluxGenerator(model_name, device, offload)
  File "/home/ratc/flux/demo_gr.py", line 33, in __init__
    self.model, self.ae, self.t5, self.clip, self.nsfw_classifier = get_models(
  File "/home/ratc/flux/demo_gr.py", line 22, in get_models
    model = load_flow_model(name, device="cpu" if offload else device)
  File "/home/ratc/flux/src/flux/util.py", line 123, in load_flow_model
    sd = load_sft(ckpt_path, device=str(device))
  File "/opt/conda/lib/python3.10/site-packages/safetensors/torch.py", line 315, in load_file
    result[k] = f.get_tensor(k)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 126.00 MiB. GPU 0 has a total capacity of 21.95 GiB of which 40.12 MiB is free. Including non-PyTorch memory, this process has 21.90 GiB memory in use. Of the allocated memory 21.71 GiB is allocated by PyTorch, and 16.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
GallonDeng commented 3 months ago

flux-schnell also runs out of memory on a 4090 (24G); it only uses cuda:0 by default. Is there a way to use two GPUs?

leefernandes commented 3 months ago

When this happens, how do you deallocate the memory?

GallonDeng commented 2 months ago

It is OK in offload mode with only one 4090 (24G), but it runs too slowly, about 25s per image. I found a way to speed it up with two 4090 GPUs: just load the T5, CLIP, and AE models onto one GPU and the main flow model onto the other. Then it runs at about 2.3s per image, roughly 10x faster, with no offload needed.
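Roughly, the split looks like the sketch below; this is a minimal sketch assuming the loader helpers in src/flux/util.py (load_t5, load_clip, load_ae, load_flow_model) take a device argument the way load_flow_model does in the traceback above, so exact signatures may differ from the attached script:

import torch
from flux.util import load_ae, load_clip, load_flow_model, load_t5

name = "flux-dev"
text_device = torch.device("cuda:1")  # T5, CLIP and the autoencoder
flow_device = torch.device("cuda:0")  # the large flow transformer

t5 = load_t5(text_device, max_length=512)
clip = load_clip(text_device)
ae = load_ae(name, device=text_device)
model = load_flow_model(name, device=flow_device)

The conditioning tensors produced by T5/CLIP on cuda:1 then have to be moved to cuda:0 (e.g. with .to(flow_device)) before the denoising loop, and the final latents moved back to cuda:1 for decoding with the autoencoder.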

ZNan-Chen commented 2 months ago

It is OK in offload mode with only one 4090 (24G), but it runs too slowly, about 25s per image. I found a way to speed it up with two 4090 GPUs: just load the T5, CLIP, and AE models onto one GPU and the main flow model onto the other. Then it runs at about 2.3s per image, roughly 10x faster, with no offload needed.

I have tried the same method as yours and there is no significant time improvement. Can you provide your script?

GallonDeng commented 2 months ago

I changed the script demo_gr.py as attached (not clean, but it should work in offload mode with only one GPU and in no-offload mode with GPU0 and GPU1): demo_gr.txt

851695e35 commented 2 months ago

pipe.enable_sequential_cpu_offload() could help here.
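That method comes from the diffusers FluxPipeline wrapper rather than this repo's demo script; a minimal sketch, assuming the diffusers-format FLUX.1-dev weights are available from the Hub:

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
# Streams each submodule to the GPU only while it is executing, so peak
# VRAM stays well under 24 GB, at the cost of much slower generation.
pipe.enable_sequential_cpu_offload()

image = pipe(
    "a forest clearing at dawn",
    height=768,
    width=768,
    guidance_scale=3.5,
    num_inference_steps=28,
).images[0]
image.save("flux-dev-offload.png")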

luyuhua commented 2 months ago

Out of memory with flux-schnell and offload on a 3090 (24G). After the inp step finishes I run torch.cuda.empty_cache(), but it still keeps about 1000 MB of memory, so it can only load the model; when running inference with inp, it goes out of memory.
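For reference, empty_cache() can only release blocks that nothing in Python still references, and roughly 0.5-1 GB typically stays resident for the CUDA context itself, which empty_cache() cannot free. A minimal sketch of the cleanup order (a stand-in tensor replaces the demo's real inp dict):

import gc
import torch

inp = torch.randn(1, 4096, 64, device="cuda")  # stand-in for the demo's real inp dict

# Drop every Python reference first; empty_cache() cannot release blocks
# that are still reachable from Python.
del inp
gc.collect()
torch.cuda.empty_cache()

print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")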

actionless commented 1 month ago

Thanks @GallonDeng! With those changes I am able to run on 2x24 GB VRAM at resolutions from 512x512 to 800x800 (depending on seed and prompt length).

UPD: weirdly, after removing the gradio-related code and running it just from the command line, I got stable performance at a somewhat higher resolution; it seems gradio was preventing GC by holding references to some objects.