bghira / SimpleTuner

A general fine-tuning kit geared toward diffusion models.
GNU Affero General Public License v3.0

CUDA Out of Memory Error During LoRA Training on FLUX Model with RTX 4090 #786

Closed. vsakovskaya closed this issue 2 months ago.

vsakovskaya commented 2 months ago

I’m encountering a CUDA out-of-memory error while training a LoRA model using FLUX on my custom dataset. The issue occurs despite using an NVIDIA RTX 4090 with 24 GB of VRAM and 64 GB of system RAM.

Environment

  - GPU: NVIDIA RTX 4090 (24 GB VRAM)
  - System RAM: 64 GB
  - Python: 3.11 (lora-env virtualenv, Linux)

Steps to reproduce

  1. Use a custom dataset of 40 images.
  2. Run the training script using the following command: bash train.sh
  3. The error occurs during the model preparation phase when attempting to allocate additional GPU memory.

Error log

CUDA out of memory. Tried to allocate 54.00 MiB. GPU 0 has a total capacity of 23.63 GiB of which 34.56 MiB is free. Process 1108671 has 558.00 MiB memory in use. Process 1114104 has 1.03 GiB memory in use. Including non-PyTorch memory, this process has 21.22 GiB memory in use. Of the allocated memory 20.73 GiB is allocated by PyTorch, and 35.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
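
As a rough sanity check of the numbers above (a sketch only; the ~12B parameter count for the FLUX.1-dev transformer is taken from its model card, not from this log), the bf16 transformer weights alone nearly fill the card, which lines up with the 20.73 GiB reported as allocated by PyTorch:

# Back-of-the-envelope sketch, assuming ~12B transformer parameters in bf16 (2 bytes each).
# Weights alone come to roughly 22 GiB, before optimizer state, activations, the VAE,
# or the ~1.6 GiB held by the two other processes listed in the error are counted.
params = 12e9
bytes_per_param = 2
print(f"{params * bytes_per_param / 2**30:.1f} GiB")  # ~22.4 GiB on a 23.63 GiB device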

Additional information

Please see the attached config/config.env below:

export MODEL_TYPE='lora'

export STABLE_DIFFUSION_3=false
export PIXART_SIGMA=false
export STABLE_DIFFUSION_LEGACY=false
export KOLORS=false

export FLUX=true
export FLUX_GUIDANCE_VALUE=1.0
export FLUX_LORA_TARGET=all # options: 'all+ffs', 'all', 'context', 'mmdit', 'ai-toolkit'

export CONTROLNET=false
export USE_DORA=false

export RESUME_CHECKPOINT="latest"

export CHECKPOINTING_STEPS=150
export CHECKPOINTING_LIMIT=2

export LEARNING_RATE=1e-3 #@param {type:"number"}

export MODEL_NAME="black-forest-labs/FLUX.1-dev"
export DEBUG_EXTRA_ARGS="--report_to=wandb"
export TRACKER_PROJECT_NAME="${MODEL_TYPE}-training"
export TRACKER_RUN_NAME="simpletuner-sdxl"

export MAX_NUM_STEPS=30000
export NUM_EPOCHS=0

export DATALOADER_CONFIG="config/multidatabackend.json"
export OUTPUT_DIR="output/models"

export PUSH_TO_HUB="false"
export HUB_MODEL_NAME=$TRACKER_PROJECT_NAME

export RESOLUTION=128
export RESOLUTION_TYPE="pixel"

export MINIMUM_RESOLUTION=$RESOLUTION

export VALIDATION_PROMPT="ethnographic photography of teddy bear at a picnic"
export VALIDATION_GUIDANCE=7.5

export VALIDATION_GUIDANCE_RESCALE=0.0
export VALIDATION_GUIDANCE_REAL=1.0

export VALIDATION_NO_CFG_UNTIL_TIMESTEP=2

export VALIDATION_STEPS=100
export VALIDATION_NUM_INFERENCE_STEPS=30
export VALIDATION_NEGATIVE_PROMPT="blurry, cropped, ugly"
export VALIDATION_SEED=42
export VALIDATION_RESOLUTION=$RESOLUTION

export TRAIN_BATCH_SIZE=1
export GRADIENT_ACCUMULATION_STEPS=1
export VAE_BATCH_SIZE=4

export LR_SCHEDULE="polynomial"
export LR_WARMUP_STEPS=1000

export CAPTION_DROPOUT_PROBABILITY=0.1
export METADATA_UPDATE_INTERVAL=65

export MAX_WORKERS=32
export READ_BATCH_SIZE=25
export WRITE_BATCH_SIZE=64
export IMAGE_PROCESSING_BATCH_SIZE=32
export AWS_MAX_POOL_CONNECTIONS=128
export TORCH_NUM_THREADS=8

export DELETE_ERRORED_IMAGES=0
export DELETE_SMALL_IMAGES=0

export TRAINING_SCHEDULER_TIMESTEP_SPACING="trailing"
export INFERENCE_SCHEDULER_TIMESTEP_SPACING="trailing"
export MIN_SNR_GAMMA=5
export USE_XFORMERS=false
export USE_GRADIENT_CHECKPOINTING=true

export ALLOW_TF32=true
export OPTIMIZER="adamw_bf16"

export USE_EMA=false
export EMA_DECAY=0.999

export TRAINER_EXTRA_ARGS="--lora_rank=4 --validation_num_inference_steps=28 --accelerator_cache_clear_interval=500"

export TRAINING_SEED=42

export MIXED_PRECISION="bf16"                # Might not be supported on all GPUs. fp32 will be needed for others.
export PURE_BF16=true

export TRAINING_NUM_PROCESSES=1
export TRAINING_NUM_MACHINES=1
export ACCELERATE_EXTRA_ARGS=""                          # --multi_gpu or other similar flags for huggingface accelerate

export TRAINING_DYNAMO_BACKEND='no'                # or 'no' if you want to disable torch compile in case of performance issues or lack of support (eg. AMD)
export TOKENIZERS_PARALLELISM=false

Full log

SimpleTuner$ bash train.sh
/home/user/stable_diffusion/lora-env/lib/python3.11/site-packages/nvidia/nvjitlink/lib
2024-08-16 17:22:00,466 [WARNING] (ArgsParser) The VAE model madebyollin/sdxl-vae-fp16-fix is not compatible. Please use a compatible VAE to eliminate this warning. The baked-in VAE will be used, instead.
2024-08-16 17:22:00,466 [INFO] (ArgsParser) VAE Model: black-forest-labs/FLUX.1-dev
2024-08-16 17:22:00,466 [INFO] (ArgsParser) Default VAE Cache location: 
2024-08-16 17:22:00,466 [INFO] (ArgsParser) Text Cache location: cache
2024-08-16 17:22:00,466 [WARNING] (ArgsParser) Updating T5 XXL tokeniser max length to 512 for Flux.
2024-08-16 17:22:00,466 [WARNING] (ArgsParser) Flux Dev expects around 28 or fewer inference steps. Consider limiting --validation_num_inference_steps to 28.
2024-08-16 17:22:00,828 [INFO] (__main__) Logged into Hugging Face Hub as 'stevejobss'
2024-08-16 17:22:00,829 [INFO] (__main__) Enabling tf32 precision boost for NVIDIA devices due to --allow_tf32.
2024-08-16 17:22:00,829 [INFO] (__main__) Load tokenizers
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
2024-08-16 17:22:02,262 [INFO] (helpers.training.text_encoding) Loading OpenAI CLIP-L text encoder from black-forest-labs/FLUX.1-dev/text_encoder..
2024-08-16 17:22:02,762 [INFO] (helpers.training.text_encoding) Loading T5 XXL v1.1 text encoder from black-forest-labs/FLUX.1-dev/text_encoder_2..
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 6009.03it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.84it/s]
2024-08-16 17:22:06,492 [INFO] (__main__) Load VAE: black-forest-labs/FLUX.1-dev
2024-08-16 17:22:07,003 [INFO] (__main__) Moving text encoder to GPU.
2024-08-16 17:22:07,110 [INFO] (__main__) Moving text encoder 2 to GPU.
2024-08-16 17:22:08,391 [INFO] (__main__) Loading VAE onto accelerator, converting from torch.float32 to torch.bfloat16
2024-08-16 17:22:08,463 [INFO] (DataBackendFactory) Loading data backend config from config/multidatabackend.json
2024-08-16 17:22:08,463 [INFO] (DataBackendFactory) Configuring text embed backend: text-embeds
Loading pipeline components...:   0%|                                                                                      | 0/5 [00:00<?, ?it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1023.10it/s]
2024-08-16 17:22:08,933 [INFO] (TextEmbeddingCache) (Rank: 0) (id=text-embeds) Listing all text embed cache entries
2024-08-16 17:22:08,934 [INFO] (DataBackendFactory) Pre-computing null embedding
2024-08-16 17:22:13,935 [INFO] (DataBackendFactory) Completed loading text embed services.
2024-08-16 17:22:13,935 [INFO] (DataBackendFactory) Configuring data backend: archviz
2024-08-16 17:22:13,935 [INFO] (DataBackendFactory) (id=archviz) Loading bucket manager.
2024-08-16 17:22:13,937 [INFO] (JsonMetadataBackend) Checking for cache file: /home/user/stable_diffusion/archviz_dataset/aspect_ratio_bucket_indices.json
2024-08-16 17:22:13,937 [INFO] (JsonMetadataBackend) Pulling cache file from storage
2024-08-16 17:22:13,937 [INFO] (DataBackendFactory) (id=archviz) Refreshing aspect buckets on main process.
2024-08-16 17:22:13,937 [INFO] (BaseMetadataBackend) Discovering new files...
2024-08-16 17:22:13,943 [INFO] (BaseMetadataBackend) Compressed 52 existing files from 1.
2024-08-16 17:22:13,943 [INFO] (BaseMetadataBackend) No new files discovered. Doing nothing.
2024-08-16 17:22:13,943 [INFO] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 52, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-08-16 17:22:13,943 [WARNING] (DataBackendFactory) Key crop_aspect_buckets not found in the current backend config, using the existing value 'None'.
2024-08-16 17:22:13,943 [WARNING] (DataBackendFactory) Key disable_validation not found in the current backend config, using the existing value 'False'.
2024-08-16 17:22:13,943 [WARNING] (DataBackendFactory) Key config_version not found in the current backend config, using the existing value '1'.
2024-08-16 17:22:13,943 [WARNING] (DataBackendFactory) Key hash_filenames not found in the current backend config, using the existing value 'True'.
2024-08-16 17:22:13,943 [INFO] (DataBackendFactory) Configured backend: {'id': 'archviz', 'config': {'crop': True, 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 512, 'resolution_type': 'pixel', 'caption_strategy': 'filename', 'instance_data_dir': '/home/user/stable_diffusion/archviz_dataset', 'maximum_image_size': 512, 'target_downsample_size': 512, 'config_version': 1, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7db9f4e47350>, 'instance_data_dir': '/home/user/stable_diffusion/archviz_dataset', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7db9ee178690>}
(Rank: 0)  | Bucket     | Image Count (per-GPU)
------------------------------
(Rank: 0)  | 1.0        | 52          
2024-08-16 17:22:13,944 [INFO] (DataBackendFactory) (id=archviz) Collecting captions.
2024-08-16 17:22:13,944 [INFO] (DataBackendFactory) (id=archviz) Initialise text embed pre-computation using the filename caption strategy. We have 52 captions to process.
2024-08-16 17:22:13,947 [INFO] (DataBackendFactory) (id=archviz) Completed processing 52 captions.
2024-08-16 17:22:13,947 [INFO] (DataBackendFactory) (id=archviz) Creating VAE latent cache.
2024-08-16 17:22:13,948 [INFO] (DataBackendFactory) (id=archviz) Discovering cache objects..
2024-08-16 17:22:13,951 [INFO] (DataBackendFactory) Configured backend: {'id': 'archviz', 'config': {'crop': True, 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 512, 'resolution_type': 'pixel', 'caption_strategy': 'filename', 'instance_data_dir': '/home/user/stable_diffusion/archviz_dataset', 'maximum_image_size': 512, 'target_downsample_size': 512, 'config_version': 1, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7db9f4e47350>, 'instance_data_dir': '/home/user/stable_diffusion/archviz_dataset', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7db9ee178690>, 'train_dataset': <helpers.multiaspect.dataset.MultiAspectDataset object at 0x7db9f4c6f350>, 'sampler': <helpers.multiaspect.sampler.MultiAspectSampler object at 0x7db9ef5f0a50>, 'train_dataloader': <torch.utils.data.dataloader.DataLoader object at 0x7db9edb86fd0>, 'text_embed_cache': <helpers.caching.text_embeds.TextEmbeddingCache object at 0x7db9ef528690>, 'vaecache': <helpers.caching.vae.VAECache object at 0x7db9edc92c90>}
2024-08-16 17:22:14,641 [INFO] (validation) Precomputing the negative prompt embed for validations.
2024-08-16 17:22:14,705 [INFO] (__main__) Unloading text encoders, as they are not being trained.
2024-08-16 17:22:19,193 [INFO] (__main__) After nuking text encoders from orbit, we freed 9.1 GB of VRAM. The real memories were the friends we trained a model on along the way.
Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 27413.75it/s]
2024-08-16 17:22:20,340 [INFO] (__main__) Using LoRA training mode (rank=4)
2024-08-16 17:22:20,478 [INFO] (__main__) Collected the following data backends: ['text-embeds', 'archviz']
2024-08-16 17:22:20,478 [INFO] (__main__) Loading polynomial learning rate scheduler with 1000 warmup steps
2024-08-16 17:22:20,485 [INFO] (__main__) Learning rate: 0.001
2024-08-16 17:22:20,485 [INFO] (helpers.training.optimizer_param) Using bf16 AdamW optimizer with stochastic rounding.
2024-08-16 17:22:20,490 [INFO] (__main__) Optimizer arguments, weight_decay=0.01 eps=1e-08, extra_arguments={'weight_decay': 0.01, 'eps': 1e-08, 'betas': (0.9, 0.999), 'lr': 0.001}
2024-08-16 17:22:20,490 [INFO] (__main__) Loading polynomial learning rate scheduler with 1000 warmup steps
2024-08-16 17:22:20,490 [INFO] (__main__) Using Polynomial learning rate scheduler with last epoch -2.
2024-08-16 17:22:20,493 [INFO] (SaveHookManager) Denoiser class set to: FluxTransformer2DModel.
2024-08-16 17:22:20,493 [INFO] (SaveHookManager) Pipeline class set to: FluxPipeline.
2024-08-16 17:22:20,493 [INFO] (__main__) Loading our accelerator...
CUDA out of memory. Tried to allocate 54.00 MiB. GPU 0 has a total capacity of 23.63 GiB of which 34.56 MiB is free. Process 1108671 has 558.00 MiB memory in use. Process 1114104 has 1.03 GiB memory in use. Including non-PyTorch memory, this process has 21.22 GiB memory in use. Of the allocated memory 20.73 GiB is allocated by PyTorch, and 35.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/home/user/stable_diffusion/SimpleTuner/train.py", line 2315, in <module>
    main()
  File "/home/user/stable_diffusion/SimpleTuner/train.py", line 923, in main
    results = accelerator.prepare(
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/user/stable_diffusion/lora-env/lib/python3.11/site-packages/accelerate/accelerator.py", line 1311, in prepare
    result = tuple(
             ^^^^^^
  File "/home/user/stable_diffusion/lora-env/lib/python3.11/site-packages/accelerate/accelerator.py", line 1312, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/stable_diffusion/lora-env/lib/python3.11/site-packages/accelerate/accelerator.py", line 1188, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/stable_diffusion/lora-env/lib/python3.11/site-packages/accelerate/accelerator.py", line 1435, in prepare_model
    model = model.to(self.device)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/stable_diffusion/lora-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1174, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/user/stable_diffusion/lora-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/home/user/stable_diffusion/lora-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/home/user/stable_diffusion/lora-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/user/stable_diffusion/lora-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/home/user/stable_diffusion/lora-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160, in convert
    return t.to(
           ^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.00 MiB. GPU 0 has a total capacity of 23.63 GiB of which 34.56 MiB is free. Process 1108671 has 558.00 MiB memory in use. Process 1114104 has 1.03 GiB memory in use. Including non-PyTorch memory, this process has 21.22 GiB memory in use. Of the allocated memory 20.73 GiB is allocated by PyTorch, and 35.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
bghira commented 2 months ago

You can try #785 (work in progress, results not guaranteed yet) for a further reduction in memory use, but ultimately, without that patch, this likely requires fp8-quanto.
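
For readers hitting the same wall, here is a minimal sketch of what the fp8-quanto route looks like, assuming optimum-quanto and diffusers are installed; it illustrates the memory idea and is not SimpleTuner's actual integration:

# Sketch only: quantize the Flux transformer's weights to float8 with optimum-quanto
# before moving it to the GPU, roughly halving the ~22 GiB bf16 footprint so the
# LoRA parameters, gradients, and activations have room to fit.
import torch
from diffusers import FluxTransformer2DModel
from optimum.quanto import quantize, freeze, qfloat8

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize(transformer, weights=qfloat8)  # mark Linear layers for float8 weight quantization
freeze(transformer)                     # convert the weights in place to their quantized form
transformer.to("cuda")                  # a much smaller model now moves onto the 24 GiB card

Whether the trainer can consume a pre-quantized transformer depends on the patch referenced above, so treat this as a memory estimate rather than a drop-in fix.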

vsakovskaya commented 2 months ago

@bghira thank you! Unfortunately I still get the same error, but now it happens when moving the text encoder to the device:

SimpleTuner/train.py", line 369, in main
    text_encoder_2.to(accelerator.device, dtype=weight_dtype)
bghira commented 2 months ago

I'll need more context; the text encoders only consume about 9 GB of VRAM.
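
If it helps supply that context, here is a small diagnostic sketch (standard PyTorch calls; where to place it inside train.py is an assumption) that reports what is resident on the GPU right before text_encoder_2 is moved to the device:

# Diagnostic sketch, not part of SimpleTuner: compare the device-wide view (which
# includes the other processes from the error message) with this process's own usage.
import torch

free, total = torch.cuda.mem_get_info(0)    # device-wide free/total bytes, all processes
allocated = torch.cuda.memory_allocated(0)  # tensors allocated by this process
reserved = torch.cuda.memory_reserved(0)    # this process's caching-allocator pool
print(
    f"free {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB, "
    f"allocated {allocated / 2**30:.2f} GiB, reserved {reserved / 2**30:.2f} GiB"
)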

bghira commented 2 months ago

Unfortunately the new optimisers didn't do a whole lot to fix the problem, so your best bet is to get quanto going.