With three P100 16GB GPUs installed in the system, training eventually aborts with a CUDA out-of-memory exception:
(base) derp@t7910:~/fluxgym$ source env/bin/activate
(env) (base) derp@t7910:~/fluxgym$ ls
advanced.png Dockerfile install.js publish_to_hf.png sample_prompts.txt train.sh
app-launch.sh Dockerfile.cuda12.4 models README.md screenshot.png update.js
app.py env models.yaml requirements.txt sd-scripts
datasets flags.png outputs reset.js seed.gif
dataset.toml flow.gif pinokio.js sample_fields.png start.js
docker-compose.yml icon.png pinokio_meta.json sample.png torch.js
(env) (base) derp@t7910:~/fluxgym$ bash outputs/my-special-lora/train.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `3`
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
highvram is enabled / highvramが有効です
highvram is enabled / highvramが有効です
highvram is enabled / highvramが有効です
2024-10-02 16:12:05 WARNING cache_latents_to_disk is enabled, so cache_latents is also train_util.py:3936
enabled /
cache_latents_to_diskが有効なため、cache_latentsを有効にします
2024-10-02 16:12:05 WARNING cache_latents_to_disk is enabled, so cache_latents is also train_util.py:3936
enabled /
cache_latents_to_diskが有効なため、cache_latentsを有効にします
2024-10-02 16:12:05 WARNING cache_latents_to_disk is enabled, so cache_latents is also train_util.py:3936
enabled /
cache_latents_to_diskが有効なため、cache_latentsを有効にします
2024-10-02 16:12:05 INFO t5xxl_max_token_length: 512 flux_train_network.py:155
2024-10-02 16:12:05 INFO t5xxl_max_token_length: 512 flux_train_network.py:155
2024-10-02 16:12:05 INFO t5xxl_max_token_length: 512 flux_train_network.py:155
/home/derp/fluxgym/env/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
/home/derp/fluxgym/env/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
/home/derp/fluxgym/env/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
2024-10-02 16:12:06 INFO Loading dataset config from train_network.py:280
/home/derp/fluxgym/outputs/my-special-lora/dataset.toml
2024-10-02 16:12:06 INFO Loading dataset config from train_network.py:280
/home/derp/fluxgym/outputs/my-special-lora/dataset.toml
2024-10-02 16:12:06 INFO Loading dataset config from train_network.py:280
/home/derp/fluxgym/outputs/my-special-lora/dataset.toml
INFO prepare images. train_util.py:1807
INFO prepare images. train_util.py:1807
INFO prepare images. train_util.py:1807
INFO get image size from name of cache files train_util.py:1745
INFO get image size from name of cache files train_util.py:1745
INFO get image size from name of cache files train_util.py:1745
100%|████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 728.52it/s]
100%|████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 727.14it/s]
INFO set image size from cache files: 317/317 train_util.py:1752
INFO set image size from cache files: 317/317 train_util.py:1752
INFO found directory /home/derp/fluxgym/datasets/my-special-lora contains train_util.py:1754
317 image files
INFO found directory /home/derp/fluxgym/datasets/my-special-lora contains train_util.py:1754
317 image files
100%|████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 720.55it/s]
INFO set image size from cache files: 317/317 train_util.py:1752
INFO found directory /home/derp/fluxgym/datasets/my-special-lora contains train_util.py:1754
317 image files
INFO 951 train images with repeating. train_util.py:1848
INFO 951 train images with repeating. train_util.py:1848
INFO 951 train images with repeating. train_util.py:1848
INFO 0 reg images. train_util.py:1851
INFO 0 reg images. train_util.py:1851
INFO 0 reg images. train_util.py:1851
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1856
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1856
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1856
INFO [Dataset 0] config_util.py:570
batch_size: 1
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: False
[Subset 0 of Dataset 0]
image_dir: "/home/derp/fluxgym/datasets/my-special-lora"
image_count: 317
num_repeats: 3
shuffle_caption: False
keep_tokens: 1
keep_tokens_separator: 1
caption_separator: ,
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: True
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
alpha_mask: False,
is_reg: False
class_tokens: my-special-lora
caption_extension: .txt
INFO [Dataset 0] config_util.py:570
batch_size: 1
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: False
[Subset 0 of Dataset 0]
image_dir: "/home/derp/fluxgym/datasets/my-special-lora"
image_count: 317
num_repeats: 3
shuffle_caption: False
keep_tokens: 1
keep_tokens_separator: 1
caption_separator: ,
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: True
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
alpha_mask: False,
is_reg: False
class_tokens: my-special-lora
caption_extension: .txt
INFO [Dataset 0] config_util.py:570
batch_size: 1
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: False
[Subset 0 of Dataset 0]
image_dir: "/home/derp/fluxgym/datasets/my-special-lora"
image_count: 317
num_repeats: 3
shuffle_caption: False
keep_tokens: 1
keep_tokens_separator: 1
caption_separator: ,
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: True
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
alpha_mask: False,
is_reg: False
class_tokens: my-special-lora
caption_extension: .txt
INFO [Dataset 0] config_util.py:576
INFO [Dataset 0] config_util.py:576
INFO [Dataset 0] config_util.py:576
INFO loading image sizes. train_util.py:880
INFO loading image sizes. train_util.py:880
INFO loading image sizes. train_util.py:880
100%|████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 3274862.98it/s]
100%|████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 3266816.63it/s]
100%|████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 3150697.55it/s]
INFO make buckets train_util.py:886
INFO make buckets train_util.py:886
INFO make buckets train_util.py:886
INFO number of images (including repeats) / train_util.py:932
各bucketの画像枚数(繰り返し回数を含む)
INFO number of images (including repeats) / train_util.py:932
各bucketの画像枚数(繰り返し回数を含む)
INFO number of images (including repeats) / train_util.py:932
各bucketの画像枚数(繰り返し回数を含む)
INFO bucket 0: resolution (1024, 1024), count: 951 train_util.py:937
INFO bucket 0: resolution (1024, 1024), count: 951 train_util.py:937
INFO bucket 0: resolution (1024, 1024), count: 951 train_util.py:937
INFO mean ar error (without repeats): 3.6967665615141956e-05 train_util.py:942
INFO mean ar error (without repeats): 3.6967665615141956e-05 train_util.py:942
INFO mean ar error (without repeats): 3.6967665615141956e-05 train_util.py:942
INFO preparing accelerator train_network.py:345
INFO preparing accelerator train_network.py:345
INFO preparing accelerator train_network.py:345
[2024-10-02 16:12:07,046] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-02 16:12:07,046] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-02 16:12:07,046] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-02 16:12:08,650] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-02 16:12:08,650] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-02 16:12:08,650] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-10-02 16:12:08,705] [INFO] [comm.py:652:init_distributed] cdb=None
accelerator device: cuda:0
INFO Building Flux model dev flux_utils.py:45
accelerator device: cuda:2
accelerator device: cuda:1
2024-10-02 16:12:09 INFO Building Flux model dev flux_utils.py:45
2024-10-02 16:12:09 INFO Building Flux model dev flux_utils.py:45
2024-10-02 16:12:09 INFO Loading state dict from flux_utils.py:52
/home/derp/fluxgym/models/unet/flux1-dev.sft
INFO Loading state dict from flux_utils.py:52
/home/derp/fluxgym/models/unet/flux1-dev.sft
INFO Loading state dict from flux_utils.py:52
/home/derp/fluxgym/models/unet/flux1-dev.sft
INFO Loaded Flux: <All keys matched successfully> flux_utils.py:55
INFO Loaded Flux: <All keys matched successfully> flux_utils.py:55
INFO Loaded Flux: <All keys matched successfully> flux_utils.py:55
INFO Building CLIP flux_utils.py:74
INFO Building CLIP flux_utils.py:74
INFO Building CLIP flux_utils.py:74
INFO Loading state dict from flux_utils.py:167
/home/derp/fluxgym/models/clip/clip_l.safetensors
INFO Loading state dict from flux_utils.py:167
/home/derp/fluxgym/models/clip/clip_l.safetensors
INFO Loading state dict from flux_utils.py:167
/home/derp/fluxgym/models/clip/clip_l.safetensors
INFO Loaded CLIP: <All keys matched successfully> flux_utils.py:170
INFO Loaded CLIP: <All keys matched successfully> flux_utils.py:170
INFO Loaded CLIP: <All keys matched successfully> flux_utils.py:170
INFO Loading state dict from flux_utils.py:215
/home/derp/fluxgym/models/clip/t5xxl_fp16.safetensors
INFO Loading state dict from flux_utils.py:215
/home/derp/fluxgym/models/clip/t5xxl_fp16.safetensors
INFO Loading state dict from flux_utils.py:215
/home/derp/fluxgym/models/clip/t5xxl_fp16.safetensors
INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:218
INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:218
INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:218
INFO Building AutoEncoder flux_utils.py:62
INFO Building AutoEncoder flux_utils.py:62
INFO Building AutoEncoder flux_utils.py:62
INFO Loading state dict from /home/derp/fluxgym/models/vae/ae.sft flux_utils.py:66
INFO Loading state dict from /home/derp/fluxgym/models/vae/ae.sft flux_utils.py:66
INFO Loading state dict from /home/derp/fluxgym/models/vae/ae.sft flux_utils.py:66
INFO Loaded AE: <All keys matched successfully> flux_utils.py:69
import network module: networks.lora_flux
INFO Loaded AE: <All keys matched successfully> flux_utils.py:69
INFO Loaded AE: <All keys matched successfully> flux_utils.py:69
2024-10-02 16:12:10 INFO [Dataset 0] train_util.py:2328
INFO caching latents with caching strategy. train_util.py:988
INFO checking cache validity... train_util.py:998
100%|██████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 364472.14it/s]
2024-10-02 16:12:10 INFO [Dataset 0] train_util.py:2328
2024-10-02 16:12:10 INFO [Dataset 0] train_util.py:2328
INFO caching latents with caching strategy. train_util.py:988
INFO caching latents with caching strategy. train_util.py:988
INFO checking cache validity... train_util.py:998
INFO checking cache validity... train_util.py:998
100%|██████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 437366.57it/s]
100%|████████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 2474.29it/s]
2024-10-02 16:12:11 INFO no latents to cache train_util.py:1038
[rank2]:[W1002 16:12:11.735902281 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank1]:[W1002 16:12:11.739201920 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W1002 16:12:11.859788105 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
2024-10-02 16:12:11 INFO move vae and unet to cpu to save memory flux_train_network.py:208
INFO move vae and unet to cpu to save memory flux_train_network.py:208
2024-10-02 16:12:11 INFO move vae and unet to cpu to save memory flux_train_network.py:208
INFO move text encoders to gpu flux_train_network.py:216
INFO move text encoders to gpu flux_train_network.py:216
INFO move text encoders to gpu flux_train_network.py:216
2024-10-02 16:12:33 INFO [Dataset 0] train_util.py:2349
INFO caching Text Encoder outputs with caching strategy. train_util.py:1111
INFO checking cache validity... train_util.py:1117
100%|████████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 1756.30it/s]
2024-10-02 16:12:34 INFO no Text Encoder outputs to cache train_util.py:1139
INFO cache Text Encoder outputs for sample prompt: flux_train_network.py:232
/home/derp/fluxgym/outputs/my-special-lora/sample_prompts.txt
INFO cache Text Encoder outputs for prompt: my-special-lora flux_train_network.py:243
2024-10-02 16:12:34 INFO [Dataset 0] train_util.py:2349
INFO caching Text Encoder outputs with caching strategy. train_util.py:1111
2024-10-02 16:12:34 INFO [Dataset 0] train_util.py:2349
INFO checking cache validity... train_util.py:1117
0%| | 0/317 [00:00<?, ?it/s] INFO caching Text Encoder outputs with caching strategy. train_util.py:1111
INFO checking cache validity... train_util.py:1117
100%|████████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 4201.50it/s]
100%|████████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 4266.01it/s]
INFO no Text Encoder outputs to cache train_util.py:1139
INFO no Text Encoder outputs to cache train_util.py:1139
INFO cache Text Encoder outputs for sample prompt: flux_train_network.py:232
/home/derp/fluxgym/outputs/my-special-lora/sample_prompts.txt
INFO cache Text Encoder outputs for sample prompt: flux_train_network.py:232
/home/derp/fluxgym/outputs/my-special-lora/sample_prompts.txt
INFO cache Text Encoder outputs for prompt: my-special-lora flux_train_network.py:243
INFO cache Text Encoder outputs for prompt: my-special-lora flux_train_network.py:243
INFO cache Text Encoder outputs for prompt: flux_train_network.py:243
INFO cache Text Encoder outputs for prompt: flux_train_network.py:243
INFO cache Text Encoder outputs for prompt: flux_train_network.py:243
2024-10-02 16:12:36 INFO move t5XXL back to cpu flux_train_network.py:256
2024-10-02 16:12:36 INFO move t5XXL back to cpu flux_train_network.py:256
2024-10-02 16:12:36 INFO move t5XXL back to cpu flux_train_network.py:256
2024-10-02 16:12:41 INFO move vae and unet back to original device flux_train_network.py:261
INFO create LoRA network. base dim (rank): 8, alpha: 8.0 lora_flux.py:484
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora_flux.py:485
INFO train all blocks only lora_flux.py:495
INFO create LoRA for Text Encoder 1: lora_flux.py:576
2024-10-02 16:12:41 INFO move vae and unet back to original device flux_train_network.py:261
INFO create LoRA network. base dim (rank): 8, alpha: 8.0 lora_flux.py:484
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora_flux.py:485
INFO train all blocks only lora_flux.py:495
INFO create LoRA for Text Encoder 1: lora_flux.py:576
INFO create LoRA for Text Encoder 1: 72 modules. lora_flux.py:579
INFO create LoRA for Text Encoder 1: 72 modules. lora_flux.py:579
2024-10-02 16:12:41 INFO move vae and unet back to original device flux_train_network.py:261
INFO create LoRA network. base dim (rank): 8, alpha: 8.0 lora_flux.py:484
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora_flux.py:485
INFO train all blocks only lora_flux.py:495
INFO create LoRA for Text Encoder 1: lora_flux.py:576
INFO create LoRA for Text Encoder 1: 72 modules. lora_flux.py:579
INFO create LoRA for FLUX all blocks: 304 modules. lora_flux.py:593
INFO enable LoRA for text encoder: 72 modules lora_flux.py:736
INFO enable LoRA for U-Net: 304 modules lora_flux.py:741
FLUX: Gradient checkpointing enabled. CPU offload: False
INFO Text Encoder 1 (CLIP-L): 72 modules, LR 0.0001 lora_flux.py:843
INFO use Adafactor optimizer | {'relative_step': False, 'scale_parameter': train_util.py:4541
False, 'warmup_init': False}
INFO create LoRA for FLUX all blocks: 304 modules. lora_flux.py:593
INFO enable LoRA for text encoder: 72 modules lora_flux.py:736
INFO enable LoRA for U-Net: 304 modules lora_flux.py:741
FLUX: Gradient checkpointing enabled. CPU offload: False
prepare optimizer, data loader etc.
INFO Text Encoder 1 (CLIP-L): 72 modules, LR 0.0001 lora_flux.py:843
INFO use Adafactor optimizer | {'relative_step': False, 'scale_parameter': train_util.py:4541
False, 'warmup_init': False}
override steps. steps for 40 epochs is / 指定エポックまでのステップ数: 12680
enable fp8 training for U-Net.
enable fp8 training for Text Encoder.
2024-10-02 16:12:42 INFO create LoRA for FLUX all blocks: 304 modules. lora_flux.py:593
INFO enable LoRA for text encoder: 72 modules lora_flux.py:736
INFO enable LoRA for U-Net: 304 modules lora_flux.py:741
FLUX: Gradient checkpointing enabled. CPU offload: False
INFO Text Encoder 1 (CLIP-L): 72 modules, LR 0.0001 lora_flux.py:843
INFO use Adafactor optimizer | {'relative_step': False, 'scale_parameter': train_util.py:4541
False, 'warmup_init': False}
2024-10-02 16:13:45 INFO prepare CLIP-L for fp8: set to torch.float8_e4m3fn, set flux_train_network.py:464
embeddings to torch.bfloat16
[2024-10-02 16:13:45,963] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 3
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/derp/fluxgym/sd-scripts/flux_train_network.py", line 519, in <module>
[rank1]: trainer.train(args)
[rank1]: File "/home/derp/fluxgym/sd-scripts/train_network.py", line 590, in train
[rank1]: ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank1]: result = self._prepare_deepspeed(*args)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
[rank1]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank1]: engine = DeepSpeedEngine(args=args,
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 269, in __init__
[rank1]: self._configure_distributed_model(model)
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1149, in _configure_distributed_model
[rank1]: self.module.bfloat16()
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in bfloat16
[rank1]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: [Previous line repeated 3 more times]
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank1]: param_applied = fn(param)
[rank1]: ^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in <lambda>
[rank1]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank1]: ^^^^^^^^^^^^
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 1 has a total capacity of 15.89 GiB of which 67.12 MiB is free. Including non-PyTorch memory, this process has 15.82 GiB memory in use. Of the allocated memory 15.18 GiB is allocated by PyTorch, and 291.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2024-10-02 16:13:47 INFO prepare CLIP-L for fp8: set to torch.float8_e4m3fn, set flux_train_network.py:464
embeddings to torch.bfloat16
2024-10-02 16:13:47 INFO prepare CLIP-L for fp8: set to torch.float8_e4m3fn, set flux_train_network.py:464
embeddings to torch.bfloat16
[2024-10-02 16:13:47,178] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.15.1, git-hash=unknown, git-branch=unknown
[2024-10-02 16:13:47,179] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 3
[2024-10-02 16:13:47,181] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 3
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/derp/fluxgym/sd-scripts/flux_train_network.py", line 519, in <module>
[rank0]: trainer.train(args)
[rank0]: File "/home/derp/fluxgym/sd-scripts/train_network.py", line 590, in train
[rank0]: ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank0]: result = self._prepare_deepspeed(*args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
[rank0]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 269, in __init__
[rank0]: self._configure_distributed_model(model)
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1149, in _configure_distributed_model
[rank0]: self.module.bfloat16()
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in bfloat16
[rank0]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: [Previous line repeated 3 more times]
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank0]: param_applied = fn(param)
[rank0]: ^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in <lambda>
[rank0]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank0]: ^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacity of 15.89 GiB of which 67.12 MiB is free. Including non-PyTorch memory, this process has 15.82 GiB memory in use. Of the allocated memory 15.18 GiB is allocated by PyTorch, and 291.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/derp/fluxgym/sd-scripts/flux_train_network.py", line 519, in <module>
[rank2]: trainer.train(args)
[rank2]: File "/home/derp/fluxgym/sd-scripts/train_network.py", line 590, in train
[rank2]: ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank2]: result = self._prepare_deepspeed(*args)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
[rank2]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank2]: engine = DeepSpeedEngine(args=args,
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 269, in __init__
[rank2]: self._configure_distributed_model(model)
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1149, in _configure_distributed_model
[rank2]: self.module.bfloat16()
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in bfloat16
[rank2]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: [Previous line repeated 3 more times]
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank2]: param_applied = fn(param)
[rank2]: ^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in <lambda>
[rank2]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank2]: ^^^^^^^^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 2 has a total capacity of 15.89 GiB of which 67.12 MiB is free. Including non-PyTorch memory, this process has 15.82 GiB memory in use. Of the allocated memory 15.18 GiB is allocated by PyTorch, and 291.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W1002 16:13:48.643604082 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W1002 16:13:48.300000 1358 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1424 closing signal SIGTERM
W1002 16:13:48.303000 1358 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1426 closing signal SIGTERM
E1002 16:13:48.619000 1358 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 1425) of binary: /home/derp/fluxgym/env/bin/python
Traceback (most recent call last):
File "/home/derp/fluxgym/env/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sd-scripts/flux_train_network.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-02_16:13:48
host : t7910.lab.local
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1425)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
The root cause is visible in the traceback: during `deepspeed.initialize`, `_configure_distributed_model` calls `self.module.bfloat16()`, so each of the three ranks converts the full FLUX model on its own card, and every P100 is already at ~15.8 GiB when the 72 MiB allocation fails. Notably, if a fourth GPU, an RTX A4000, is installed, there is no OOM error!
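For reference, the launcher's own startup hint ("If this was unintended please pass in `--num_processes=1`") points at one possible workaround: run a single process on a single GPU, so only one rank has to hold the full model. A minimal sketch, assuming the same virtualenv and paths as above and keeping the remaining train.sh arguments unchanged (untested here):

(env) $ CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --num_machines=1 \
        sd-scripts/flux_train_network.py <remaining arguments from train.sh>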
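The OOM message itself also suggests the expandable-segments allocator. It may or may not free enough of the 291.97 MiB that PyTorch reports as reserved but unallocated, but it is cheap to try:

(env) $ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True bash outputs/my-special-lora/train.sh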