With three P100 16GB GPUs installed in the system, training eventually aborts with a CUDA out-of-memory exception:
(base) derp@t7910:~/fluxgym$ source env/bin/activate
(env) (base) derp@t7910:~/fluxgym$ ls
advanced.png Dockerfile install.js publish_to_hf.png sample_prompts.txt train.sh
app-launch.sh Dockerfile.cuda12.4 models README.md screenshot.png update.js
app.py env models.yaml requirements.txt sd-scripts
datasets flags.png outputs reset.js seed.gif
dataset.toml flow.gif pinokio.js sample_fields.png start.js
docker-compose.yml icon.png pinokio_meta.json sample.png torch.js
(env) (base) derp@t7910:~/fluxgym$ bash outputs/my-special-lora/train.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `3`
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
highvram is enabled / highvramが有効です
highvram is enabled / highvramが有効です
highvram is enabled / highvramが有効です
2024-10-02 16:12:05 WARNING cache_latents_to_disk is enabled, so cache_latents is also train_util.py:3936
enabled /
cache_latents_to_diskが有効なため、cache_latentsを有効にします
2024-10-02 16:12:05 WARNING cache_latents_to_disk is enabled, so cache_latents is also train_util.py:3936
enabled /
cache_latents_to_diskが有効なため、cache_latentsを有効にします
2024-10-02 16:12:05 WARNING cache_latents_to_disk is enabled, so cache_latents is also train_util.py:3936
enabled /
cache_latents_to_diskが有効なため、cache_latentsを有効にします
2024-10-02 16:12:05 INFO t5xxl_max_token_length: 512 flux_train_network.py:155
2024-10-02 16:12:05 INFO t5xxl_max_token_length: 512 flux_train_network.py:155
2024-10-02 16:12:05 INFO t5xxl_max_token_length: 512 flux_train_network.py:155
/home/derp/fluxgym/env/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
/home/derp/fluxgym/env/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
/home/derp/fluxgym/env/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
2024-10-02 16:12:06 INFO Loading dataset config from train_network.py:280
/home/derp/fluxgym/outputs/my-special-lora/dataset.toml
2024-10-02 16:12:06 INFO Loading dataset config from train_network.py:280
/home/derp/fluxgym/outputs/my-special-lora/dataset.toml
2024-10-02 16:12:06 INFO Loading dataset config from train_network.py:280
/home/derp/fluxgym/outputs/my-special-lora/dataset.toml
INFO prepare images. train_util.py:1807
INFO prepare images. train_util.py:1807
INFO prepare images. train_util.py:1807
INFO get image size from name of cache files train_util.py:1745
INFO get image size from name of cache files train_util.py:1745
INFO get image size from name of cache files train_util.py:1745
100%|████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 728.52it/s]
100%|████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 727.14it/s]
INFO set image size from cache files: 317/317 train_util.py:1752
INFO set image size from cache files: 317/317 train_util.py:1752
INFO found directory /home/derp/fluxgym/datasets/my-special-lora contains train_util.py:1754
317 image files
INFO found directory /home/derp/fluxgym/datasets/my-special-lora contains train_util.py:1754
317 image files
100%|████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 720.55it/s]
INFO set image size from cache files: 317/317 train_util.py:1752
INFO found directory /home/derp/fluxgym/datasets/my-special-lora contains train_util.py:1754
317 image files
INFO 951 train images with repeating. train_util.py:1848
INFO 951 train images with repeating. train_util.py:1848
INFO 951 train images with repeating. train_util.py:1848
INFO 0 reg images. train_util.py:1851
INFO 0 reg images. train_util.py:1851
INFO 0 reg images. train_util.py:1851
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1856
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1856
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1856
INFO [Dataset 0] config_util.py:570
batch_size: 1
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: False
[Subset 0 of Dataset 0]
image_dir: "/home/derp/fluxgym/datasets/my-special-lora"
image_count: 317
num_repeats: 3
shuffle_caption: False
keep_tokens: 1
keep_tokens_separator: 1
caption_separator: ,
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: True
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
alpha_mask: False,
is_reg: False
class_tokens: my-special-lora
caption_extension: .txt
INFO [Dataset 0] config_util.py:570
batch_size: 1
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: False
[Subset 0 of Dataset 0]
image_dir: "/home/derp/fluxgym/datasets/my-special-lora"
image_count: 317
num_repeats: 3
shuffle_caption: False
keep_tokens: 1
keep_tokens_separator: 1
caption_separator: ,
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: True
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
alpha_mask: False,
is_reg: False
class_tokens: my-special-lora
caption_extension: .txt
INFO [Dataset 0] config_util.py:570
batch_size: 1
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: False
[Subset 0 of Dataset 0]
image_dir: "/home/derp/fluxgym/datasets/my-special-lora"
image_count: 317
num_repeats: 3
shuffle_caption: False
keep_tokens: 1
keep_tokens_separator: 1
caption_separator: ,
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: True
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
alpha_mask: False,
is_reg: False
class_tokens: my-special-lora
caption_extension: .txt
INFO [Dataset 0] config_util.py:576
INFO [Dataset 0] config_util.py:576
INFO [Dataset 0] config_util.py:576
INFO loading image sizes. train_util.py:880
INFO loading image sizes. train_util.py:880
INFO loading image sizes. train_util.py:880
100%|████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 3274862.98it/s]
100%|████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 3266816.63it/s]
100%|████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 3150697.55it/s]
INFO make buckets train_util.py:886
INFO make buckets train_util.py:886
INFO make buckets train_util.py:886
INFO number of images (including repeats) / train_util.py:932
各bucketの画像枚数(繰り返し回数を含む)
INFO number of images (including repeats) / train_util.py:932
各bucketの画像枚数(繰り返し回数を含む)
INFO number of images (including repeats) / train_util.py:932
各bucketの画像枚数(繰り返し回数を含む)
INFO bucket 0: resolution (1024, 1024), count: 951 train_util.py:937
INFO bucket 0: resolution (1024, 1024), count: 951 train_util.py:937
INFO bucket 0: resolution (1024, 1024), count: 951 train_util.py:937
INFO mean ar error (without repeats): 3.6967665615141956e-05 train_util.py:942
INFO mean ar error (without repeats): 3.6967665615141956e-05 train_util.py:942
INFO mean ar error (without repeats): 3.6967665615141956e-05 train_util.py:942
INFO preparing accelerator train_network.py:345
INFO preparing accelerator train_network.py:345
INFO preparing accelerator train_network.py:345
[2024-10-02 16:12:07,046] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-02 16:12:07,046] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-02 16:12:07,046] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-02 16:12:08,650] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-02 16:12:08,650] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-02 16:12:08,650] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-10-02 16:12:08,705] [INFO] [comm.py:652:init_distributed] cdb=None
accelerator device: cuda:0
INFO Building Flux model dev flux_utils.py:45
accelerator device: cuda:2
accelerator device: cuda:1
2024-10-02 16:12:09 INFO Building Flux model dev flux_utils.py:45
2024-10-02 16:12:09 INFO Building Flux model dev flux_utils.py:45
2024-10-02 16:12:09 INFO Loading state dict from flux_utils.py:52
/home/derp/fluxgym/models/unet/flux1-dev.sft
INFO Loading state dict from flux_utils.py:52
/home/derp/fluxgym/models/unet/flux1-dev.sft
INFO Loading state dict from flux_utils.py:52
/home/derp/fluxgym/models/unet/flux1-dev.sft
INFO Loaded Flux: <All keys matched successfully> flux_utils.py:55
INFO Loaded Flux: <All keys matched successfully> flux_utils.py:55
INFO Loaded Flux: <All keys matched successfully> flux_utils.py:55
INFO Building CLIP flux_utils.py:74
INFO Building CLIP flux_utils.py:74
INFO Building CLIP flux_utils.py:74
INFO Loading state dict from flux_utils.py:167
/home/derp/fluxgym/models/clip/clip_l.safetensors
INFO Loading state dict from flux_utils.py:167
/home/derp/fluxgym/models/clip/clip_l.safetensors
INFO Loading state dict from flux_utils.py:167
/home/derp/fluxgym/models/clip/clip_l.safetensors
INFO Loaded CLIP: <All keys matched successfully> flux_utils.py:170
INFO Loaded CLIP: <All keys matched successfully> flux_utils.py:170
INFO Loaded CLIP: <All keys matched successfully> flux_utils.py:170
INFO Loading state dict from flux_utils.py:215
/home/derp/fluxgym/models/clip/t5xxl_fp16.safetensors
INFO Loading state dict from flux_utils.py:215
/home/derp/fluxgym/models/clip/t5xxl_fp16.safetensors
INFO Loading state dict from flux_utils.py:215
/home/derp/fluxgym/models/clip/t5xxl_fp16.safetensors
INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:218
INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:218
INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:218
INFO Building AutoEncoder flux_utils.py:62
INFO Building AutoEncoder flux_utils.py:62
INFO Building AutoEncoder flux_utils.py:62
INFO Loading state dict from /home/derp/fluxgym/models/vae/ae.sft flux_utils.py:66
INFO Loading state dict from /home/derp/fluxgym/models/vae/ae.sft flux_utils.py:66
INFO Loading state dict from /home/derp/fluxgym/models/vae/ae.sft flux_utils.py:66
INFO Loaded AE: <All keys matched successfully> flux_utils.py:69
import network module: networks.lora_flux
INFO Loaded AE: <All keys matched successfully> flux_utils.py:69
INFO Loaded AE: <All keys matched successfully> flux_utils.py:69
2024-10-02 16:12:10 INFO [Dataset 0] train_util.py:2328
INFO caching latents with caching strategy. train_util.py:988
INFO checking cache validity... train_util.py:998
100%|██████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 364472.14it/s]
2024-10-02 16:12:10 INFO [Dataset 0] train_util.py:2328
2024-10-02 16:12:10 INFO [Dataset 0] train_util.py:2328
INFO caching latents with caching strategy. train_util.py:988
INFO caching latents with caching strategy. train_util.py:988
INFO checking cache validity... train_util.py:998
INFO checking cache validity... train_util.py:998
100%|██████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 437366.57it/s]
100%|████████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 2474.29it/s]
2024-10-02 16:12:11 INFO no latents to cache train_util.py:1038
[rank2]:[W1002 16:12:11.735902281 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank1]:[W1002 16:12:11.739201920 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W1002 16:12:11.859788105 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
2024-10-02 16:12:11 INFO move vae and unet to cpu to save memory flux_train_network.py:208
INFO move vae and unet to cpu to save memory flux_train_network.py:208
2024-10-02 16:12:11 INFO move vae and unet to cpu to save memory flux_train_network.py:208
INFO move text encoders to gpu flux_train_network.py:216
INFO move text encoders to gpu flux_train_network.py:216
INFO move text encoders to gpu flux_train_network.py:216
2024-10-02 16:12:33 INFO [Dataset 0] train_util.py:2349
INFO caching Text Encoder outputs with caching strategy. train_util.py:1111
INFO checking cache validity... train_util.py:1117
100%|████████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 1756.30it/s]
2024-10-02 16:12:34 INFO no Text Encoder outputs to cache train_util.py:1139
INFO cache Text Encoder outputs for sample prompt: flux_train_network.py:232
/home/derp/fluxgym/outputs/my-special-lora/sample_prompts.txt
INFO cache Text Encoder outputs for prompt: my-special-lora flux_train_network.py:243
2024-10-02 16:12:34 INFO [Dataset 0] train_util.py:2349
INFO caching Text Encoder outputs with caching strategy. train_util.py:1111
2024-10-02 16:12:34 INFO [Dataset 0] train_util.py:2349
INFO checking cache validity... train_util.py:1117
0%| | 0/317 [00:00<?, ?it/s] INFO caching Text Encoder outputs with caching strategy. train_util.py:1111
INFO checking cache validity... train_util.py:1117
100%|████████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 4201.50it/s]
100%|████████████████████████████████████████████████████████████████████████████| 317/317 [00:00<00:00, 4266.01it/s]
INFO no Text Encoder outputs to cache train_util.py:1139
INFO no Text Encoder outputs to cache train_util.py:1139
INFO cache Text Encoder outputs for sample prompt: flux_train_network.py:232
/home/derp/fluxgym/outputs/my-special-lora/sample_prompts.txt
INFO cache Text Encoder outputs for sample prompt: flux_train_network.py:232
/home/derp/fluxgym/outputs/my-special-lora/sample_prompts.txt
INFO cache Text Encoder outputs for prompt: my-special-lora flux_train_network.py:243
INFO cache Text Encoder outputs for prompt: my-special-lora flux_train_network.py:243
INFO cache Text Encoder outputs for prompt: flux_train_network.py:243
INFO cache Text Encoder outputs for prompt: flux_train_network.py:243
INFO cache Text Encoder outputs for prompt: flux_train_network.py:243
2024-10-02 16:12:36 INFO move t5XXL back to cpu flux_train_network.py:256
2024-10-02 16:12:36 INFO move t5XXL back to cpu flux_train_network.py:256
2024-10-02 16:12:36 INFO move t5XXL back to cpu flux_train_network.py:256
2024-10-02 16:12:41 INFO move vae and unet back to original device flux_train_network.py:261
INFO create LoRA network. base dim (rank): 8, alpha: 8.0 lora_flux.py:484
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora_flux.py:485
INFO train all blocks only lora_flux.py:495
INFO create LoRA for Text Encoder 1: lora_flux.py:576
2024-10-02 16:12:41 INFO move vae and unet back to original device flux_train_network.py:261
INFO create LoRA network. base dim (rank): 8, alpha: 8.0 lora_flux.py:484
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora_flux.py:485
INFO train all blocks only lora_flux.py:495
INFO create LoRA for Text Encoder 1: lora_flux.py:576
INFO create LoRA for Text Encoder 1: 72 modules. lora_flux.py:579
INFO create LoRA for Text Encoder 1: 72 modules. lora_flux.py:579
2024-10-02 16:12:41 INFO move vae and unet back to original device flux_train_network.py:261
INFO create LoRA network. base dim (rank): 8, alpha: 8.0 lora_flux.py:484
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora_flux.py:485
INFO train all blocks only lora_flux.py:495
INFO create LoRA for Text Encoder 1: lora_flux.py:576
INFO create LoRA for Text Encoder 1: 72 modules. lora_flux.py:579
INFO create LoRA for FLUX all blocks: 304 modules. lora_flux.py:593
INFO enable LoRA for text encoder: 72 modules lora_flux.py:736
INFO enable LoRA for U-Net: 304 modules lora_flux.py:741
FLUX: Gradient checkpointing enabled. CPU offload: False
INFO Text Encoder 1 (CLIP-L): 72 modules, LR 0.0001 lora_flux.py:843
INFO use Adafactor optimizer | {'relative_step': False, 'scale_parameter': train_util.py:4541
False, 'warmup_init': False}
INFO create LoRA for FLUX all blocks: 304 modules. lora_flux.py:593
INFO enable LoRA for text encoder: 72 modules lora_flux.py:736
INFO enable LoRA for U-Net: 304 modules lora_flux.py:741
FLUX: Gradient checkpointing enabled. CPU offload: False
prepare optimizer, data loader etc.
INFO Text Encoder 1 (CLIP-L): 72 modules, LR 0.0001 lora_flux.py:843
INFO use Adafactor optimizer | {'relative_step': False, 'scale_parameter': train_util.py:4541
False, 'warmup_init': False}
override steps. steps for 40 epochs is / 指定エポックまでのステップ数: 12680
enable fp8 training for U-Net.
enable fp8 training for Text Encoder.
2024-10-02 16:12:42 INFO create LoRA for FLUX all blocks: 304 modules. lora_flux.py:593
INFO enable LoRA for text encoder: 72 modules lora_flux.py:736
INFO enable LoRA for U-Net: 304 modules lora_flux.py:741
FLUX: Gradient checkpointing enabled. CPU offload: False
INFO Text Encoder 1 (CLIP-L): 72 modules, LR 0.0001 lora_flux.py:843
INFO use Adafactor optimizer | {'relative_step': False, 'scale_parameter': train_util.py:4541
False, 'warmup_init': False}
2024-10-02 16:13:45 INFO prepare CLIP-L for fp8: set to torch.float8_e4m3fn, set flux_train_network.py:464
embeddings to torch.bfloat16
[2024-10-02 16:13:45,963] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 3
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/derp/fluxgym/sd-scripts/flux_train_network.py", line 519, in <module>
[rank1]: trainer.train(args)
[rank1]: File "/home/derp/fluxgym/sd-scripts/train_network.py", line 590, in train
[rank1]: ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank1]: result = self._prepare_deepspeed(*args)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
[rank1]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank1]: engine = DeepSpeedEngine(args=args,
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 269, in __init__
[rank1]: self._configure_distributed_model(model)
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1149, in _configure_distributed_model
[rank1]: self.module.bfloat16()
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in bfloat16
[rank1]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: [Previous line repeated 3 more times]
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank1]: param_applied = fn(param)
[rank1]: ^^^^^^^^^
[rank1]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in <lambda>
[rank1]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank1]: ^^^^^^^^^^^^
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 1 has a total capacity of 15.89 GiB of which 67.12 MiB is free. Including non-PyTorch memory, this process has 15.82 GiB memory in use. Of the allocated memory 15.18 GiB is allocated by PyTorch, and 291.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2024-10-02 16:13:47 INFO prepare CLIP-L for fp8: set to torch.float8_e4m3fn, set flux_train_network.py:464
embeddings to torch.bfloat16
2024-10-02 16:13:47 INFO prepare CLIP-L for fp8: set to torch.float8_e4m3fn, set flux_train_network.py:464
embeddings to torch.bfloat16
[2024-10-02 16:13:47,178] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.15.1, git-hash=unknown, git-branch=unknown
[2024-10-02 16:13:47,179] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 3
[2024-10-02 16:13:47,181] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 3
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/derp/fluxgym/sd-scripts/flux_train_network.py", line 519, in <module>
[rank0]: trainer.train(args)
[rank0]: File "/home/derp/fluxgym/sd-scripts/train_network.py", line 590, in train
[rank0]: ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank0]: result = self._prepare_deepspeed(*args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
[rank0]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 269, in __init__
[rank0]: self._configure_distributed_model(model)
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1149, in _configure_distributed_model
[rank0]: self.module.bfloat16()
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in bfloat16
[rank0]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: [Previous line repeated 3 more times]
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank0]: param_applied = fn(param)
[rank0]: ^^^^^^^^^
[rank0]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in <lambda>
[rank0]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank0]: ^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacity of 15.89 GiB of which 67.12 MiB is free. Including non-PyTorch memory, this process has 15.82 GiB memory in use. Of the allocated memory 15.18 GiB is allocated by PyTorch, and 291.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/derp/fluxgym/sd-scripts/flux_train_network.py", line 519, in <module>
[rank2]: trainer.train(args)
[rank2]: File "/home/derp/fluxgym/sd-scripts/train_network.py", line 590, in train
[rank2]: ds_model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank2]: result = self._prepare_deepspeed(*args)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
[rank2]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank2]: engine = DeepSpeedEngine(args=args,
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 269, in __init__
[rank2]: self._configure_distributed_model(model)
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1149, in _configure_distributed_model
[rank2]: self.module.bfloat16()
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in bfloat16
[rank2]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: [Previous line repeated 3 more times]
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank2]: param_applied = fn(param)
[rank2]: ^^^^^^^^^
[rank2]: File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1176, in <lambda>
[rank2]: return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
[rank2]: ^^^^^^^^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 2 has a total capacity of 15.89 GiB of which 67.12 MiB is free. Including non-PyTorch memory, this process has 15.82 GiB memory in use. Of the allocated memory 15.18 GiB is allocated by PyTorch, and 291.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W1002 16:13:48.643604082 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W1002 16:13:48.300000 1358 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1424 closing signal SIGTERM
W1002 16:13:48.303000 1358 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1426 closing signal SIGTERM
E1002 16:13:48.619000 1358 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 1425) of binary: /home/derp/fluxgym/env/bin/python
Traceback (most recent call last):
File "/home/derp/fluxgym/env/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/derp/fluxgym/env/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sd-scripts/flux_train_network.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-02_16:13:48
host : t7910.lab.local
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1425)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
The root cause is visible in the traceback: during `deepspeed.initialize`, `_configure_distributed_model` calls `self.module.bfloat16()`, so each of the three ranks converts the full FLUX model on its own card, and every P100 is already at ~15.8 GiB when the 72 MiB allocation fails. Notably, if a fourth GPU, an RTX A4000, is installed, there is no OOM error!
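For reference, the launcher's own startup hint ("If this was unintended please pass in `--num_processes=1`") points at one possible workaround: run a single process on a single GPU, so only one rank has to hold the full model. A minimal sketch, assuming the same virtualenv and paths as above and keeping the remaining train.sh arguments unchanged (untested here):

(env) $ CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --num_machines=1 \
        sd-scripts/flux_train_network.py <remaining arguments from train.sh>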
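The OOM message itself also suggests the expandable-segments allocator. It may or may not free enough of the 291.97 MiB that PyTorch reports as reserved but unallocated, but it is cheap to try:

(env) $ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True bash outputs/my-special-lora/train.sh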