cocktailpeanut / fluxgym

Dead simple FLUX LoRA training UI with LOW VRAM support

CUDA Out of Memory Error Despite Sufficient GPU Memory #40

Open catlog66 opened 1 week ago

catlog66 commented 1 week ago

My GPU is a 4060 Ti with 16 GB of VRAM, and I have 32 GB of RAM. I am encountering a CUDA Out of Memory error when training a network with the flux_train_network.py script, even though the system shows that there is sufficient available GPU memory. The issue occurs even after switching to the 12 GB VRAM mode.

[2024-09-09 17:43:57] [INFO] Traceback (most recent call last):
[2024-09-09 17:43:57] [INFO] File "F:\Flux_Gym\fluxgym\sd-scripts\flux_train_network.py", line 519, in <module>
[2024-09-09 17:43:57] [INFO] trainer.train(args)
[2024-09-09 17:43:57] [INFO] File "F:\Flux_Gym\fluxgym\sd-scripts\train_network.py", line 402, in train
[2024-09-09 17:43:57] [INFO] self.cache_text_encoder_outputs_if_needed(args, accelerator, unet, vae, text_encoders, train_dataset_group, weight_dtype)
[2024-09-09 17:43:57] [INFO] File "F:\Flux_Gym\fluxgym\sd-scripts\flux_train_network.py", line 218, in cache_text_encoder_outputs_if_needed
[2024-09-09 17:43:57] [INFO] text_encoders[1].to(accelerator.device)
[2024-09-09 17:43:57] [INFO] File "F:\Flux_Gym\fluxgym\env\lib\site-packages\transformers\modeling_utils.py", line 2883, in to
[2024-09-09 17:43:57] [INFO] return super().to(*args, **kwargs)
[2024-09-09 17:43:57] [INFO] File "F:\Flux_Gym\fluxgym\env\lib\site-packages\torch\nn\modules\module.py", line 1340, in to
[2024-09-09 17:43:57] [INFO] return self._apply(convert)
[2024-09-09 17:43:57] [INFO] File "F:\Flux_Gym\fluxgym\env\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
[2024-09-09 17:43:57] [INFO] module._apply(fn)
[2024-09-09 17:43:57] [INFO] File "F:\Flux_Gym\fluxgym\env\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
[2024-09-09 17:43:57] [INFO] module._apply(fn)
[2024-09-09 17:43:57] [INFO] File "F:\Flux_Gym\fluxgym\env\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
[2024-09-09 17:43:57] [INFO] module._apply(fn)
[2024-09-09 17:43:57] [INFO] [Previous line repeated 4 more times]
[2024-09-09 17:43:57] [INFO] File "F:\Flux_Gym\fluxgym\env\lib\site-packages\torch\nn\modules\module.py", line 927, in _apply
[2024-09-09 17:43:57] [INFO] param_applied = fn(param)
[2024-09-09 17:43:57] [INFO] File "F:\Flux_Gym\fluxgym\env\lib\site-packages\torch\nn\modules\module.py", line 1326, in convert
[2024-09-09 17:43:57] [INFO] return t.to(
[2024-09-09 17:43:57] [INFO] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 16.00 GiB of which 6.98 GiB is free. Of the allocated memory 7.87 GiB is allocated by PyTorch, and 14.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-09-09 17:43:58] [INFO] Traceback (most recent call last):
[2024-09-09 17:43:58] [INFO] File "c:\users\lzmmm\appdata\local\programs\python\python310\lib\runpy.py", line 196, in _run_module_as_main
[2024-09-09 17:43:58] [INFO] return _run_code(code, main_globals, None,
[2024-09-09 17:43:58] [INFO] File "c:\users\lzmmm\appdata\local\programs\python\python310\lib\runpy.py", line 86, in _run_code
[2024-09-09 17:43:58] [INFO] exec(code, run_globals)
[2024-09-09 17:43:58] [INFO] File "F:\Flux_Gym\fluxgym\env\Scripts\accelerate.exe\__main__.py", line 7, in <module>
[2024-09-09 17:43:58] [INFO] File "F:\Flux_Gym\fluxgym\env\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
[2024-09-09 17:43:58] [INFO] args.func(args)
[2024-09-09 17:43:58] [INFO] File "F:\Flux_Gym\fluxgym\env\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
[2024-09-09 17:43:58] [INFO] simple_launcher(args)
[2024-09-09 17:43:58] [INFO] File "F:\Flux_Gym\fluxgym\env\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
[2024-09-09 17:43:58] [INFO] raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
[2024-09-09 17:43:58] [INFO] subprocess.CalledProcessError: Command '['F:\Flux_Gym\fluxgym\env\Scripts\python.exe', 'sd-scripts/flux_train_network.py', '--pretrained_model_name_or_path', 'F:\Flux_Gym\fluxgym\models\unet\flux1-dev.sft', '--clip_l', 'F:\Flux_Gym\fluxgym\models\clip\clip_l.safetensors', '--t5xxl', 'F:\Flux_Gym\fluxgym\models\clip\t5xxl_fp16.safetensors', '--ae', 'F:\Flux_Gym\fluxgym\models\vae\ae.sft', '--cache_latents_to_disk', '--save_model_as', 'safetensors', '--sdpa', '--persistent_data_loader_workers', '--max_data_loader_n_workers', '2', '--seed', '0', '--gradient_checkpointing', '--mixed_precision', 'bf16', '--save_precision', 'bf16', '--network_module', 'networks.lora_flux', '--network_dim', '4', '--optimizer_type', 'adafactor', '--optimizer_args', 'relative_step=False', 'scale_parameter=False', 'warmup_init=False', '--split_mode', '--network_args', 'train_blocks=single', '--lr_scheduler', 'constant_with_warmup', '--max_grad_norm', '0.0', '--learning_rate', '8e-4', '--cache_text_encoder_outputs', '--cache_text_encoder_outputs_to_disk', '--fp8_base', '--highvram', '--max_train_epochs', '16', '--save_every_n_epochs', '4', '--dataset_config', 'F:\Flux_Gym\fluxgym\dataset.toml', '--output_dir', 'F:\Flux_Gym\fluxgym\outputs', '--output_name', 'tuwai-lora', '--timestep_sampling', 'shift', '--discrete_flow_shift', '3.1582', '--model_prediction_type', 'raw', '--guidance_scale', '1', '--loss_type', 'l2']' returned non-zero exit status 1.
[2024-09-09 17:43:59] [ERROR] Command exited with code 1
[2024-09-09 17:43:59] [INFO] Runner:

I also tried setting the environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True as recommended in the PyTorch documentation, but the issue persists.
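
A note on that allocator setting: it only takes effect if it is set in the same console session (or in the Windows system environment variables) before accelerate launch starts, so that the training subprocess inherits it. Below is a minimal sketch for a cmd session, assuming the same FluxGym install path and virtual environment as in the logs above; it only changes allocator behaviour and, as reported above, is not guaranteed to fix the OOM.

rem run these in the console that will launch training, then start the
rem accelerate launch command from the Steps to Reproduce section below
cd /d F:\Flux_Gym\fluxgym
call env\Scripts\activate.bat
set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True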

Steps to Reproduce:

Run the following command with the provided configurations:

accelerate launch ^
  --mixed_precision bf16 ^
  --num_cpu_threads_per_process 1 ^
  sd-scripts/flux_train_network.py ^
  --pretrained_model_name_or_path "F:\Flux_Gym\fluxgym\models\unet\flux1-dev.sft" ^
  --clip_l "F:\Flux_Gym\fluxgym\models\clip\clip_l.safetensors" ^
  --t5xxl "F:\Flux_Gym\fluxgym\models\clip\t5xxl_fp16.safetensors" ^
  --ae "F:\Flux_Gym\fluxgym\models\vae\ae.sft" ^
  --cache_latents_to_disk ^
  --save_model_as safetensors ^
  --sdpa --persistent_data_loader_workers ^
  --max_data_loader_n_workers 2 ^
  --seed 0 ^
  --gradient_checkpointing ^
  --mixed_precision bf16 ^
  --save_precision bf16 ^
  --network_module networks.lora_flux ^
  --network_dim 4 ^
  --optimizer_type adafactor ^
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" ^
  --split_mode ^
  --network_args "train_blocks=single" ^
  --lr_scheduler constant_with_warmup ^
  --max_grad_norm 0.0 ^
  --learning_rate 8e-4 ^
  --cache_text_encoder_outputs ^
  --cache_text_encoder_outputs_to_disk ^
  --fp8_base ^
  --highvram ^
  --max_train_epochs 16 ^
  --save_every_n_epochs 4 ^
  --dataset_config "F:\Flux_Gym\fluxgym\dataset.toml" ^
  --output_dir "F:\Flux_Gym\fluxgym\outputs" ^
  --output_name tuwai-lora ^
  --timestep_sampling shift ^
  --discrete_flow_shift 3.1582 ^
  --model_prediction_type raw ^
  --guidance_scale 1 ^
  --loss_type l2

Observe the following error when running the script:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 16.00 GiB of which 6.98 GiB is free. Of the allocated memory 7.87 GiB is allocated by PyTorch, and 14.33 MiB is reserved by PyTorch but unallocated.

Expected Behavior: The model should train without running out of memory, especially given that there appears to be sufficient free GPU memory available.

Actual Behavior: The training script fails with a CUDA Out of Memory error, even though the system shows there is around 6.98 GB of free memory.
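
For context on why the allocation can fail despite the reported free memory: the failing call in the traceback is text_encoders[1].to(accelerator.device), i.e. moving the T5-XXL text encoder onto the GPU while the FLUX model is already partly resident, and the fp16 T5-XXL weights alone are roughly 9 GB, which does not fit into ~7 GiB of free VRAM even though the single allocation that finally failed was only 80 MiB. To see what the driver itself reports while training starts (independent of Task Manager), nvidia-smi's query mode can be polled; a small sketch, assuming nvidia-smi from the NVIDIA driver is on PATH:

rem print total / used / free VRAM once per second while training starts
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv -l 1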

utsavbhat commented 1 week ago

I'm getting the same error when training with the 12 GB VRAM option. What I noticed is that the error seems to be caused by running out of system RAM rather than VRAM. I have 32 GB of RAM, and when I hit the train button both VRAM and RAM usage increase; once the RAM is maxed out, the error above appears even though there is still free VRAM.
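
One way to confirm that RAM-exhaustion pattern on Windows is to log free system memory while training starts; typeperf ships with Windows and "\Memory\Available MBytes" is a standard performance counter (the sampling interval below is just an example):

rem log free system RAM every 5 seconds; watch it approach zero as the models load
typeperf "\Memory\Available MBytes" -si 5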

catlog66 commented 1 week ago

I'm getting the same error when training with the 12 GB VRAM option. What I noticed is that the error seems to be caused by running out of system RAM rather than VRAM. I have 32 GB of RAM, and when I hit the train button both VRAM and RAM usage increase; once the RAM is maxed out, the error above appears even though there is still free VRAM.

Try reducing the number of images in your dataset.

billysb commented 1 week ago

You can also increase your system swap / page file if it is RAM related. However, this comes with a severe performance penalty, so depending on how desperate you are, the much longer wait may or may not be worth it.
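
For reference, the page file can be enlarged either interactively (System Properties > Advanced > Performance settings > Advanced > Virtual memory) or from an elevated command prompt. The wmic sketch below uses the standard Win32_PageFileSetting properties, but the drive letter and sizes (in MB) are only examples, and a reboot is required afterwards:

rem stop Windows from managing the page file size automatically
wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
rem set an initial 32 GB / maximum 48 GB page file on C:
wmic pagefileset where name="C:\\pagefile.sys" set InitialSize=32768,MaximumSize=49152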

catlog66 commented 1 week ago

You can also increase your system swap / page file if it is RAM related. However, this comes with a severe performance penalty, so depending on how desperate you are, the much longer wait may or may not be worth it.

In fact, the issue was insufficient GPU memory, not system memory. I replaced the underlying base model with one that uses FP8 precision, and now the error is gone.
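
For anyone trying the same workaround: the only change to the training command is the --pretrained_model_name_or_path value, which is pointed at an fp8 export of FLUX.1-dev instead of the bf16 flux1-dev.sft. The file name below is only an example of such an export, every other flag (including --fp8_base) stays as in the command above, and as the follow-up below shows, not every fp8 checkpoint loads cleanly:

rem only this flag changes; everything else in the training command stays the same
--pretrained_model_name_or_path "F:\Flux_Gym\fluxgym\models\unet\flux1-dev-fp8.safetensors"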

igor-kurenkov commented 1 week ago

I have the same error. I decreased the number of pictures to 4, and it made no difference. Windows 10 x64, RTX 4060 Ti 16 GB VRAM, 32 GB RAM.

igor-kurenkov commented 1 week ago

I replaced the underlying large model with one that uses FP8 precision, and now the error is gone

How exactly did you do it? I replaced it with flux-dev-fp8.safetensors and edited --pretrained_model_name_or_path accordingly. Now I get another error:

[2024-09-11 23:19:12] [INFO] 2024-09-11 23:19:12 INFO     move vae and unet to cpu flux_train_network.py:208
[2024-09-11 23:19:12] [INFO] to save memory
[2024-09-11 23:19:12] [INFO] Traceback (most recent call last):
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\sd-scripts\flux_train_network.py", line 519, in <module>
[2024-09-11 23:19:12] [INFO] trainer.train(args)
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\sd-scripts\train_network.py", line 402, in train
[2024-09-11 23:19:12] [INFO] self.cache_text_encoder_outputs_if_needed(args, accelerator, unet, vae, text_encoders, train_dataset_group, weight_dtype)
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\sd-scripts\flux_train_network.py", line 212, in cache_text_encoder_outputs_if_needed
[2024-09-11 23:19:12] [INFO] unet.to("cpu")
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\venv\lib\site-packages\torch\nn\modules\module.py", line 1340, in to
[2024-09-11 23:19:12] [INFO] return self._apply(convert)
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\venv\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
[2024-09-11 23:19:12] [INFO] module._apply(fn)
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\venv\lib\site-packages\torch\nn\modules\module.py", line 927, in _apply
[2024-09-11 23:19:12] [INFO] param_applied = fn(param)
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\venv\lib\site-packages\torch\nn\modules\module.py", line 1333, in convert
[2024-09-11 23:19:12] [INFO] raise NotImplementedError(
[2024-09-11 23:19:12] [INFO] NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
[2024-09-11 23:19:12] [INFO] Traceback (most recent call last):
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\python310\lib\runpy.py", line 196, in _run_module_as_main
[2024-09-11 23:19:12] [INFO] return _run_code(code, main_globals, None,
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\python310\lib\runpy.py", line 86, in _run_code
[2024-09-11 23:19:12] [INFO] exec(code, run_globals)
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
[2024-09-11 23:19:12] [INFO] args.func(args)
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
[2024-09-11 23:19:12] [INFO] simple_launcher(args)
[2024-09-11 23:19:12] [INFO] File "E:\_SD\FluxGym-AINetSD\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
[2024-09-11 23:19:12] [INFO] raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
[2024-09-11 23:19:12] [INFO] subprocess.CalledProcessError: Command '['E:\\_SD\\FluxGym-AINetSD\\venv\\Scripts\\python.exe', 'sd-scripts/flux_train_network.py', '--pretrained_model_name_or_path', 'E:\\_SD\\FluxGym-AINetSD\\models\\unet\\flux-dev-fp8.safetensors', '--clip_l', 'E:\\_SD\\FluxGym-AINetSD\\models\\clip\\clip_l.safetensors', '--t5xxl', 'E:\\_SD\\FluxGym-AINetSD\\models\\clip\\t5xxl_fp16.safetensors', '--ae', 'E:\\_SD\\FluxGym-AINetSD\\models\\vae\\ae.sft', '--cache_latents_to_disk', '--save_model_as', 'safetensors', '--sdpa', '--persistent_data_loader_workers', '--max_data_loader_n_workers', '2', '--seed', '42', '--gradient_checkpointing', '--mixed_precision', 'bf16', '--save_precision', 'bf16', '--network_module', 'networks.lora_flux', '--network_dim', '4', '--optimizer_type', 'adafactor', '--optimizer_args', 'relative_step=False', 'scale_parameter=False', 'warmup_init=False', '--lr_scheduler', 'constant_with_warmup', '--max_grad_norm', '0.0', '--learning_rate', '8e-4', '--cache_text_encoder_outputs', '--cache_text_encoder_outputs_to_disk', '--fp8_base', '--highvram', '--max_train_epochs', '8', '--save_every_n_epochs', '4', '--dataset_config', 'E:\\_SD\\FluxGym-AINetSD\\dataset.toml', '--output_dir', 'E:\\_SD\\FluxGym-AINetSD\\outputs', '--output_name', 'kurigo', '--timestep_sampling', 'shift', '--discrete_flow_shift', '3.1582', '--model_prediction_type', 'raw', '--guidance_scale', '1', '--loss_type', 'l2']' returned non-zero exit status 1.
[2024-09-11 23:19:13] [ERROR] Command exited with code 1
[2024-09-11 23:19:13] [INFO] Runner: <LogsViewRunner nb_logs=3239 exit_code=1>
catlog66 commented 1 week ago

I replaced the underlying large model with one that uses FP8 precision, and now the error is gone

How exactly did you do it? I replaced it with flux-dev-fp8.safetensors and edited --pretrained_model_name_or_path accordingly. Now I get another error.


Did you modify any training parameters? I used the default parameters and the FP8-precision model, but there was obvious noise after LoRA training. I plan to move training to the cloud; the training efficiency of the 4060 Ti is just too low.

catlog66 commented 1 week ago

I replaced the underlying large model with one that uses FP8 precision, and now the error is gone

How exactly did you do it? I replaced it with flux-dev-fp8.safetensors and edited --pretrained_model_name_or_path accordingly. Now I get another error.


If this error occurs, your system memory may be full. Try closing background programs, or use the method above to increase virtual memory.

igor-kurenkov commented 1 week ago

I replaced the underlying large model with one that uses FP8 precision, and now the error is gone

How exactly did you do it? I replaced it with flux-dev-fp8.safetensors and edited --pretrained_model_name_or_path accordingly. Now I get another error.


Did you modify any training parameters? I used the default parameters and the FP8-precision model, but there was obvious noise after LoRA training. I plan to move training to the cloud; the training efficiency of the 4060 Ti is just too low.

I tried both the defaults and reduced steps/epochs - it doesn't matter. Then I set the swap file to 30 GB and trained a test LoRA on a few images successfully. Then I tried about 15 images at 1024 px, and the training got stuck forever without any error.