Open shaun-ba opened 1 month ago
I have one saved safetensors file, and I'm not sure what is supposed to happen after the logs below. I expected 16 files, one for each epoch, so was this the first one, after which it failed to resume?
steps: 100%|█████████▉| 2718/2720 [1:53:38<00:05, 2.51s/it, avr_loss=0.332]
steps: 100%|█████████▉| 2718/2720 [1:53:38<00:05, 2.51s/it, avr_loss=0.331]
steps: 100%|█████████▉| 2719/2720 [1:53:40<00:02, 2.51s/it, avr_loss=0.331]
steps: 100%|█████████▉| 2719/2720 [1:53:40<00:02, 2.51s/it, avr_loss=0.332]
steps: 100%|██████████| 2720/2720 [1:53:42<00:00, 2.51s/it, avr_loss=0.332]
steps: 100%|██████████| 2720/2720 [1:53:42<00:00, 2.51s/it, avr_loss=0.332]
[2024-10-21 11:16:10] [INFO] saving checkpoint: /home/1.safetensors
[2024-10-21 11:16:10] [INFO] 2024-10-21 11:16:10 INFO model saved. train_network.py:1313
[2024-10-21 11:16:10] [INFO] steps: 100%|██████████| 2720/2720 [1:53:43<00:00, 2.51s/it, avr_loss=0.332]
[2024-10-21 11:17:04] [INFO] Traceback (most recent call last):
[2024-10-21 11:17:04] [INFO] File "/training/fluxgym/env/bin/accelerate", line 8, in <module>
[2024-10-21 11:17:04] [INFO] sys.exit(main())
[2024-10-21 11:17:04] [INFO] ^^^^^^
[2024-10-21 11:17:04] [INFO] File "/training/fluxgym/env/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
[2024-10-21 11:17:04] [INFO] args.func(args)
[2024-10-21 11:17:04] [INFO] File "/training/fluxgym/env/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
[2024-10-21 11:17:04] [INFO] simple_launcher(args)
[2024-10-21 11:17:04] [INFO] File "/training/fluxgym/env/lib/python3.12/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
[2024-10-21 11:17:04] [INFO] raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
[2024-10-21 11:17:04] [INFO] subprocess.CalledProcessError: Command '['/training/fluxgym/env/bin/python3', 'sd-scripts/flux_train_network.py', '--pretrained_model_name_or_path', '/training/fluxgym/models/unet/flux1-dev.sft', '--clip_l', '/training/fluxgym/models/clip/clip_l.safetensors', '--t5xxl', '/training/fluxgym/models/clip/t5xxl_fp16.safetensors', '--ae', '/training/fluxgym/models/vae/ae.sft', '--cache_latents_to_disk', '--save_model_as', 'safetensors', '--sdpa', '--persistent_data_loader_workers', '--max_data_loader_n_workers', '2', '--seed', '42', '--gradient_checkpointing', '--mixed_precision', 'bf16', '--save_precision', 'bf16', '--network_module', 'networks.lora_flux', '--network_dim', '4', '--optimizer_type', 'adamw8bit', '--sample_prompts=/sample_prompts.txt', '--sample_every_n_steps=64', '--learning_rate', '8e-4', '--cache_text_encoder_outputs', '--cache_text_encoder_outputs_to_disk', '--fp8_base', '--highvram', '--max_train_epochs', '8', '--save_every_n_epochs', '4', '--dataset_config', '/dataset.toml', '--output_dir', '/', '--output_name', '1', '--timestep_sampling', 'shift', '--discrete_flow_shift', '3.1582', '--model_prediction_type', 'raw', '--guidance_scale', '1', '--loss_type', 'l2']' died with <Signals.SIGSEGV: 11>.
[2024-10-21 11:17:07] [ERROR] Command exited with code 1
[2024-10-21 11:17:07] [INFO] Runner: <LogsViewRunner nb_logs=4059 exit_code=1>
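For reference, here is a rough sketch of what the flags in the command above would imply for the number of saved files. It rests on two assumptions that the log does not confirm: that `--save_every_n_epochs` writes an intermediate file at every multiple of that value before the last epoch, with one final save when training ends, and that the 2720 steps are split evenly across epochs. The helper `expected_checkpoint_epochs` is hypothetical and is not part of sd-scripts.

```python
def expected_checkpoint_epochs(max_train_epochs: int, save_every_n_epochs: int) -> list[int]:
    """Hypothetical helper: epochs after which a file would be written.

    Assumes intermediate saves at multiples of save_every_n_epochs that fall
    before the last epoch, plus one final save at the end of training.
    """
    intermediate = [
        epoch
        for epoch in range(1, max_train_epochs)
        if epoch % save_every_n_epochs == 0
    ]
    return intermediate + [max_train_epochs]


# Values taken from the command line and progress bar in the log above.
total_steps = 2720
max_train_epochs = 8
save_every_n_epochs = 4

# Assumes steps are divided evenly across epochs.
steps_per_epoch = total_steps // max_train_epochs

checkpoint_epochs = expected_checkpoint_epochs(max_train_epochs, save_every_n_epochs)
print(f"steps per epoch: {steps_per_epoch}")
print(f"saves expected after epochs: {checkpoint_epochs}")
```

Under those assumptions the run would save after epoch 4 and again at the end of epoch 8, so two files rather than 16. The progress bar reaching 2720/2720 before "model saved." suggests the logged save happened at the very end of training, with the SIGSEGV following about a minute later.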