cocktailpeanut / fluxgym

Dead simple FLUX LoRA training UI with LOW VRAM support

Training Error (Stuck at deadlock) #44

Closed: pranauv1 closed this 1 month ago

pranauv1 commented 1 month ago

I got stuck at the 1st epoch; it seems like a deadlock error. Any ideas?

accelerate launch \
  --mixed_precision bf16 \
  --num_cpu_threads_per_process 1 \
  sd-scripts/flux_train_network.py \
  --pretrained_model_name_or_path "/kaggle/models/unet/flux1-dev.sft" \
  --clip_l "/kaggle/models/clip/clip_l.safetensors" \
  --t5xxl "/kaggle/models/clip/t5xxl_fp16.safetensors" \
  --ae "/kaggle/models/vae/ae.sft" \
  --cache_latents_to_disk \
  --save_model_as safetensors \
  --sdpa --persistent_data_loader_workers \
  --max_data_loader_n_workers 2 \
  --seed 42 \
  --gradient_checkpointing \
  --mixed_precision bf16 \
  --save_precision bf16 \
  --network_module networks.lora_flux \
  --network_dim 4 \
  --optimizer_type adafactor \
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --lr_scheduler constant_with_warmup \
  --max_grad_norm 0.0 \
  --learning_rate 8e-4 \
  --cache_text_encoder_outputs \
  --cache_text_encoder_outputs_to_disk \
  --fp8_base \
  --highvram \
  --max_train_epochs 7 \
  --save_every_n_epochs 5 \
  --dataset_config "/kaggle/working/fluxgym/dataset.toml" \
  --output_dir "/kaggle/working/fluxgym/outputs" \
  --output_name pranav-lora \
  --timestep_sampling shift \
  --discrete_flow_shift 3.1582 \
  --model_prediction_type raw \
  --guidance_scale 1 \
  --loss_type l2

Logs:

[2024-09-10 08:15:37] [INFO] enable fp8 training for Text Encoder.
[2024-09-10 08:17:37] [INFO] prepare CLIP-L for fp8: flux_train_network.py:464
[2024-09-10 08:17:37] [INFO] set to torch.float8_e4m3fn, set embeddings to torch.bfloat16
[2024-09-10 08:17:38] [INFO] running training / 学習開始
[2024-09-10 08:17:38] [INFO] num train images * repeats / 学習画像の数×繰り返し回数: 112
[2024-09-10 08:17:38] [INFO] num reg images / 正則化画像の数: 0
[2024-09-10 08:17:38] [INFO] num batches per epoch / 1epochのバッチ数: 112
[2024-09-10 08:17:38] [INFO] num epochs / epoch数: 10
[2024-09-10 08:17:38] [INFO] batch size per device / バッチサイズ: 1
[2024-09-10 08:17:38] [INFO] gradient accumulation steps / 勾配を合計するステップ数 = 1
[2024-09-10 08:17:38] [INFO] total optimization steps / 学習ステップ数: 1120
[2024-09-10 08:20:16] [INFO] steps: 0%|          | 0/1120 [00:00<?, ?it/s] unet dtype: train_network.py:1046
[2024-09-10 08:20:16] [INFO] torch.float8_e4m3fn, device: cuda:0
[2024-09-10 08:20:16] [INFO] text_encoder [0] dtype: train_network.py:1052
[2024-09-10 08:20:16] [INFO] torch.float8_e4m3fn, device: cuda:0
[2024-09-10 08:20:16] [INFO] text_encoder [1] dtype: torch.bfloat16, device: cpu
[2024-09-10 08:20:16] [INFO] epoch 1/10
[2024-09-10 08:20:16] [INFO] /opt/conda/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
[2024-09-10 08:20:16] [INFO] self.pid = os.fork()
[2024-09-10 08:20:16] [INFO] huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
[2024-09-10 08:20:16] [INFO] To disable this warning, you can either:
[2024-09-10 08:20:16] [INFO] - Avoid using `tokenizers` before the fork if possible
[2024-09-10 08:20:16] [INFO] - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[2024-09-10 08:20:16] [INFO] epoch is incremented. train_util.py:668
[2024-09-10 08:20:16] [INFO] current_epoch: 0, epoch: 1
[2024-09-10 08:20:22] [INFO] /opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
[2024-09-10 08:20:31] [INFO] /opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
JohnsonXi commented 1 month ago

I got the same problem; it sometimes gets stuck.

[screenshot attached]
danmayer commented 1 month ago

I saw this in my output

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
[2024-09-10 21:43:44] [INFO] To disable this warning, you can either:
[2024-09-10 21:43:44] [INFO] - Avoid using `tokenizers` before the fork if possible
[2024-09-10 21:43:44] [INFO] - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

and it was stuck as well. I fixed it by setting the environment variable below:

TOKENIZERS_PARALLELISM=false python app.py
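
If you would rather not remember the prefix, the same thing can be done from Python, e.g. near the top of app.py or a launcher script (a minimal sketch, not something fluxgym ships today; the placement is an assumption):

import os

# Disable the HF tokenizers thread pool so the forked data-loader workers
# don't inherit an already-parallel tokenizer (the situation the warning
# above describes). Set this as early as possible, before any tokenization
# has run.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")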

pranauv1 commented 1 month ago

Running the command separately solved the problem.

I prepared the dataset through Flux Gym and then ran the Kohya script from the command line. It printed the same warnings mentioned above, but the training finished successfully.
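
For anyone reproducing this workaround on Kaggle, one way to run the generated command with the variable already set is to launch it via subprocess (a sketch; the flag list below is abbreviated and assumes the same arguments as the command at the top of this issue):

import os
import subprocess

# Run the generated Kohya command from a notebook cell with tokenizers
# parallelism disabled in the child process. The flag list is abbreviated;
# in practice it is the full accelerate launch command shown above.
env = dict(os.environ, TOKENIZERS_PARALLELISM="false")
cmd = [
    "accelerate", "launch",
    "--mixed_precision", "bf16",
    "--num_cpu_threads_per_process", "1",
    "sd-scripts/flux_train_network.py",
    "--dataset_config", "/kaggle/working/fluxgym/dataset.toml",
    "--output_dir", "/kaggle/working/fluxgym/outputs",
    "--output_name", "pranav-lora",
    # ...remaining flags exactly as generated by fluxgym
]
subprocess.run(cmd, env=env, check=True)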