cocktailpeanut / fluxgym

Dead simple FLUX LoRA training UI with LOW VRAM support

unable to train, #8

Open afrofail opened 1 week ago

afrofail commented 1 week ago

I have an Nvidia 4070 Super 12GB GPU and followed all the installation steps correctly, including trying the Easy Pinokio installer. Despite this, I keep encountering the same error at the end. The script is running with the 'venv' environment activated, and both the required dependencies and Nightly PyTorch are installed.

[2024-09-07 00:45:50] [INFO] F:\fluxgym\env\Lib\site-packages\torch\autograd\graph.py:818: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cudnn\MHA.cpp:672.)
[2024-09-07 00:45:50] [INFO] return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass

ircrp commented 1 week ago

Same here, on a 4090.

NVIDIA-SMI 560.35.03, Driver Version: 560.35.03, CUDA Version: 12.6
OS: Ubuntu 22.10
Brand new conda environment with Python 3.11.9
Torch version: 2.5.0.dev20240906+cu121

It seems to have crashed after 150 training steps:

steps:   6%|▌         | 150/2560 [02:18<37:05,  1.08it/s, avr_loss=nan]
Traceback (most recent call last):
[2024-09-07 08:47:52] [INFO] File "/home/me/ai/fluxgym/sd-scripts/flux_train_network.py", line 519, in <module>
[2024-09-07 08:47:52] [INFO] trainer.train(args)
[2024-09-07 08:47:52] [INFO] File "/home/me/ai/fluxgym/sd-scripts/train_network.py", line 1171, in train
[2024-09-07 08:47:52] [INFO] accelerator.backward(loss)
[2024-09-07 08:47:52] [INFO] File "/home/me/anaconda3/envs/fluxgym/lib/python3.11/site-packages/accelerate/accelerator.py", line 2159, in backward
[2024-09-07 08:47:52] [INFO] loss.backward(**kwargs)
[2024-09-07 08:47:52] [INFO] File "/home/me/anaconda3/envs/fluxgym/lib/python3.11/site-packages/torch/_tensor.py", line 581, in backward
[2024-09-07 08:47:52] [INFO] torch.autograd.backward(
[2024-09-07 08:47:52] [INFO] File "/home/me/anaconda3/envs/fluxgym/lib/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
[2024-09-07 08:47:52] [INFO] _engine_run_backward(
[2024-09-07 08:47:52] [INFO] File "/home/me/anaconda3/envs/fluxgym/lib/python3.11/site-packages/torch/autograd/graph.py", line 818, in _engine_run_backward
[2024-09-07 08:47:52] [INFO] return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[2024-09-07 08:47:52] [INFO] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-07 08:47:52] [INFO] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[2024-09-07 08:47:53] [INFO] steps:   6%|▌         | 150/2560 [02:19<37:19,  1.08it/s, avr_loss=nan]
[2024-09-07 08:47:54] [INFO] Traceback (most recent call last):
[2024-09-07 08:47:54] [INFO] File "/home/me/anaconda3/envs/fluxgym/bin/accelerate", line 8, in <module>
[2024-09-07 08:47:54] [INFO] sys.exit(main())
[2024-09-07 08:47:54] [INFO] ^^^^^^
[2024-09-07 08:47:54] [INFO] File "/home/me/anaconda3/envs/fluxgym/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
[2024-09-07 08:47:54] [INFO] args.func(args)
[2024-09-07 08:47:54] [INFO] File "/home/me/anaconda3/envs/fluxgym/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
[2024-09-07 08:47:54] [INFO] simple_launcher(args)
[2024-09-07 08:47:54] [INFO] File "/home/me/anaconda3/envs/fluxgym/lib/python3.11/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
[2024-09-07 08:47:54] [INFO] raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
[2024-09-07 08:47:54] [INFO] subprocess.CalledProcessError: Command '['/home/me/anaconda3/envs/fluxgym/bin/python3.11', 'sd-scripts/flux_train_network.py', '--pretrained_model_name_or_path', '/home/me/ai/fluxgym/models/unet/flux1-dev.sft', '--clip_l', '/home/me/ai/fluxgym/models/clip/clip_l.safetensors', '--t5xxl', '/home/me/ai/fluxgym/models/clip/t5xxl_fp16.safetensors', '--ae', '/home/me/ai/fluxgym/models/vae/ae.sft', '--cache_latents_to_disk', '--save_model_as', 'safetensors', '--sdpa', '--persistent_data_loader_workers', '--max_data_loader_n_workers', '2', '--seed', '42', '--gradient_checkpointing', '--mixed_precision', 'bf16', '--save_precision', 'bf16', '--network_module', 'networks.lora_flux', '--network_dim', '4', '--optimizer_type', 'adamw8bit', '--learning_rate', '1e-4', '--cache_text_encoder_outputs', '--cache_text_encoder_outputs_to_disk', '--fp8_base', '--highvram', '--max_train_epochs', '16', '--save_every_n_epochs', '4', '--dataset_config', '/home/me/ai/fluxgym/dataset.toml', '--output_dir', '/home/me/ai/fluxgym/outputs', '--output_name', 'adiv1', '--timestep_sampling', 'shift', '--discrete_flow_shift', '3.1582', '--model_prediction_type', 'raw', '--guidance_scale', '1.0', '--loss_type', 'l2']' returned non-zero exit status 1.
[2024-09-07 08:47:54] [ERROR] Command exited with code 1
[2024-09-07 08:47:54] [INFO] Runner: <LogsViewRunner nb_logs=161 exit_code=1>
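
The progress line in the log above already reports `avr_loss=nan` at step 150, just before the cuBLAS error. Whether the NaN loss is related to the crash is not established in this thread, but it is easy to check for. Below is a minimal sketch that scans log lines for a NaN loss; the helper name `find_nan_loss` and the regular expression are illustrative assumptions and not part of fluxgym or sd-scripts.

```python
import math
import re

# Matches the loss field of a tqdm-style progress line,
# e.g. "avr_loss=nan" or "avr_loss=0.344" (field name taken from the logs above).
LOSS_PATTERN = re.compile(r"avr_loss=([^\s\],]+)")


def find_nan_loss(log_lines):
    """Return the first line whose avr_loss is NaN, or None if no such line exists."""
    for line in log_lines:
        match = LOSS_PATTERN.search(line)
        if match is None:
            continue  # not a progress line
        try:
            loss = float(match.group(1))
        except ValueError:
            continue  # loss field present but not a number
        if math.isnan(loss):
            return line
    return None


# Usage: feed it the saved training log, one line per element.
# with open("train.log") as f:          # hypothetical file name
#     print(find_nan_loss(f))
```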

CRCODE22 commented 1 week ago

I have a similar error with an Nvidia RTX 4060 Ti (16GB VRAM):

K:\Users\CRCODE22\pinokio\api\fluxgym.git\env\lib\site-packages\torch\autograd\graph.py:818: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cudnn\MHA.cpp:672.)
[2024-09-07 11:23:05] [INFO] return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass

error.txt

tanuki-create commented 1 week ago

I followed the official sd-scripts README and reinstalled PyTorch 2.4.0 using the steps in the instructions below. I think the error is gone.

https://github.com/kohya-ss/sd-scripts/tree/sd3?tab=readme-ov-file#flux1-training-wip

FLUX.1 training (WIP) This feature is experimental. The options and the training script may change in the future. Please let us know if you have any idea to improve the training.

Please update PyTorch to 2.4.0. We have tested with torch==2.4.0 and torchvision==0.19.0 with CUDA 12.4. We also updated accelerate to 0.33.0 just to be safe. requirements.txt is also updated, so please update the requirements.

The command to install PyTorch is as follows: pip3 install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124
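
To confirm which build actually ended up in the environment, a small check along these lines may help. It is a sketch only: the tested versions are copied from the README text quoted above, the function names are made up for illustration, and the version string is read with the standard-library `importlib.metadata` so that torch itself does not need to be imported.

```python
import re
from importlib import metadata

# Versions the sd-scripts README (quoted above) says were tested.
TESTED = {"torch": "2.4.0", "torchvision": "0.19.0"}


def release_of(version):
    """Return the leading X.Y.Z part of a version string, or None."""
    match = re.match(r"\d+\.\d+\.\d+", version)
    return match.group(0) if match else None


def check(package, installed):
    """Describe how an installed version compares with the tested one."""
    expected = TESTED[package]
    if ".dev" in installed:
        return f"{package} {installed}: nightly/dev build, README tested {expected}"
    if release_of(installed) != expected:
        return f"{package} {installed}: differs from tested {expected}"
    return f"{package} {installed}: matches tested {expected}"


if __name__ == "__main__":
    for name in TESTED:
        try:
            print(check(name, metadata.version(name)))
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed in this environment")
```

Run it with the same Python interpreter that fluxgym uses (the `env` or conda environment), otherwise it reports on a different installation.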

afrofail commented 1 week ago

The install steps on the front-page README say to install the nightly build, which is a dev build of torch 2.5.0, not 2.4.0.

I just downgraded Torch and Torchvision to 2.4.0 and 0.19.0.

Now I'm getting a new error:

[2024-09-07 03:20:06] [INFO] F:\fluxgym\env\Lib\site-packages\torch\utils\checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
[2024-09-07 03:20:06] [INFO] with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]

CRCODE22 commented 1 week ago

Does that prevent your LoRA training?

afrofail commented 1 week ago

Yes, no training steps ever begin. This is the last output printed, and then it just stalls:

[2024-09-07 03:20:06] [INFO] F:\fluxgym\env\Lib\site-packages\torch\utils\checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
[2024-09-07 03:20:06] [INFO] with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]

CRCODE22 commented 1 week ago

Same error here, and it remains stuck. I hope @cocktailpeanut will fix this soon!

cocktailpeanut commented 1 week ago

I have similar error with Nvidia RTX 4060 TI 16GB VRAM:

K:\Users\CRCODE22\pinokio\api\fluxgym.git\env\lib\site-packages\torch\autograd\graph.py:818: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cudnn\MHA.cpp:672.)
[2024-09-07 11:23:05] [INFO] return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass

I think this is a warning and not an error. Unless the process crashes it is still going.

I see this warning message too, but the training works fine. The output is slow to update because the script seems to print progress only after each epoch, not after each step.

This means it takes quite some time before the first progress is printed on the screen. So unless the program completely crashes and you don't see any VRAM usage in Task Manager, keep it running and see if it updates.
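
If Task Manager is hard to read, the same check can be done from a script. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the function names are illustrative, and the parser assumes every field is numeric (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import subprocess

# Ask nvidia-smi for used memory (MiB) and GPU utilization (%) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV output into a list of (memory_used_mib, utilization_percent)."""
    stats = []
    for line in csv_text.strip().splitlines():
        memory_used, utilization = (field.strip() for field in line.split(","))
        stats.append((int(memory_used), int(utilization)))
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return one tuple per GPU."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(output)


# Usage while training is (supposedly) running:
# for memory_used, utilization in read_gpu_stats():
#     print(f"{memory_used} MiB in use, {utilization}% utilization")
```

Memory in use with utilization well above zero suggests the job is still training even though no new progress line has appeared yet.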

cocktailpeanut commented 1 week ago

Same here, on 4090 NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 OS: Ubuntu 22.10 Brand new conda environment with Python 3.11.9 Torch version: 2.5.0.dev20240906+cu121

Seems to have crashed following 150 steps of generation

This is a different problem than the others because this one actually fails and crashes.

I have experienced this a few times too, and I don't know exactly when it happens. But when I run the exact same thing one more time, it works. So I've been assuming it's some edge case in the script.

The good news is it at least was training for a bit before it crashed, which is way better than it just failing to begin with.

So I recommend just trying one more time to see if it works. Let me know how it goes.
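
For anyone launching the training command by hand rather than through the UI, the "just try again" advice can be scripted. This is only a sketch: `run_with_retry` is a made-up helper, and the command to pass in is whatever training command appears in your own log.

```python
import subprocess
import sys


def run_with_retry(cmd, max_attempts=2):
    """Run cmd, rerunning it if it exits non-zero.

    Returns (attempts_used, last_exit_code).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            break  # succeeded, no need to retry
    return attempt, result.returncode


# Usage with a stand-in command; replace it with the logged training command.
# attempts, code = run_with_retry([sys.executable, "-c", "print('training...')"])
```

A blind retry only makes sense for the intermittent crash described here; it will not help if the run fails the same way every time.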

cocktailpeanut commented 1 week ago

Oh also, by the way everyone on this thread, I pushed a fix a few hours ago for customized training https://github.com/cocktailpeanut/fluxgym/commit/a118913e3b18c91d143ec334662cca3ccd1859a4

Basically, if you were trying anything other than the default config, it was not actually picking up those settings. For example, if you tried the 12G or 16G VRAM option, it may have just been using the default 20G VRAM option.

So just to make sure, try pulling in the changes and retry.

CRCODE22 commented 1 week ago

Thank you, I am doing a clean install and will report back on whether it works.

CRCODE22 commented 1 week ago

@cocktailpeanut it works now, thank you.

working.txt

| 256/4096 [19:46<4:56:36, 4.63s/it, avr_loss=0.344]
[2024-09-07 14:58:26] [INFO] epoch 2/16

afrofail commented 1 week ago

Hey, thank you for the prompt response and for trying to help troubleshoot this issue. I followed your instructions. Training has been running for over 30 minutes and nothing has changed: Task Manager shows no GPU usage for the Command Prompt process. I'm unsure how to proceed. As I mentioned, I have reinstalled this multiple times, step by step, and still get this output. The command prompt itself only shows the Florence 2 captioning as its last output; I don't know whether training output should appear there as well. I tried both PowerShell and CMD.

[2024-09-07 17:30:25] [INFO] steps: 0%| | 0/3200 [00:00<?, ?it/s]2024-09-07 17:30:25 INFO unet dtype: train_network.py:1046
[2024-09-07 17:30:25] [INFO] torch.float8_e4m3fn, device:
[2024-09-07 17:30:25] [INFO] cuda:0
[2024-09-07 17:30:25] [INFO] INFO text_encoder [0] dtype: train_network.py:1052
[2024-09-07 17:30:25] [INFO] torch.float8_e4m3fn, device:
[2024-09-07 17:30:25] [INFO] cuda:0
[2024-09-07 17:30:25] [INFO] INFO text_encoder [1] dtype: train_network.py:1052
[2024-09-07 17:30:25] [INFO] torch.bfloat16, device: cpu
[2024-09-07 17:30:26] [INFO]
[2024-09-07 17:30:26] [INFO] epoch 1/16
[2024-09-07 17:30:34] [INFO] 2024-09-07 17:30:34 INFO epoch is incremented. train_util.py:668
[2024-09-07 17:30:34] [INFO] current_epoch: 0, epoch: 1
[2024-09-07 17:30:34] [INFO] 2024-09-07 17:30:34 INFO epoch is incremented. train_util.py:668
[2024-09-07 17:30:34] [INFO] current_epoch: 0, epoch: 1
[2024-09-07 17:30:43] [INFO] F:\fluxgym\env\Lib\site-packages\torch\utils\checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
[2024-09-07 17:30:43] [INFO] with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]

afrofail commented 1 week ago

After a "git pull" of the latest update, a few more lines have been added to this output. Hopefully this helps.

[2024-09-07 18:12:47] [INFO] steps: 0%| | 0/3200 [00:00<?, ?it/s]2024-09-07 18:12:47 INFO unet dtype: train_network.py:1046
[2024-09-07 18:12:47] [INFO] torch.float8_e4m3fn, device:
[2024-09-07 18:12:47] [INFO] cuda:0
[2024-09-07 18:12:47] [INFO] INFO text_encoder [0] dtype: train_network.py:1052
[2024-09-07 18:12:47] [INFO] torch.float8_e4m3fn, device:
[2024-09-07 18:12:47] [INFO] cuda:0
[2024-09-07 18:12:47] [INFO] INFO text_encoder [1] dtype: train_network.py:1052
[2024-09-07 18:12:47] [INFO] torch.bfloat16, device: cpu
[2024-09-07 18:12:47] [INFO]
[2024-09-07 18:12:47] [INFO] epoch 1/16
[2024-09-07 18:12:50] [INFO] F:\fluxgym\env\Lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
[2024-09-07 18:12:50] [INFO] torch.utils._pytree._register_pytree_node(
[2024-09-07 18:12:50] [INFO] F:\fluxgym\env\Lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
[2024-09-07 18:12:50] [INFO] torch.utils._pytree._register_pytree_node(
[2024-09-07 18:12:54] [INFO] F:\fluxgym\env\Lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
[2024-09-07 18:12:54] [INFO] torch.utils._pytree._register_pytree_node(
[2024-09-07 18:12:54] [INFO] F:\fluxgym\env\Lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
[2024-09-07 18:12:54] [INFO] torch.utils._pytree._register_pytree_node(
[2024-09-07 18:12:54] [INFO] 2024-09-07 18:12:54 INFO epoch is incremented. train_util.py:668
[2024-09-07 18:12:54] [INFO] current_epoch: 0, epoch: 1
[2024-09-07 18:12:54] [INFO] 2024-09-07 18:12:54 INFO epoch is incremented. train_util.py:668
[2024-09-07 18:12:54] [INFO] current_epoch: 0, epoch: 1
[2024-09-07 18:13:03] [INFO] F:\fluxgym\env\Lib\site-packages\torch\utils\checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
[2024-09-07 18:13:03] [INFO] with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]