cocktailpeanut / fluxgym

Dead simple FLUX LoRA training UI with LOW VRAM support

Supported Configurations for Dual Graphics Card (Dual GPU) Hosts #98

Open Endygao25 opened 1 month ago

Endygao25 commented 1 month ago

Hey bro~ Sorry, my problem is a bit specific: my local machine has dual 4090 graphics cards. After I successfully installed and deployed fluxgym, set the relevant parameters, and clicked start training, it threw an error which, judging from the messages, I think comes from it detecting more than one GPU. I also consulted ChatGPT about this, but since I'm a code noob I really can't finish the debugging by myself. The backend output is as follows:

```
[2024-09-17 14:55:15] [INFO] Running F:\pinokio\api\fluxgym.git\outputs\fog\train.bat
[2024-09-17 14:55:15] [INFO]
[2024-09-17 14:55:15] [INFO] (env) (base) F:\pinokio\api\fluxgym.git>accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 sd-scripts/flux_train_network.py --pretrained_model_name_or_path "F:\pinokio\api\fluxgym.git\models\unet\flux1-dev.sft" --clip_l "F:\pinokio\api\fluxgym.git\models\clip\clip_l.safetensors" --t5xxl "F:\pinokio\api\fluxgym.git\models\clip\t5xxl_fp16.safetensors" --ae "F:\pinokio\api\fluxgym.git\models\vae\ae.sft" --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 --network_module networks.lora_flux --network_dim 4 --optimizer_type adamw8bit --sample_prompts="F:\pinokio\api\fluxgym.git\outputs\fog\sample_prompts.txt" --sample_every_n_steps="400" --learning_rate 8e-4 --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base --highvram --max_train_epochs 16 --save_every_n_epochs 4 --dataset_config "F:\pinokio\api\fluxgym.git\outputs\fog\dataset.toml" --output_dir "F:\pinokio\api\fluxgym.git\outputs\fog" --output_name fog --timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1 --loss_type l2
[2024-09-17 14:55:18] [INFO] The following values were not passed to accelerate launch and had defaults used instead:
[2024-09-17 14:55:18] [INFO] --num_processes was set to a value of 2
[2024-09-17 14:55:18] [INFO] More than one GPU was found, enabling multi-GPU training.
[2024-09-17 14:55:18] [INFO] If this was unintended please pass in --num_processes=1.
[2024-09-17 14:55:18] [INFO] --num_machines was set to a value of 1
[2024-09-17 14:55:18] [INFO] --dynamo_backend was set to a value of 'no'
[2024-09-17 14:55:18] [INFO] To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
[2024-09-17 14:55:18] [INFO] W0917 14:55:18.961000 20576 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-09-17 14:55:21] [INFO] Traceback (most recent call last):
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\bin\miniconda\lib\runpy.py", line 196, in _run_module_as_main
[2024-09-17 14:55:21] [INFO]     return _run_code(code, main_globals, None,
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\bin\miniconda\lib\runpy.py", line 86, in _run_code
[2024-09-17 14:55:21] [INFO]     exec(code, run_globals)
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\Scripts\accelerate.exe\__main__.py", line 7, in <module>
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
[2024-09-17 14:55:21] [INFO]     args.func(args)
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\launch.py", line 1097, in launch_command
[2024-09-17 14:55:21] [INFO]     multi_gpu_launcher(args)
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\launch.py", line 734, in multi_gpu_launcher
[2024-09-17 14:55:21] [INFO]     distrib_run.run(args)
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\run.py", line 910, in run
[2024-09-17 14:55:21] [INFO]     elastic_launch(
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\launcher\api.py", line 138, in __call__
[2024-09-17 14:55:21] [INFO]     return launch_agent(self._config, self._entrypoint, list(args))
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\launcher\api.py", line 260, in launch_agent
[2024-09-17 14:55:21] [INFO]     result = agent.run()
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 137, in wrapper
[2024-09-17 14:55:21] [INFO]     result = f(*args, **kwargs)
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 696, in run
[2024-09-17 14:55:21] [INFO]     result = self._invoke_run(role)
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 849, in _invoke_run
[2024-09-17 14:55:21] [INFO]     self._initialize_workers(self._worker_group)
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 137, in wrapper
[2024-09-17 14:55:21] [INFO]     result = f(*args, **kwargs)
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 668, in _initialize_workers
[2024-09-17 14:55:21] [INFO]     self._rendezvous(worker_group)
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 137, in wrapper
[2024-09-17 14:55:21] [INFO]     result = f(*args, **kwargs)
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 500, in _rendezvous
[2024-09-17 14:55:21] [INFO]     rdzv_info = spec.rdzv_handler.next_rendezvous()
[2024-09-17 14:55:21] [INFO]   File "F:\pinokio\api\fluxgym.git\env\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 67, in next_rendezvous
[2024-09-17 14:55:21] [INFO]     self._store = TCPStore(  # type: ignore[call-arg]
[2024-09-17 14:55:21] [INFO] RuntimeError: use_libuv was requested but PyTorch was build without libuv support
[2024-09-17 14:55:21] [ERROR] Command exited with code 1
[2024-09-17 14:55:21] [INFO] Runner:
```
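Regarding the RuntimeError at the end of the log: recent PyTorch releases default to a libuv-backed TCPStore for distributed rendezvous, which Windows wheels are often built without. A commonly reported workaround (an assumption here, not confirmed in this thread — check against your installed PyTorch version) is to disable libuv via an environment variable before the launch, e.g. near the top of train.bat:

```
REM Assumption: this PyTorch build honors the USE_LIBUV switch
set USE_LIBUV=0
```

This only addresses the libuv error itself; the multi-GPU detection still needs --num_processes=1 as discussed below in the thread.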

cousined1 commented 1 month ago

I have the same issue running both a 4070 and a 4060 Ti. It worked yesterday; since the update I am no longer able to apply the fix from a previously closed issue: "Thanks to @morpheus on Discord, we solved this. It's specific to systems with dual GPUs. In order for the training to run on a multi-GPU system on Windows, click on the ^ after the first accelerate launch, shift-enter, and add

--num_processes=1^

before the line about mixed_precision bf16. Training should now run."
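Applied to the launch command from the log above, the start of the edited call would look roughly like this (a sketch; all remaining flags stay unchanged):

```
accelerate launch ^
  --num_processes=1 ^
  --mixed_precision bf16 ^
  --num_cpu_threads_per_process 1 ^
  sd-scripts/flux_train_network.py ^
  ...
```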

idouglas11 commented 1 month ago

I'm also experiencing this issue; the train script field is not editable, so I cannot apply the --num_processes=1 fix.
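If the field in the UI is locked, the generated batch file can be patched on disk instead. Below is a hypothetical helper (not part of fluxgym; the function name and the train.bat path from the log are just illustrative) that inserts --num_processes=1 directly after the first "accelerate launch" if the flag is not already present:

```python
from pathlib import Path

def patch_train_bat(path: str) -> str:
    """Insert --num_processes=1 after 'accelerate launch' in a train.bat, if missing."""
    bat = Path(path)
    text = bat.read_text()
    if "--num_processes" not in text:
        # Only the first occurrence is the launch command itself.
        text = text.replace("accelerate launch",
                            "accelerate launch --num_processes=1", 1)
        bat.write_text(text)
    return text
```

Usage would be something like `patch_train_bat(r"F:\pinokio\api\fluxgym.git\outputs\fog\train.bat")`, run after fluxgym writes the script but before training starts; note fluxgym may regenerate the file on each run.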

Juqowel commented 1 month ago

You can try a special launch method that exposes only one GPU to the process. Example (for Windows):

```
env\Scripts\activate
set CUDA_VISIBLE_DEVICES=0 & python app.py
```

Edit: I guess it's no longer relevant after the last update, but you might need to specify the GPU number in start.js now.
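As a general note (not specific to fluxgym), the same restriction can be applied from Python, as long as it happens before torch or any other CUDA library is initialized; a minimal sketch:

```python
import os

# Must be set before torch/CUDA is initialized, or it has no effect.
# "0" makes only the first GPU visible to this process, so accelerate
# will not detect a second device and default to multi-GPU training.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```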