bmaltais / kohya_ss

Apache License 2.0
9.53k stars 1.23k forks source link

Training Ultra Slow with more than 1 GPU - very likely affecting all users with more than 1 GPU #2366

Open SteVoit opened 6 months ago

SteVoit commented 6 months ago

Hi there, I am running on 4x RTX 4090, and as soon as I use more than 1 GPU the training gets super slow with the newer scripts, starting from 22.x.x.

I think there is a general problem in kohya with multi-GPU. I tested three versions: 24.x.x, 22.x.x, and 21.x.x, on the same machine, same CUDA version (11.8), same GPU driver (536.23), same GPUs (4x 4090). I have also tested different driver versions (520 as suggested by CUDA 11.8, plus 535 and 550 on Ubuntu). On Windows I have tested with different versions as well (including more recent ones with "Fallback to System RAM" disabled). kohya 21 takes 1:50; kohya 22 and 24 take ~28:00. That is about 15 times slower.

I tested under Ubuntu 22.04, Ubuntu 20.04, and Windows 10. No matter what I do, I cannot get the speeds back to those of 21. I also tested with CUDA 12 and CUDA 11 - same issue. Anyone got any ideas?

I even tested on 2 systems, one with 4 GPUs and one with 2 GPUs, one Intel and one AMD, and tested both gloo and NCCL, so I am quite sure that everybody will run into this issue if they use more than 1 GPU.

I have backported the current version 24.x to the same requirements as 21 (torch 2.0.1, and so on), where I get the good speed, but no luck.

What is also confusing me a lot is that caching latents takes about 10x longer on the 22 and 24 kohya.

[attached screenshots: kohya_21, kohya_21_1, kohya_22, kohya_22_1, kohya_24, kohya_24_1]

bmaltais commented 6 months ago

The sd-scripts from kohya have been updated a lot recently... The version used in v21.x is probably the reason... One thing you could try is to install v24.x but manually change the sd-scripts version by doing:

cd sd-scripts
git checkout v0.8.3
cd ..

and then run the GUI... Might bring you back to a speedy state under sd-scripts v0.8.3 release

SteVoit commented 6 months ago

Thanks for the feedback. I have tested this, and apparently only sd-scripts versions 0.8 and above seem to work with the web UI; however, the version in 21.8 seems to be 0.6 or 0.7, which didn't work with the web UI v24. So without reworking the web UI, I doubt that this is really a solution.

FYI, the issue was introduced from version 22.3.1 -> 22.4.0: 22.3.1 works just fine, 22.4.0 is really slow. Maybe that helps getting down to the core issue?

any other ideas?

bmaltais commented 6 months ago

kohya introduced significant changes to multi-GPU from one version to the next... so it is quite possible it broke something.

Did you try to set the accelerate parameters for multi-gpu under the v24 Accelerate accordion? This is where you configure multi-gpu for the training:

[screenshot of the Accelerate launch accordion in the GUI]

SteVoit commented 6 months ago

Yes, I set it to bf16, 4 processes, 1 machine, enabled multi-GPU, and listed the 4 GPUs: 0,1,2,3, but got the same result in speed. That's kind of how I started.
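
For reference, those same settings can also live in accelerate's own config file. This is a rough sketch of what a 4-GPU bf16 setup might look like there (field values are illustrative for my setup, not a verified dump):

```yaml
# Sketch of ~/.cache/huggingface/accelerate/default_config.yaml
# (illustrative values; adjust num_processes / gpu_ids to your machine)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 4
gpu_ids: 0,1,2,3
```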

Also, in multi-GPU there is an issue in the script https://github.com/kohya-ss/sd-scripts/blob/bfb352bc433326a77aca3124248331eb60c49e8c/library/train_util.py, line 4427, which needs to change from

    encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)

to

    encoder_hidden_states = text_encoder.module.text_model.final_layer_norm(encoder_hidden_states)

However, for some reason the original line works in a single-GPU run.

But that is just a side issue.
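
For context, this looks like the usual DDP wrapping behavior: once accelerate wraps a model in DistributedDataParallel (multi-GPU only), the original attributes are reachable via `.module`, which is why the unmodified line still works single-GPU. A minimal sketch of a defensive unwrap pattern that covers both cases (the classes here are stand-ins, not the actual sd-scripts objects):

```python
# Sketch: unwrap a possibly-DDP-wrapped model before touching its attributes.
# `Wrapper` stands in for torch.nn.parallel.DistributedDataParallel.

class TextModel:
    def final_layer_norm(self, x):
        return x  # placeholder for the real layer norm


class TextEncoder:
    def __init__(self):
        self.text_model = TextModel()


class Wrapper:
    """Mimics DDP: the real model is reachable only via .module."""
    def __init__(self, module):
        self.module = module


def unwrap(model):
    # DDP exposes the wrapped model as .module; bare models have no such attr.
    return getattr(model, "module", model)


# Works for both the wrapped (multi-GPU) and bare (single-GPU) case:
encoder = TextEncoder()
assert unwrap(encoder) is encoder
assert unwrap(Wrapper(encoder)) is encoder
```

With a pattern like this, the same line of code can run unchanged whether or not accelerate has wrapped the text encoder.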

I have spent the last week or two trying to resolve the speed issue locally with different versions of CUDA/driver and so on, but by now I am quite sure it is a script issue.

It feels like it got broken when the multi-GPU UI got introduced and moved away from the accelerate config.

I am happy to test whatever you want if it helps solve this.

bmaltais commented 6 months ago

I think you need to discuss this directly in the sd-scripts author's repo. He is the one who can help you with this problem at this point. The good thing is that v24 uses a toml config file, which kohya will appreciate, along with the run command to load the toml, including the accelerate parameters... so this should help him troubleshoot this issue, or others in his repo:

https://github.com/kohya-ss/sd-scripts/issues

bmaltais commented 6 months ago

Another possibility is that the new version of the accelerate module is causing the difference…

You can also run accelerate config manually to try to set things the way you want… but this is essentially what the GUI does by feeding the parameters at the CLI.
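
As a sketch of what that looks like on the command line (the accelerate flags are from its CLI; the script name and toml path after it are illustrative):

```shell
# One-time interactive setup (writes accelerate's default_config.yaml):
accelerate config

# Or pass the equivalent parameters directly at launch, as the GUI does:
accelerate launch --multi_gpu --num_processes=4 --num_machines=1 \
    --mixed_precision=bf16 train_network.py --config_file=config.toml
```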

SteVoit commented 6 months ago

@bmaltais I have already tried to port back dependencies (for example, I have used 100% identical requirements in 23.x that work in 21/22.3) with no luck. I also ran accelerate config manually as well. I really think this is an issue in the scripts themselves.

littleyeson commented 6 months ago

> Hi there, i am running on 4x RTX4090 and as soon as i use more than 1 GPU the training […] [quotes the full original report above]

How can I use multi-GPU training on Windows? When I click to enable Multi-GPU and set 2 GPU processes, it always displays an NCCL error. But when I search, NCCL just doesn't have a Windows version.

SteVoit commented 5 months ago

@littleyeson

Windows currently does not support NCCL, so you need to switch the backend to gloo.

Modify train_util.py so the Accelerator is created with the gloo backend:

    kwargs_handlers = (
        None
        if args.ddp_timeout is None
        else [InitProcessGroupKwargs(timeout=datetime.timedelta(minutes=args.ddp_timeout))]
    )
    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        mixed_precision=args.mixed_precision,
        log_with=log_with,
        project_dir=logging_dir,
        kwargs_handlers=[InitProcessGroupKwargs(backend="gloo")],
    )
    return accelerator

I recommend using version 23.03 of kohya at the moment, as the multi-GPU speed in later versions is really bad, as you can see from the above.
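
The backend choice could also be made conditionally, so the same code picks NCCL where it is supported and falls back to gloo on Windows. A minimal sketch of that selection logic, independent of accelerate itself (the helper name is mine, not from sd-scripts):

```python
import platform


def pick_ddp_backend(system=None):
    """Pick a torch.distributed backend name.

    NCCL has no Windows build, so multi-GPU runs there must fall back to gloo.
    """
    system = system or platform.system()
    return "gloo" if system == "Windows" else "nccl"


# The chosen name would then be passed to accelerate, e.g. (sketch):
#   InitProcessGroupKwargs(backend=pick_ddp_backend())
print(pick_ddp_backend("Windows"))  # gloo
print(pick_ddp_backend("Linux"))    # nccl
```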

littleyeson commented 5 months ago

> windows currently does not support NCCL so you need to switch the backend to gloo […] [quotes the full previous reply above]

Sorry, I cannot find that file. It is not in /sd-scripts or venv, and I cannot find it anywhere in kohya_ss.

SteVoit commented 5 months ago

@littleyeson https://github.com/kohya-ss/sd-scripts/blob/bfb352bc433326a77aca3124248331eb60c49e8c/library/train_util.py

littleyeson commented 5 months ago

@littleyeson https://github.com/kohya-ss/sd-scripts/blob/bfb352bc433326a77aca3124248331eb60c49e8c/library/train_util.py

Thanks for your reply, but there is still a problem. I tried to modify all of this [screenshot], and there is an error displayed like this [screenshot]. Then I added only that one line at the end of the function [screenshot], and there is another error [screenshot].

Baku-Rue commented 5 months ago

@SteVoit Have you figured anything out on this issue, and what was the version you found that still works for training without the slowdown you were seeing? I have been using a multi-GPU setup with two 4090s and ran into the same issue when I found this thread. I wanted to try to revert to a version that was not affected, if that is even possible. I tried v23.03 but was still seeing the same speeds as the current branch. Thanks.