bmaltais / kohya_ss

Apache License 2.0

SDXL Lora training Extremely slow on Rtx 4090 #1288

Closed etha302 closed 1 year ago

etha302 commented 1 year ago

As the title says, training a LoRA for SDXL on a 4090 is painfully slow. It needs at least 15-20 seconds to complete a single step, so it is impossible to train. I don't know whether I am doing something wrong, but here are screenshots of my settings. It is also using the full 24 GB of VRAM, yet it is so slow that the GPU fans are not even spinning.

underrealSKY commented 1 year ago

@underrealSKY follow my guide

people are able to get 1-1.5 s / it

https://youtu.be/sBFGitIvD2A

sounds like a fun afternoon, thanks :)

sepro commented 1 year ago

Reduce the network dim to 128. That was the issue when I tried to train a LoRA with 16 GB.
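
(In the command line that the GUI generates, this setting is the --network_dim flag, as seen in the full commands posted later in this thread. Only the relevant flag is shown here, not a complete command.)

```
--network_dim=128
```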


FurkanGozukara commented 1 year ago

free kaggle account speed

image

underrealSKY commented 1 year ago

Reduce the network dim to 128. That was the issue when I tried to train a LoRA with 16 GB.

This is actually quite an interesting case. I lowered it to 20 and also filled in all the other settings as shown by Furkan in his video. I even tried the low-VRAM json, but one iteration still took between 50-90 seconds. However, I also had another console open with the A1111 web service running, although I was not generating anything. I have now closed it and rerun everything (network dim = 20 and the rest according to Furkan), and now my outcome is: | 191/5200 [04:02<1:45:54, 1.27s/it, loss=0.114]

bobvanderlinden commented 1 year ago

Thanks @synystersocks, that worked perfectly. It might be nice to have this as a 'default' SDXL preset. Currently it's unclear which preset to use, and I think the one I was using last time was too heavy (or unoptimized) for my machine.

139/2640 [04:55<1:28:28,  2.12s/it, loss=0.119]

On an RTX 4070.

bolli20000 commented 1 year ago

Hi everyone, my 4090 currently needs ca. 160 s/it. Changing the LoRA parameters did not help. Such a shame. Is there a solution yet? To which older version should I downgrade the Nvidia driver?

Screenshot 2023-09-04 110648 Screenshot 2023-09-04 122046

Thanks in advance, best regards, bolli.

DarkAlchy commented 1 year ago

I just had this happen to me, but not to your extent. I had done a Dreambooth training yesterday, so when I ran a new one with the same settings and saw the slowdown, I did a facepalm: I had just updated the drivers to the latest, and that is a no-no. Go back to 531.61 or 531.79, but nothing newer. Am I where I should be compared to a 3090? Nope, but instead of spending days a training now takes me about 45 minutes.

bolli20000 commented 1 year ago

4090, big improvement now: 5 sec/it. I added --network_train_unet_only to the additional parameters and set Gradient checkpointing to true.

Hi together, my 4090 currently needs ca. 160s/it. Change of Lora Parameters did not help. Such a shame. Is there already a solution around ? Downgrade Nvidia driver to which older version ?
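
(The two changes mentioned above correspond to flags like the following in the generated training command - the same flags that appear in the full commands posted later in this thread. Only the relevant flags are shown, not a complete command.)

```
--network_train_unet_only --gradient_checkpointing
```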

etha302 commented 1 year ago

4090 Big Improvement now: 5 sec/it. I changed additional Parameters to --network_train_unet_only and checked: Gradient checkpointing to true.

Hi together, my 4090 currently needs ca. 160s/it. Change of Lora Parameters did not help. Such a shame. Is there already a solution around ? Downgrade Nvidia driver to which older version ?

This is still laughably slow for a 4090! I managed to get 1 it/s on a friend's 3090, so I don't know what is going on with the 40-series cards, drivers, or whatever. And nobody is doing anything about it.

DarkAlchy commented 1 year ago

4090 Big Improvement now: 5 sec/it. I changed additional Parameters to --network_train_unet_only and checked: Gradient checkpointing to true.

Hi together, my 4090 currently needs ca. 160s/it. Change of Lora Parameters did not help. Such a shame. Is there already a solution around ? Downgrade Nvidia driver to which older version ?

Sounds like you are on Windows. Change your Nvidia drivers to nothing later than 531.79 (I am not 100% sure that is the last good one, since Linux uses 535, but I get the best results from the 53x series on Windows) and that should roughly halve that.
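
(One quick way to confirm which driver is currently installed before rolling back: nvidia-smi, which ships with the driver, can print the version. A sketch; run it from any console.)

```
nvidia-smi --query-gpu=driver_version,name --format=csv
```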

DarkAlchy commented 1 year ago

4090 Big Improvement now: 5 sec/it. I changed additional Parameters to --network_train_unet_only and checked: Gradient checkpointing to true.

Hi together, my 4090 currently needs ca. 160s/it. Change of Lora Parameters did not help. Such a shame. Is there already a solution around ? Downgrade Nvidia driver to which older version ?

This is still laughably slow for a 4090! I managed to get 1it/s on friends 3090, so idk what is going on with 40 series cards, drivers or whatever. And why nobody is doing nothing about it..

Agreed, and it pisses me the eff off too. 1700 USD for no speed increase, or even a slowdown, is BS. Hardware-wise, for training, a 4090 should be 57% faster than a 3090; that is a flat-out fact. I know Nvidia is largely to blame with their current drivers, which they say they are going to fix, but honestly I think there is more to this than just that.

FurkanGozukara commented 1 year ago

4090 Big Improvement now: 5 sec/it. I changed additional Parameters to --network_train_unet_only and checked: Gradient checkpointing to true.

Hi together, my 4090 currently needs ca. 160s/it. Change of Lora Parameters did not help. Such a shame. Is there already a solution around ? Downgrade Nvidia driver to which older version ?

This is still laughably slow for a 4090! I managed to get 1it/s on friends 3090, so idk what is going on with 40 series cards, drivers or whatever. And why nobody is doing nothing about it..

Agreed, and it pisses me the eff off too. 1700 USD and no speed increase, or slower, is BS. Hardware wise for training a 4090 should be 57% faster than a 3090. That is flat out facts. I know Nvidia is a lot to blame with their current drivers they say they are going to fix, but honestly I think there is more to this than just that.

I think it is about Nvidia

I get 1 it/s with an RTX 3090.

Even faster than that with an RTX 3090 Ti.

By the way, I can connect to RTX 4090 owners' PCs and try to speed up their generation speed - only for my gold and above Patreon subscribers.

DarkAlchy commented 1 year ago

I think it is about Nvidia

Python is partly to blame as well, and Nvidia, I have no doubt, are far too smart to make stupid mistakes like this in their drivers for just the Ada-based cards by accident. I think once Hopper sales slow down they will magically fix it, so we can get that 57% speed increase over a 3090.

synystersocks commented 1 year ago

Hi, I have some additions that may be helpful :D

When training a LoRA of myself I found I have semi-unique facial features (I call this "too sexy for SD", lmao). Multiple guides talk about tokens for training, and one of the SD developers mentioned that unique three-letter words such as "sws" have a specific effect on the model. I also saw many people using the celebrity tag to link data to the LoRA, and then there is the use of regularization images.

After testing I found I only share about 30% of my facial features with the closest celebrity; the celebrity-token method held the data correctly but attached it to the celebrity's facial features, so I didn't look so sexy :P

If you get a high match rate (50%+), the celebrity-tag linking may be a viable method for a decent likeness.

For myself, after testing, I found that using the sws tag helps keep your data abstracted away from any classes like woman, man or person, which is good for keeping the token clean.

Then specify the class in the folder structure, i.e. "20_sws man" (30 images in there). Using Unsplash I gathered about 200 photos of men by themselves (resolution doesn't matter so much, as long as both values aren't more than 3072 px; 4000 x 4000 = not okay, 3072 x 6144 = okay, "I don't know why") and added these to a folder, i.e. "1_man".
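
(The folder layout described above would look roughly like this; the repeat counts and image counts are just the example values from this comment, and the top-level folder names match the --train_data_dir / reg dir used in the command further down.)

```
img/
  20_sws man/    <- 30 training images, 20 repeats, "sws" token + "man" class
reg/
  1_man/         <- ~200 regularization photos of men, 1 repeat
```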

This worked the best and took 2 hours at 2.23 s/it for about 5000 steps, with 32 dim x 1 alpha, min SNR gamma 1, LR 0.0003, constant scheduler. I got really decent results, way better than before.

I also found that the absolute loss value doesn't matter so much; it's the variation from one value to the next that shows the stability of learning. For example, with the settings above I first got a loss of 0.0648 for about 30 seconds while the learning started, then it stabilized around 0.0550, going as low as 0.0530 and as high as 0.0568.

Controlling the loss fluctuations depends on the network dim, alpha, SNR gamma and all three learning rates, and it also depends on the training data set: photos of myself had different loss values than my girlfriend's at the same settings. At 32 dim x 1 alpha and SNR gamma = 1, I would test learning rates between 0.0004 and 0.0001: boot up the training, watch how much the loss fluctuates, then adjust the LR settings until I found a stable value. If your loss is fluctuating by more than 0.01 - e.g. from 0.05 up to 0.06 and then down to 0.04 - that is too much fluctuation and the model is learning too many things, which means the precision for a person will drop because some of the capacity goes to the background.

This is also why the regularization images help: the sws token keeps your new data clean as well as tied to your defined class, and the reg images are a further specifier for the class, increasing the weight towards that class by telling it the class and giving it images of that class.

I will continue to test higher dims and alphas.

To help with Windows training on the RTX 4090, I think several things are causing issues: 1st, disable hardware acceleration - under high CPU loads your CPU uses your GPU to process CPU-based tasks; 2nd, restart your PC; 3rd, don't run the training via the GUI, as this causes some issues (I don't know why - it seems to be more Windows 11 related, though).

Instead, use kohya to set up your dataset and get your loss fluctuations to a more stable level by starting and stopping the training. When you have everything ready and set up to go, copy the command sent to the console when you press start; it should look like this:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="C:/Ai/ComfyUI_windows_portable/ComfyUI/models/checkpoints/sd_xl_base_1.0.safetensors" --train_data_dir="C:/Ai/LoraTraining/LoraTraining/ESet/V2/img" --reg_datadir="C:/Users/nath/Pictures/AiDataSets/RegImgs/Woman/base/reg" --resolution="1024,1024" --output_dir="C:/Ai/LoraTraining/LoraTraining/ESet/V2/model" --logging_dir="C:/Ai/LoraTraining/LoraTraining/ESet/V2/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0003 --unet_lr=0.0003 --network_dim=32 --output_name="sws" --lr_scheduler_num_cycles="6" --cache_text_encoder_outputs --no_half_vae --full_bf16 --learning_rate="0.0003" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="5760" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --seed="0" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --keep_tokens="1" --bucket_reso_steps=64 --min_snr_gamma=1 --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0357 --adaptive_noise_scale=0.00357 --network_train_unet_only

Save it into Notepad first (WordPad messes up the formatting), then carefully remove the extra line breaks until it looks like the example above. When that's done, continue with the steps below.

1st, go to your kohya_ss directory, then into venv, and then into Scripts. 2nd, inside the Scripts folder, click the address bar, type cmd and press Enter; this opens a command prompt in that directory. 3rd, in cmd type activate and press Enter. 4th, in cmd type cd.. and press Enter (cd with two dots - cd..). 5th, do it again, cd.. and Enter; you should now be in the kohya_ss main directory. 6th, paste the command you grabbed earlier into the cmd window and press Enter.
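
(The same steps as console commands - a sketch only; the install path below is an example, and the final command is whatever you copied from the GUI console.)

```
REM example path - use your own kohya_ss install location
cd C:\kohya_ss\venv\Scripts
REM activate the virtual environment
activate
REM go back up two levels to the kohya_ss root
cd ..
cd ..
REM now paste the full "accelerate launch ..." command copied from the GUI console and press Enter
```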

That will start the training and hopefully give a better s/it (or it/s) than before. The settings I posted above are what I now use; I no longer use memory-efficient attention, as it can drop performance a lot.

I hope this info helps :D

Neural networks are like a human brain: we link concepts together to form memory. For example, if I mention the story of "the boy who cried wolf", those who know the story understand it as a powerful analogy for the concept and consequences of lying. In this regard the title of the story is the token, and if you have the data in your dataset you will remember the whole story and the reasoning/meaning behind it. That meaning is the data the story wanted you to learn; the story is just a way to translate a concept into data that can be communicated. The data isn't the concept itself, but it is the data needed to learn the concept, and the name of the story is the token.

Thank you to anyone who read all of this, and I'm so sorry it's so long :D

bolli20000 commented 1 year ago

Thank you very much for the many tips, I will try them out in my next training session...

Hi i have some additions that may be helpful :D.

bolli20000 commented 1 year ago

@synystersocks Thank you, great performance improvement: now close to 1 sec/it. Going from 200 seconds to 1 second is a big win. I still wonder what the main reasons for this improvement are, but for now I'm just enjoying my training...

image

Plus, my GPU VRAM usage stays relatively low...

image

I'll be excited to see the quality of the generated LoRA; just 12 epochs seems low. I set the number of epochs to 37 with max_train_steps of 30,000, and now I get processing times of around 2.3 seconds per step...

OK, performance is now much better, but the resulting LoRA is relatively small at just 166 MB (at ca. 2 sec/it); before, with my standard settings, it was ca. 891 MB (at ca. 5 sec/it). No improvement without a trade-off. How can I get a bigger LoRA with your settings?

synystersocks commented 1 year ago

Increase the rank to increase the size; higher ranks also hold more detail. Keep alpha at 1, and test the LR from 4e-4 to 4e-1 to see which gives the least loss fluctuation; that should increase both size and precision. I'd try rank 64, alpha 1 first, then bump it up to 128 if needed. I would also recommend fewer images rather than more for a person; I find a LoRA made with 15 high-quality images outperforms one made with 100 average images.
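
(In the command-line form used earlier in this thread, that means raising --network_dim - the LoRA file size scales with it - while leaving --network_alpha at 1. Only the relevant flags are shown, not a complete command.)

```
--network_dim=64 --network_alpha=1
```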

2blackbar commented 1 year ago

What a mess. Does anyone have an answer to this? I'm not going to train SDXL for 24 hours on 20 pics, this is horrible.

FurkanGozukara commented 1 year ago

What a mess , anyone have one answer to this ? Aint gonna train sdxl for 24 hours on 20 pics, this is horrible.

I have found a great workflow for SDXL DreamBooth training.

Even without xformers, an RTX 3090 gets around 1.44 seconds/it for full training.

2blackbar commented 1 year ago

I followed your video, Furkan, still no luck, and I'm getting this laughable speed on a 3090: | 4/5200 [01:32<33:27:40, 23.18s/it, loss=0.164]. This is seriously some bad-prank-level shit and the dev is silent. This repo has been like this for 6 months or more, with training broken most of the time; I think it is not even using CUDA at all and just runs on the CPU no matter what we do. Is the non-GUI kohya like this as well? I think it's time to forget about this repo. I installed everything according to the install instructions of this repo, which is running setup.bat after git cloning, and it's not only me; there are probably tons of people who have issues but don't bother to register here and talk about it.

FurkanGozukara commented 1 year ago

I followed your video Furkan, still no luck and im getting this laughable speed on 3090 | 4/5200 [01:32<33:27:40, 23.18s/it, loss=0.164] This is seriously some bad prank level shit and dev is silent, this repo is like this since 6 months or more where most of the time training is broken , i think this is not even using cuda at all and it just runs on cpu no matter what we do. Is non GUI kohya like this as well? I think its time to forget about this repo

I have a 3090 Ti and I am getting much better speeds, like 1.2 it/s for LoRA with optimizations.

This is very, very weird.

I have also used an RTX 3090 on RunPod many times, always working great.

Are you my Patreon supporter? I can connect to your PC and try to solve it.

2blackbar commented 1 year ago

I'd prefer to learn what is happening. You know, it would be much better if this repo shared its venv folder for Windows and other platforms, or had a portable version that is like 2 GB but just works, because this is just too much drama that takes up everyone's time. I'm really tired of this, because this repository has been like this for months, with training plagued by issues most of the time; occasionally there is one version that works nicely, but which version that is now I don't know. I have a good working one for SD 1.5 but not for SDXL, because SDXL has been broken here since the first version that introduced it. What drivers do you have?

FurkanGozukara commented 1 year ago

Id prefere to learn what happens , you know it would be much better if this repo would share venv volder for windows and other platforms or would be portable version which is like 2 GB but it just works cause this is just too much drama that takes up everyones time .

Sharing a venv folder is really hard.

For example, if I share mine it won't work on yours.

It requires a special kind of installation. I tried once but couldn't make it work :D

2blackbar commented 1 year ago

Why not? How come? There are tons of portable builds with Python venvs inside and they do work; it's self-contained. I have the Python version required for Auto1111 and it's not the same one required for this repo, which is BS on its own; they should match. I'm 100000% sure that my install is training on the CPU, that's why the speeds are laughable. I'm going to activate the venv and test for CUDA... on the other hand, xformers would throw an error because it won't work without CUDA... So CUDA is probably there, but it just trains on the CPU because why the F not. OK, CUDA is installed, so now, why the hell is it choosing to train on the CPU? nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2018 NVIDIA Corporation Built on Sat_Aug_25_21:08:04_Central_Daylight_Time_2018 Cuda compilation tools, release 10.0, V10.0.130
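
(The system nvcc version above says little about the training venv; a quicker check is to ask torch inside the activated kohya_ss venv whether it can see the GPU. A sketch, assuming the venv was created by the repo's setup script so torch is installed there.)

```
REM run from an activated kohya_ss venv (see the activation steps earlier in this thread)
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
python -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no CUDA device visible')"
```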

synystersocks commented 1 year ago

If you think you're not training on the GPU, check Task Manager: look at the processes and switch the GPU view from "3D" to "Cuda"; not all the info is shown on the default Task Manager tab. Kohya_ss is good for setting the values, BUT run the training via cmd, not the GUI. The main issue is memory (VRAM); I would recommend an A100 or a multi-GPU system.

Make sure to disable GPU acceleration; when enabled, your CPU will use your CUDA GPU cores to accelerate CPU processes. This also requires Windows dev mode to be enabled.

Newer CPU/GPU architecture is preferred; e.g. I run a 12700K, 32 GB RAM, a 2 TB M.2 drive and an RTX 3060 Ti.

Two technologies help that are only available with certain hardware: Resizable BAR lets the CPU access GPU VRAM without keeping a second copy of the data in system RAM, and GPU direct memory streaming lets data be transferred from the M.2 drive straight to VRAM. These help a lot, but the requirements are an Intel CPU with Resizable BAR support (for Resizable BAR) and an RTX GPU with GPUDirect Storage.

There is more helpful info in my previous posts; hopefully this can help. Civitai also has some fantastic articles, and I would also suggest learning scientific notation.

All the devs working on this are doing it for the love of tech; please thank them for the work they have done, not for what they haven't done yet.

Personally I'm a game developer, not a software developer; I program mainly in C# and HLSL, and in a load of other languages too, including Python, Java, C++, etc. This is an opportunity to learn more: if you find you don't know something, then you know exactly what you need to learn. The devs have been fantastic on this project, so thank you very much for access to this incredible technology 😀 you all rock!

2blackbar commented 1 year ago

Don't you think this is a bit too much Frankensteining of random shit just to get SDXL working? Because SD 1.5 works right out of the box, but when I train SDXL it slows down a lot, and in the official articles SDXL is praised by Stability as the one that's so easy to train... apparently it's not, with the current toolset. I think I will just write my own GUI for the kohya scripts and abandon this one; it's strange to have so many issues for so many months on this repo. Are 3 iterations per second normal on a 3090? Because I was able to get it to that speed, and I see in YT tutorials that others get that as well. I changed the Python version to the required one (3.10.9, I think) and the Nvidia drivers to 512.86 or 68.

DarkAlchy commented 1 year ago

Dont you think this is a bit too much frankensteining random shit just to get sdxl working ? cause sd1.5 works right out the box but when i train sdxl it slows down a lot and in official articles and all sdxl is praised by stability as the one thats so easy to train... apparently its not with current toolset. I think i will just write my own gui for kohya scripts and abandon this one , its strange to have so many issues for so many months on this repo. Is 3 iterations pers second normal on 3090 cause i was able to get it to this speed and ia see on yt tutorials that others have that as well, i changed python version to the one required.3.10.9 i think and drivers for nvidia 512.86 or 68

People, people. First of all, someone on this thread needs to stop being a used-car salesman. Don't make offers to the community if there are strings attached, as in Patreon-only. That is really getting a lot of people out here annoyed at you beyond belief. You do you, just realize it is pretty sad.

Now, the slowdowns this thread was talking about, and which I showed, are due partly to issues from Nvidia (for instance, I can DreamBooth at batch size 12 on my 4090 in Linux but can't do more than batch size 1 or 2 in Windows; no idea why) and partly to Kohya, as I showed using the same SD 2.1. It proved Kohya has an issue, no matter who or what tries to sugar-coat it.

The issue I think you might be talking about, @2blackbar, is that XL works with four, yes 4, times the data of 1.5, which is a lot more computation, so that slowdown is understandable (look out for SD 3 with 2048x2048 if they still go through with it). But when my 4090 is, in hardware terms, 57% faster than a 3090 and I am slower than, or at best equal to, it, we have issues beyond the size of the data being manipulated, and that is why this thread was created. Not sure why anyone who has a 4090 gives one darn about tricks for a 3090 when this thread clearly says 4090 in the title. If you, or anyone, apply a 3090 trick, then that same trick should be 35-57% faster on a 4090.

MMaster commented 1 year ago

Not sure if this will help or whether it's better than what you are seeing, as I don't have a 3090 to compare with, but I was getting results similar to those mentioned here on my 4090, so I started experimenting a bit.

I've got about 200 training images ranging from 1024x1024 to 4K. Initially my total number of steps at batch size 1 was around 21k, and it would take 13-16 hours at about 2-3 s/it. Now I'm running with batch size 8 at 3 s/it and it will finish in 2.5 hours.

I think the main thing was: gradient checkpointing & caching latents to disk.

During the caching-latents-to-disk step I noticed it was able to do 10+ it/s, but used VRAM was slowly climbing until it reached ~23.5 GB, at which point it slowed to a crawl (again ~2-3 s/it) while the GPU was not heating up and the fans were almost always idle, even though it reported 100% usage. Monitoring the system, I realized this is because data is being shuffled between GPU and CPU while the python process uses one full CPU thread, so the CPU was the bottleneck.

When I cancelled the process with Ctrl+C and restarted it, it continued caching latents for the remaining images super fast until the slowdown happened again (still during latent caching, not training yet). So I restarted it again; I had to restart ~4 times until all the images had their latents cached on disk.

After this I got to about 1.4 s/it with batch size 1, but I saw that the GPU was not getting hot, python was again using 100% of one CPU thread, and the GPU was only using about 14 GB of VRAM, which led me to believe the bottleneck is that single python CPU thread. So I started increasing the batch size until I got to 8, which seems to be the sweet spot of using most of the VRAM without being limited by the CPU (about 3.2 s/it, but total time reduced from 13-16 hours to 2.5 hours).

Again, I'm not sure if this is better or worse than what you can get on a 3090, and I know batch size 8 will not give the same results as batch size 1, but I just thought I would share my findings.
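
(The settings described above map onto flags like these in the kohya training command - the same flags that appear in the full commands posted in this thread, with the batch size raised to 8. Only the relevant flags are shown, not a complete command.)

```
--gradient_checkpointing --cache_latents --cache_latents_to_disk --train_batch_size="8"
```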

FurkanGozukara commented 1 year ago

@MMaster, an important reminder:

If you use a higher batch size you will get less generalization.

In particular, you will get much lower quality results when you train a person, or anything else, with a smaller number of images.

A higher batch size also requires re-tuning the learning rate to stay optimal.

MMaster commented 1 year ago

@FurkanGozukara Thank you for that addition - yeah I know that. My experiment & wall of text was more about getting the most out of the GPU instead of getting the best training results. Maybe the way I got to it will trigger some eureka moment for someone as to why 4090 can be slower than 3090 & maybe not. Just wanted to share.

MMaster commented 1 year ago

@FurkanGozukara But from what I see in this discussion you were getting 1.23s/it on 3090 TI with 13900k.

I've tested it again with batch size 1: even without gradient checkpointing I'm getting ~1.7 s/it with a 10900K clocked at 5 GHz, after letting it cache latents to disk (restarting it during latent caching whenever VRAM fills up), and 1.05 s/it with --network_train_unet_only. In both cases the GPU is underutilized, sitting between 26% and 46% used, while the python process uses 100% of a single CPU core/thread.

To me it looks like there are two different performance issues:

  1. An issue when latents are not yet cached on disk (or disk caching is not used at all): once VRAM hits 100% it becomes really slow, the GPU looks 100% utilized even though it's not heating up, and the python process uses 100% of its single CPU thread (about 2-3 s/it).
  2. An issue when latents are already cached on disk: VRAM is not fully used and the GPU sits at only ~30-40%, but the python process still uses 100% of a single CPU thread (about 1.7 s/it with normal training and 1 it/s with unet-only training).

Maybe I'm wrong, but:

  1. The first issue looks like some kind of video memory leak during latent caching, causing a lot of traffic between CPU and GPU when VRAM is almost full.
  2. The second issue looks like the GPU being so fast that the single CPU thread running python cannot push work to it fast enough.
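
(A simple way to watch both symptoms from a second console while training runs: nvidia-smi can log GPU utilization and VRAM use once per second, while Task Manager or HWiNFO shows the per-core CPU load. A sketch.)

```
REM logs GPU utilization and VRAM use every second until stopped with Ctrl+C
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```
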
FurkanGozukara commented 1 year ago

I am doing an SDXL DreamBooth training atm - full training of both the text encoder and the unet.

1024x1024, no xformers - I don't use xformers since it degrades quality. Gradient checkpointing enabled.

During caching latents (including to disk) it used max 4.3 GB VRAM - no leak.

Training speed is around 1.52 seconds/it.

My SDXL LoRA training speed is over 1 it/s atm with xformers and no gradient checkpointing; it can probably reach about 1.2 it/s.

Here is a screenshot from my current DreamBooth - hopefully a tutorial is coming.

image

etha302 commented 1 year ago

I am doing a SDXL DreamBooth training atm - full training both text encoder and unet

1024x1024 no xformers - I don't use xformers since it degrades quality gradient checkpoint enabled

during caching latents (included on disk) it used max 4.3 GB VRAM - no leak

training speed is around 1.55 second / it - it may get as low as 1.5 second / it

my SDXL LoRA training speed is over 1 it / s atm with xformers and no gradient checkpoint. probably can reach about 1.2 it / s

here screenshot from my current DreamBooth - hopefully a tutorial coming

image

How do you know it also trains both the unet and the text encoder, since there is no option for that in the GUI? On my 4090 I get around 1.4 s/it for DreamBooth and 1.5 s/it for LoRA, so again the 3090 is faster.

FurkanGozukara commented 1 year ago

I am doing a SDXL DreamBooth training atm - full training both text encoder and unet 1024x1024 no xformers - I don't use xformers since it degrades quality gradient checkpoint enabled during caching latents (included on disk) it used max 4.3 GB VRAM - no leak training speed is around 1.55 second / it - it may get as low as 1.5 second / it my SDXL LoRA training speed is over 1 it / s atm with xformers and no gradient checkpoint. probably can reach about 1.2 it / s here screenshot from my current DreamBooth - hopefully a tutorial coming image

How do you know it also trains unet and text encoder, since there is no option in gui? On 4090 i get around 1.4s/it on dreambooth and 1.5s/it for lora. So again 3090 is faster

it is enabled by default

also you need to compare with gradient checkpoint enabled and xformers disabled for dreambooth

and gradient checkpoint disabled and xformers enabled for lora to compare with my speeds

etha302 commented 1 year ago

I am doing a SDXL DreamBooth training atm - full training both text encoder and unet 1024x1024 no xformers - I don't use xformers since it degrades quality gradient checkpoint enabled during caching latents (included on disk) it used max 4.3 GB VRAM - no leak training speed is around 1.55 second / it - it may get as low as 1.5 second / it my SDXL LoRA training speed is over 1 it / s atm with xformers and no gradient checkpoint. probably can reach about 1.2 it / s here screenshot from my current DreamBooth - hopefully a tutorial coming image

How do you know it also trains unet and text encoder, since there is no option in gui? On 4090 i get around 1.4s/it on dreambooth and 1.5s/it for lora. So again 3090 is faster

it is enabled by default

also you need to compare with gradient checkpoint enabled and xformers disabled for dreambooth

and gradient checkpoint disabled and xformers enabled for lora to compare with my speeds

I know; I am talking about the absolute best speed I can achieve on the 4090, including when using your .json files from Patreon, so something is clearly going on. It doesn't even use half the card; the max temp my 4090 reaches is 54 degrees, with the fans barely spinning, using around 20 GB of VRAM.

FurkanGozukara commented 1 year ago

@etha302 how much GPU usage you see on task manager?

I am seeing 95%+

MMaster commented 1 year ago

@FurkanGozukara can you check your CPU use of the python process? Is it close to 100% of single CPU thread/core?

Currently it looks to me like a combination of factors: you have a better CPU, so you can utilize the 3090 Ti to 100%. I have a worse CPU and a better GPU, which is why I see the GPU utilized at only 26-46% while getting 1.7 s/it with text encoder & unet training and 1.05 s/it with unet-only training, with one CPU thread used at 100% by python.

etha302 commented 1 year ago

@etha302 how much GPU usage you see on task manager?

I am seeing 95%+

It is fluctuating a lot: it goes to 90 or over, then down to 30, then back to 90+ again; it always does that no matter what settings I use. CPU: Ryzen 5900X, GPU: RTX 4090, memory: 64 GB - just so people know the specs.

FurkanGozukara commented 1 year ago

@FurkanGozukara can you check your CPU use of the python process? Is it close to 100% of single CPU thread/core?

Currently it looks to me as combination of: you have better CPU so you can utilize 3090 Ti to 100%. I have worse CPU and better GPU which is why I'm seeing GPU being utilized only between 26 - 46% when getting 1.7s/it with text & unet training and 1.05s/it when doing unet only training while one CPU thread is getting used to 100% by python.

My python processes look like this - yeah, probably one core is at 100%.

image

FurkanGozukara commented 1 year ago

@etha302 how much GPU usage you see on task manager? I am seeing 95%+

It is flactuating allot it goes to 90 or over then to 30 then again back to 90+, it always does that no matter what settings i use. cpu: ryzen 5900x Gpu: rtx 4090 Memory:64gb Just so people know the specs

Mine never goes below 90%.

I think it is down to the shtty NVIDIA drivers.

FurkanGozukara commented 1 year ago

Here is my SDXL DreamBooth training tweet; you can follow me there 🗡️ Going to sleep now.

https://twitter.com/GozukaraFurkan/status/1704318431292469662

MMaster commented 1 year ago

Well, it definitely looks like the issue is that the GPU is faster than a single CPU thread can handle (since that python process runs single-threaded). The Ryzen 5900X (turbo 4.8 GHz) has lower clock speeds than the 13900K (turbo 5.35 GHz) and is also about 30% slower overall. My 10900K clocked at 5 GHz (4.7 GHz when AVX2 is used) is also slower.

Since even on the 13900K you are getting close to 100% use of a single core by that python process, I would guess I would see slower speeds than you even with a 3090 Ti. That's why I see GPU use not going above 46% with batch size 1: the GPU finishes the work so fast that the single CPU thread can't keep up. Going to higher batch sizes solves this, as the GPU needs to do more work before asking the CPU for something.

Based on this, I would say the solution would be making the python <-> GPU communication use more than one CPU thread, which python can't normally do because of the Global Interpreter Lock, apart from other difficulties in implementing such a mechanism.

I would still be interested in whether someone who has a 13900K with a 4090 can get the same speeds as you.

FurkanGozukara commented 1 year ago

Here my LoRA training at the moment : https://twitter.com/GozukaraFurkan/status/1704447704783261716 xformers enabled gradient checkpoint disabled

1.12 s/it

etha302 commented 1 year ago

Here my LoRA training at the moment : https://twitter.com/GozukaraFurkan/status/1704447704783261716 xformers enabled gradient checkpoint disabled

1.12 s/it

Would it be possible to train lora without xformers as well?

MMaster commented 1 year ago

image

LoRA training on a 4090. Batch size: 1, UNET only, xformers enabled, gradient checkpointing disabled.

GPU used at 9%, doing 1.03 s/it, while the kohya python process uses 100% of a single CPU thread (on a 20-thread CPU clocked at 4.7 GHz during the run).

etha302 commented 1 year ago

image

Lora training on 4090 Batch size: 1 UNET only xformers enabled gradient checkpoint disabled

GPU used at 9% doing 1.03s/it while kohya python using 100% of single CPU thread (on 20 thread CPU clocked at 4.7GHz during run).

This is beyond weird, 9%?? I do get GPU usage up to 90+, but it fluctuates: it drops to 30 and then goes up to 90 again and so on. And the speed is around 1.5 s/it.

MMaster commented 1 year ago

Yep, but that is what I see. The GPU is completely quiet, sitting at 49°C and even turning off its fans from time to time (I'm connected to it remotely, which is why there is video encoding being done). This is why I think it is limited by the CPU: I can get the GPU to do work by increasing the batch size to 8.

But I'm using older drivers from 23-Jun-2023, I will try to update those but I don't expect too much change.

Also here is the full train command line: accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="X:/_AIGen/stable-diffusion-webui/models/Stable-diffusion/sdxl/ sd_xl_base_1.0.safetensors" --train_data_dir="T:/_AITraining/traindata/mmrdtslm_r1/image" --reg_data_dir="T:/_AITraining/traindata/mmrdtslm_r1/reg" --resolution="1024,1024" --output_dir="T:/_AITraining/traindata/mmrdtslm_r1/model" --logging_dir="T:/_AITraining/traindata/mmrdtslm_r1/log" --network_alpha="4" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0003 --unet_lr=0.0003 --network_dim=64 --output_name="mmrdtslm_r1_bs1" --lr_scheduler_num_cycles="10" --no_half_vae --learning_rate="0.0003" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="21700" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --bucket_no_upscale --noise_offset=0.0 --network_train_unet_only --sample_sampler=euler --sample_prompts="T:/_AITraining/traindata/mmrdtslm_r1/model\sample\prompt.txt" --sample_every_n_steps="500"

etha302 commented 1 year ago

Yep but that is what I see. The GPU is completely quiet sitting at 49'C even turning off fans from time to time (I'm even connected to it remotely that's why there is Video Encoding being done). This is why I think it is limited by CPU because I can get the GPU to do work by increasing the batch size to 8.

But I'm using older drivers from 23-Jun-2023, I will try to update those but I don't expect too much change.

My GPU reaches around 54°C, and the fans are also barely spinning (but they don't turn off). My usage fluctuates a lot, though, as I said. Considering your 9% usage, you get way better speeds than I do. Really weird; I tried every driver I could download and nothing changed. Currently running 531.79, I think? Not sure, though I know they are pretty old. Edit: will post some pictures later when I'm home.

FurkanGozukara commented 1 year ago

Here my LoRA training at the moment : https://twitter.com/GozukaraFurkan/status/1704447704783261716 xformers enabled gradient checkpoint disabled 1.12 s/it

Would it be possible to train lora without xformers as well?

Yes, with gradient checkpointing it works, but at 1.75 s/it.

MMaster commented 1 year ago

Yep but that is what I see. The GPU is completely quiet sitting at 49'C even turning off fans from time to time (I'm even connected to it remotely that's why there is Video Encoding being done). This is why I think it is limited by CPU because I can get the GPU to do work by increasing the batch size to 8.

But I'm using older drivers from 23-Jun-2023, I will try to update those but I don't expect too much change.

Also here is the full train command line: accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="X:/_AIGen/stable-diffusion-webui/models/Stable-diffusion/sdxl/ sd_xl_base_1.0.safetensors" --train_data_dir="T:/_AITraining/traindata/mmrdtslm_r1/image" --reg_data_dir="T:/_AITraining/traindata/mmrdtslm_r1/reg" --resolution="1024,1024" --output_dir="T:/_AITraining/traindata/mmrdtslm_r1/model" --logging_dir="T:/_AITraining/traindata/mmrdtslm_r1/log" --network_alpha="4" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0003 --unet_lr=0.0003 --network_dim=64 --output_name="mmrdtslm_r1_bs1" --lr_scheduler_num_cycles="10" --no_half_vae --learning_rate="0.0003" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="21700" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --bucket_no_upscale --noise_offset=0.0 --network_train_unet_only --sample_sampler=euler --sample_prompts="T:/_AITraining/traindata/mmrdtslm_r1/model\sample\prompt.txt" --sample_every_n_steps="500"

I updated the drivers to the latest (537.34) and there is no change. I've just noticed that Windows Performance Monitor is wrong: HWiNFO shows GPU core load going between 23-50% while Performance Monitor shows 6% use. image

Anyway, it still looks like the python process cannot utilize the GPU fully because it maxes out the single CPU core it uses at 100%.