AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

RTX 4090 performance #2449

Bachine opened this issue 1 year ago

Bachine commented 1 year ago

Is the 4090 fully supported in SD?

I am getting the same performance with the 4090 that my 3070 was getting.

cmp-nct commented 1 year ago

In general: PSU problems typically cause a hard crash. Most good PSUs will shut down, and the cheapest ones will likely cause the GPU or CPU to crash when overloaded or undersupplied.

Kosinkadink commented 1 year ago

@Bachine Yes, if I understood you right. You can technically run the pip install command from earlier in this thread from any directory as long as you have that venv activated in your session.
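
For reference, a minimal sketch of what that looks like on Windows, assuming the webui lives in C:\stable-diffusion-webui and the xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl wheel mentioned later in this thread has already been downloaded (the download path below is just a placeholder):

cd C:\stable-diffusion-webui
venv\Scripts\activate
pip install path\to\xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

Once the venv is activated, pip installs into that environment no matter which directory you run it from.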

C43H66N12O12S2 commented 1 year ago

@Farfie No, it shouldn't be.

Try downloading cuDNN 8.6 and dropping everything in the bin folder into python installation path\Lib\site-packages\torch\lib
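
As a rough sketch, assuming the cuDNN 8.6 archive was extracted to C:\cudnn and Python 3.10 is installed under C:\Python310 (both paths are placeholders), the copy step on Windows would look something like:

copy /Y C:\cudnn\bin\*.dll C:\Python310\Lib\site-packages\torch\lib\

For a venv install, the target is the venv's Lib\site-packages\torch\lib folder instead, as comes up later in the thread.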

SSCBryce commented 1 year ago

@cmp-nct That's how it was in the past. Starting with ATX 3.0 (PCIe 5.0) PSUs and this generation of card, the two can now communicate with each other, so PSU overload crashes should happen far less often than before. That's my understanding anyway.

@C43H66N12O12S2 Seems like I need a developer membership? You wouldn't happen to have a mirror?

C43H66N12O12S2 commented 1 year ago

Making one should be pretty fast and easy - there are no requirements or anything like that.

I'd upload a copy, but it's a large file and my upload is slow. If you can't get a developer account I'll do it, but you should try making an account first.

SSCBryce commented 1 year ago

@C43H66N12O12S2 Alright, made an account and downloaded it. There's no torch folder in my global Python install, so I put the files into the torch/lib folder I found under the webui directory and replaced files when asked. Launched from webui-user, it stayed blank for a while (presumably loading the new libraries) and then started running. And yep, it seems to have made quite a big difference. Did I mess those files up somehow? Also, ignore some of those values, I was trying higher resolutions, and the error at the top was from trying batch size 4 during training. I'd also like to note that my images are messed up now too... did training with these new libraries for a minute mess up the embedding I was working on? xD

C43H66N12O12S2 commented 1 year ago

No, you didn't mess up anything. PyTorch comes bundled with CUDA & friends (cuDNN is one such friend), but they're old versions, far older than the 40 series.

As for training, no idea.

dkeleee commented 1 year ago

As another datapoint, on my 4090, with all default webui options and batch size 1 I was getting 10it/s.

Replacing cudnn files pushed that up to 12it/s.

With the "f" version of xformers above it's now 15it/s.

SSCBryce commented 1 year ago

Well, we've made progress, to be sure. Still, you get 23it/s on a single image? Crazy. I get anywhere between 12 and 15 on a single image, depending on... RNG? Still, pretty dang good performance now, especially in batch mode as you say.

C43H66N12O12S2 commented 1 year ago

Slightly slower due to having a VAE loaded.

ARintaro commented 1 year ago

On my 4090, with the xformers above and cuDNN 8.6: Steps: 20, Sampler: Euler a, CFG scale: 7, Size: 512x512, Model hash: e6e8e1fc, Eta: 0.67, Clip skip: 2. Is it working properly? And is it better than a 3090?

SSCBryce commented 1 year ago

@C43H66N12O12S2 Holy MOLY man, what wizardry is that?! Here's mine, 1x8 for the first run, 4x1 on the second run (batch sizes). Please tell me we'll see some kind of performance like that on Windows eventually!!

C43H66N12O12S2 commented 1 year ago

I have 0 clue why everybody has slower speeds. With the cuDNN upgrade, you're using my exact setup.

on Windows eventually!!

I am on Windows.

SSCBryce commented 1 year ago

Ah, assumed Linux with the font. That is quite weird. Which card did you get exactly? Windows 11? I have the OC'd ASUS TUF, Win11.

C43H66N12O12S2 commented 1 year ago

Win11, Suprim X. Though whatever advantage the Suprim X might have would've been thoroughly crushed by my 100W lower power limit.

SSCBryce commented 1 year ago

Certainly. I was just curious. I/O isn't too intense after the data is loaded, right? I have the directory on a USB 3.1 storage drive. Doubt putting it on an M.2 would make much of a difference.

C43H66N12O12S2 commented 1 year ago

I doubt IO is the issue since other people in this thread reported similar speeds to you. At least one of them must be using it on an SSD.

FWIW, mine is on an SN850

ARintaro commented 1 year ago

I doubt IO is the issue since other people in this thread reported similar speeds to you. At least one of them must be using it on an SSD.

FWIW, mine is on an SN850

Mine is on an SSD too, but still slow. Is the bottleneck my 5800X CPU and PCIe 3.0 motherboard?

SSCBryce commented 1 year ago

Ah, that's an idea. Are you on PCIe 5.0, @C43H66N12O12S2? I don't believe I am.

C43H66N12O12S2 commented 1 year ago

4.0 & 11700K

ARintaro commented 1 year ago

I doubt IO is the issue since other people in this thread reported similar speeds to you. At least one of them must be using it on an SSD. FWIW, mine is on an SN850

Mine is on an SSD too, but still slow. Is the bottleneck my 5800X CPU and PCIe 3.0 motherboard?

My mistake - mine is on PCIe 4.0 too.

SSCBryce commented 1 year ago

Oh, dang. Thought we were on to something, as apparently my 10th gen is capped at 3.0 as well. Didn't even know tbh. Brb, Best Buy.

Bachine commented 1 year ago

So I ran the pip command in the venv and I get this: ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\stable-diffusion-webui\venv\Lib\site-packages\xformers\_C.pyd' Check the permissions.

SSCBryce commented 1 year ago

You probably still have Stable Diffusion running?

Bachine commented 1 year ago

Yeah lol. Now that I've installed it, it should run faster?

Bachine commented 1 year ago

still under 10 it/s with a 512x512 image

SSCBryce commented 1 year ago

Did you update cuDNN as well?

Bachine commented 1 year ago

And how would I do that?

SSCBryce commented 1 year ago

I uploaded them here: https://pomf2.lain.la/f/5u34v576.7z And they go to stable-diffusion-webui\venv\Lib\site-packages\torch\lib

Bachine commented 1 year ago

OK, 14 it/s now. Still seems slow for a 4090?

SSCBryce commented 1 year ago

Yes, there will most likely be more optimizations coming, but that seems to be the best we can do right now. You'll still get a lot more performance by simply batching, though. You should use the full batch size of 8 all the time, unless training; then it depends on your data set size, I assume.

Bachine commented 1 year ago

Well, thanks for your help, it's definitely an improvement!

Can't wait for proper optimization for the 4000 series.

SSCBryce commented 1 year ago

All the credit goes to C4. Actually, C4, do you have Resizable BAR enabled?

C43H66N12O12S2 commented 1 year ago

yes

noprompt commented 1 year ago

On WSL Ubuntu I get:

ERROR: xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl is not a supported wheel on this platform.

What can I do to fix this?

C43H66N12O12S2 commented 1 year ago

You can stop using WSL

My wheels are for Windows

cmp-nct commented 1 year ago

On WSL Ubuntu I get:

ERROR: xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl is not a supported wheel on this platform.

What can I do to fix this?

https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2449#issuecomment-1279837988

That's for Windows, but it shows how to get it running without prebuilt wheels. It's the same type of error I ran into, since all the tutorials use wheels.
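
For what it's worth, a rough sketch of the wheel-free route on Linux/WSL is to build xformers from source inside the venv; the commands below follow the xformers repo's own install instructions rather than anything from this thread, and building requires a CUDA toolkit matching your PyTorch:

pip install ninja
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers

Compiling the CUDA kernels locally can take quite a while, but it sidesteps the win_amd64 wheel mismatch entirely.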

michael-persch commented 1 year ago

I can confirm that I also get 21% less performance with stable diffusion on my RTX 4090 than on my RTX 3090.

C43H66N12O12S2 commented 1 year ago

@Farfie Try using channelslast. I tested without it today and saw a 20% performance drop: ~18it/s vs ~22it/s.
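
For context, channelslast here refers to the webui's --opt-channelslast launch flag (PyTorch's channels_last memory format). Assuming a stock Windows install and that the flag names haven't changed since then, it can be enabled in webui-user.bat alongside xformers, e.g.:

set COMMANDLINE_ARGS=--xformers --opt-channelslast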

SSCBryce commented 1 year ago

@C43H66N12O12S2 Should this affect training as well? I added it and performance seems exactly the same so far. 124-image 512x768 PNG data set, style_filewords, same NAI pruned model, yaml loaded but no VAE, batch size 3.

C43H66N12O12S2 commented 1 year ago

Yes, it should. Weird how it doesn't help you at all; forgoing channels last consistently degrades my performance by at least 20%. Try pip install -U -I --no-deps torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116 inside the venv. Upgrade cuDNN again after doing this.

SSCBryce commented 1 year ago

Did all of that in that order, and now it's throwing "PyTorch has cuda version 11.6 and torchvision has cuda 11.3. Please reinstall the torchvision that matches your pytorch install." before it finishes loading the webui.

C43H66N12O12S2 commented 1 year ago

Do pip install -U -I --no-deps torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116 inside venv.

SSCBryce commented 1 year ago

Did that, same speed. Tried replacing cuDNN again because I wasn't sure if that had carried over, but again, same speed. Maybe I should try removing the yaml again? Heh. Edit: yaml gone, same speed :(

C43H66N12O12S2 commented 1 year ago

Very weird. I know some people are waiting for optimized libraries for the 4090 or whatever, but that’s not going to happen. (Except PyTorch with CUDA 11.8, but that’s not enough to explain this discrepancy.)

This is basically it - and some people, including you, are still experiencing significantly worse performance.

Just to make sure, you're using the latest driver, right?

SSCBryce commented 1 year ago

NVIDIA Control Panel reports 522.25, yep.

That is very weird. Is there some command I can toss in to ensure I'm using all the needed stuff? Also, what % of people are we talking about? Could it be some hardware scandal yet unheard of on these new devices? The world may never know...

C43H66N12O12S2 commented 1 year ago

I mean people in this thread and on sdg.

You can try python -m xformers.info, maybe. If anything else were broken, it would either error out or you wouldn't even be getting 15it/s.
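
Another quick sanity check, run inside the venv, is to print the versions PyTorch actually sees; these are standard PyTorch attributes, nothing webui-specific:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version(), torch.cuda.get_device_name(0))"

After the cuDNN swap, the cuDNN number should reflect the 8.6 build rather than the bundled one.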

SSCBryce commented 1 year ago

Yeah man, I just don't know. The python -m xformers.info output looks normal, except it says it can't find triton, whatever that is; probably not important.

After all this, here's some batch size 8 with otherwise normal parameters. I do want to note, though, that the GPU is making a clicking noise while processing, and it doesn't sound like coil whine, that's for sure. I have half a mind to replace it JUST to see if it makes a difference in processing speed at this point, on top of not sounding right. I wonder if others have noticed a noise.

C43H66N12O12S2 commented 1 year ago

Do you have ECC enabled in NV Control Panel? If not, I'm beginning to seriously consider advising you to RMA as well, because this issue makes no logical sense.

SSCBryce commented 1 year ago

Would be weird if it was; I don't ever develop software and I'd never messed with compute until this particularly based piece of software arrived on our doorsteps :D. But yeah, seems not.