AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0
139.54k stars 26.45k forks

RTX 4090 performance #2449

Open Bachine opened 1 year ago

Bachine commented 1 year ago

Is the 4090 fully supported in SD?

I am getting the same performance with the 4090 that my 3070 was getting.

C43H66N12O12S2 commented 1 year ago

@Farfie Just a note, your model construction is wrong (probably due to the yaml). It's 1, 4, 64, 64 while it should be:

making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
SSCBryce commented 1 year ago

Honestly I don't even know what it does, just read a guide. Guide said to use those and the vae, and remove vae if training. Not sure if any of it is true. Removing the yaml did in fact make it read as you say it should. Should I just keep it off?

C43H66N12O12S2 commented 1 year ago

IIRC, the .yaml file is a relic of people attempting to replicate NAI with use_ema=True or whatever (the pruned model is EMA, using the full and then setting use_ema=True is literally wasting VRAM), so in that sense, it's useless.

I've never trained a hypernetwork or embedding or whatever, so maybe it's useful for that.

dkeleee commented 1 year ago

@C43H66N12O12S2 What kind of speeds are you getting at batch size 8? I noticed that if I underclock my CPU, my batch size 1 speed drops accordingly, so I'm actually CPU-bottlenecked in that case, making the result useless.

Around the start of this thread you mentioned 40it/s for batch size 8. If you were multiplying the reported it/s by 8 then that matches Farfie's result above as well as my own. I'm getting ~5.40it/s.

C43H66N12O12S2 commented 1 year ago

Size 1: 22-24 it/s
Size 8: 5.3-5.5 it/s
Size 16: 2.7 it/s

I did consider the CPU bottleneck aspect, but there's a person in the thread with a 5800X experiencing the same issue. I suppose it's possible my 11700K is just a lot better than the 5800X in this specific application.

@dkeleee @Farfie could you post your DRAM speeds? Mine is 3600-14-19-19-36-550-1T 4RPC

SSCBryce commented 1 year ago

@C43H66N12O12S2 Are those the it/s on the first or second row? Mine will usually be close to that on the first row, but after it does its post processing (or whatever it's doing before it finishes), it ends up much lower on that "second row", which is visible in my screenshots, like 3.7it/s.

C43H66N12O12S2 commented 1 year ago

First, though both rows converge at, say, 100 steps.

dkeleee commented 1 year ago

3200 CL16-18-18

I'm also getting 2.7it/s for size 16.

SSCBryce commented 1 year ago

Just out of curiosity, what do you get training on NAI (no vae or yaml) with these settings? Or does it depend on data set? I get about 1.5. If you're up to it, anyway.

C43H66N12O12S2 commented 1 year ago

1.57

SSCBryce commented 1 year ago

That's honestly pretty in-line, then. If I turn off all other software I get about 1.55. Is it stable there? Because for me, it constantly goes up and down from like 1.56 to 1.46, then back up, ad infinitum. And doing something like opening a browser tab lol, or scrolling can send it down even to like... 1.03. It's brutal, but ofc that is just how it goes.

Yeah wow, right after I typed that, I merely switched tabs and got some strange behavior. The it/s plummeted to 1.0 EXACTLY, and actually stayed there until I scrolled again. Wait a second... it's very reproducible. Surely this is just software though, lemme switch browsers lol.

C43H66N12O12S2 commented 1 year ago

The lowest it went to is 1.53, I think. I didn't test for very long. Anyhow, it's probably a CPU (or DRAM) bottleneck. The curious part, then, is why the 5800X underperforms.

SSCBryce commented 1 year ago

Are you using Firefox? I'm getting more stable performance simply by not using Brave (the Chromium base matters more, I'd assume).

C43H66N12O12S2 commented 1 year ago

Chromium

100microIQ commented 1 year ago

Hi.

I installed my 4090 today and when I try to get xformers installed it doesn't work. I'm on Win 10 Pro N.

(venv) D:\stable-diffusion\git installed\stable-diffusion-webui>pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

ERROR: xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl is not a supported wheel on this platform.

Also getting kind of sad performance in general in terms of it/s

SSCBryce commented 1 year ago

Maybe it is Brave then. Again not seeing a direct numbers improvement/increase per se, but I was always also getting strangely unstable numbers the entire time. What are these wankers at Brave up to, pardon my French...

C43H66N12O12S2 commented 1 year ago

@100microIQ What's your python version?

100microIQ commented 1 year ago

It says "Python 3.8.8 (tags/v3.8.8:024d805, Feb 19 2021, 13:18:16) [MSC v.1928 64 bit (AMD64)] on win32" when I type python into a cmd prompt, so I guess that one?

C43H66N12O12S2 commented 1 year ago

Update your python. Use version 3.10

100microIQ commented 1 year ago

I upgraded Python to 3.10.8. It works when I pip it now, but only when not in the venv? From the venv it still rejects it with the same error.

Could the venv be running a different version of Python? I type python into it and it says 3.8.8, and it still rejects the pip.

I assume that when I start from webui-user.bat it activates the venv automatically as the default behavior?

C43H66N12O12S2 commented 1 year ago

Your venv is probably stuck at 3.8. Just nuke it.

100microIQ commented 1 year ago

Not sure how to do that in the least painful manner

C43H66N12O12S2 commented 1 year ago

run python -m venv --upgrade venv/

100microIQ commented 1 year ago

The venv upgraded itself and the pip went through.

But now the webUI doesn't want to run anymore. Instead:

error wall of text

C43H66N12O12S2 commented 1 year ago

I’d recommend deleting venv/ and starting from scratch. You can fix this but it’ll basically amount to deleting venv but with far more steps.
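"Nuking" the venv amounts to deleting the venv/ folder and recreating it with the interpreter now on PATH. A minimal sketch using the standard library (the path and layout are assumptions based on the stable-diffusion-webui folder structure discussed above):

```python
# Sketch: delete the stale 3.8 venv and rebuild it with the current interpreter.
# Equivalent to `rmdir /s /q venv` followed by `python -m venv venv` on Windows.
import shutil
import sys
import venv
from pathlib import Path

venv_dir = Path("venv")  # assumed: run from the stable-diffusion-webui folder
shutil.rmtree(venv_dir, ignore_errors=True)       # "just nuke it"
venv.EnvBuilder(with_pip=True).create(venv_dir)   # inherits this interpreter's version
print("rebuilt venv with Python %d.%d" % sys.version_info[:2])
```

Simply deleting venv/ and re-running webui-user.bat achieves the same thing, since the launcher rebuilds the venv and reinstalls the requirements on start.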

SSCBryce commented 1 year ago

Well I downloaded Chromium, looked at the blank slate it was, realized I couldn't import things (don't know why though tbh), and decided to migrate some other time. Turning off hardware acceleration "fixes" it anyway though, as expected... /blog.

100microIQ commented 1 year ago

I nuked the venv and regenerated it by running webui-user.bat. Baseline: 8 it/s. Installed xformers in the venv and it got up to 10 it/s. Did the cudnn fix and it was up to 15 it/s. Batch size 8 runs at 5.5 it/s, so 44 it/s for that. I guess that means I'm around the cutting edge then.

Thanks a lot @C43H66N12O12S2 you really made the troubleshooting painless

Kosinkadink commented 1 year ago

Finally got home and did some testing. Here are my speeds (first run skipped, the following 10 runs averaged) with the various fixes posted here, using default settings (except Steps=150 to allow it/s to settle) with prompt "chair" (batch size 1):

Fresh clone of repo: ~10.5 it/s
With only cudnn files updated (NO --xformers): ~19.5 it/s
With cudnn files updated AND --xformers (pip installed C43's wheel): ~25.6 it/s

System info:
GPU: RTX 4090
CPU: 5950X
RAM: 64GB (4x16GB) at 3200MHz C16
OS: Windows 10
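The "first run skipped, the following 10 runs averaged" methodology described above can be sketched in a few lines (the function name and sample numbers are mine, purely illustrative):

```python
from statistics import mean

def summarize_runs(its_per_run: list[float], warmup: int = 1) -> float:
    """Average it/s across repeated runs, discarding the first `warmup`
    run(s), which are always slower while kernels and caches warm up."""
    settled = its_per_run[warmup:]
    if not settled:
        raise ValueError("need more runs than warm-up runs")
    return mean(settled)

# Hypothetical run log: first run slow, the rest settled around 25.6 it/s
runs = [18.2, 25.7, 25.5, 25.6, 25.4, 25.8, 25.6, 25.7, 25.5, 25.6, 25.6]
print(round(summarize_runs(runs), 1))  # 25.6
```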

Bachine commented 1 year ago

I'll test this when I'm home.

A guide in the wiki for all this would be nice.

SSCBryce commented 1 year ago

It will be built in once it can be shown to work without regressing installs on other hardware setups.

liasece commented 1 year ago

I have made a new discovery with the 4090: if I use the --no-half parameter, inference speed is no different from when I don't use it. But on the 3090, this parameter switches between float32 and float16, causing a big change in inference speed.

I suspect, therefore, that true float16 is not enabled on the 4090, and that it has been using float32 for inference, resulting in slow speed.

swcrazyfan commented 1 year ago

On Ubuntu with xformers, my 3090 can get 20 it/s; it looks like the 4090 doesn't improve much. Maybe we still need to wait for a driver update.

Did you build your wheel or download it? For some reason, I can only get 3 it/s on VastAI and other GPU providers even with a 3090. It's basically a huge waste of money.

Thomas-MMJ commented 1 year ago

On Ubuntu with xformers, my 3090 can get 20 it/s; it looks like the 4090 doesn't improve much. Maybe we still need to wait for a driver update.

Did you build your wheel or download it? For some reason, I can only get 3 it/s on VastAI and other GPU providers even with a 3090. It's basically a huge waste of money.

You might try,

conda install xformers -c xformers/label/dev

, but they are only available for Python 3.9 or 3.10, CUDA 11.3 or 11.6, and PyTorch 1.12.1

devilismyfriend commented 1 year ago

I uploaded them here: https://pomf2.lain.la/f/5u34v576.7z And they go to stable-diffusion-webui\venv\Lib\site-packages\torch\lib

do you have the linux binaries by any chance?

sigglypuff commented 1 year ago

Anyone know what's causing this error? I've attempted to generate my own xformers / use the ones here, but no matter which I use I cannot get past this error: "The procedure entry point ?matmil@at@@YA?AVTensor@1@AEBV21@0@Z could not be located in the dynamic link library D:\stable-diffusion-webui\venv\Lib\site-package\xformers_C.pyd"

In the actual cmd line it states this after clicking OK: "WARNING:root:WARNING: [WinError 127] The specified procedure could not be found Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop"

Manimap commented 1 year ago

Is there any step-by-step procedure/wiki to enable all of the tweaks needed to get the speed-ups discussed here?

JackCloudman commented 1 year ago

I tried:

  • Install cuda 11.8
  • Replace cudnn with the files of @Farfie
  • Installed xformers with command: pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/d/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

And for k_euler_a, 150 steps, 7 cfg, just got 15 it/s :'(
Params: --xformers --force-enable-xformers --no-half
Specs:

  • RTX 4090
  • i9 9900k
  • 32GB RAM 3200MHz
  • Windows 11

devilismyfriend commented 1 year ago

I tried:

  • Install cuda 11.8
  • Replace cudnn with the files of @Farfie
  • Installed xformers with command: pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/d/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

And for k_euler_a 150 steps 7 cfg just got 15it/s :'(
Params: --xformers --force-enable-xformers --no-half
Specs:

  • RTX 4090
  • i9 9900k
  • 32GB RAM 3200MHz
  • Windows 11

Yeah, I'm in the same boat. At this point I think it may be the CPU+RAM configuration that's bottlenecking, since batch-size performance is fine, I believe.

RTX 4090, i7 9700k, 64GB RAM 3000Mhz (I think, not sure), Win 11 22H2

Kosinkadink commented 1 year ago

@JackCloudman Is there a specific reason why your arguments are "--xformers --force-enable-xformers --no-half" and not just "--xformers"? I am on Windows 10 with a 5950X, but if I do just "--xformers", I get 25it/s with the prompt "chair" at the default settings (and setting Steps to 50 or greater to let the speed even out - running at least twice as the first run is always slow). But when I do the "--xformers --force-enable-xformers --no-half" arguments, it slows down to 16-17it/s.

Unless I'm missing something, what's the purpose of needing those other arguments?

Kosinkadink commented 1 year ago

Another thing I noticed, sometimes the generation just slows down for me to 10-11it/s with those exact settings, even if I stop and then start the webui again. Only thing that fixes it for me is rebooting the whole PC, although it might also have something to do with some of the earlier comments that said they got things working by just changing the refresh rate of the monitor. Next time I get that bug, I'll just try changing monitor settings back and forth and see if it's some extreme edge case where something the GPU is caching needs to get force cleared out. I'll report back with the results.

Zeniere commented 1 year ago

Another thing I noticed, sometimes the generation just slows down for me to 10-11it/s with those exact settings, even if I stop and then start the webui again. Only thing that fixes it for me is rebooting the whole PC, although it might also have something to do with some of the earlier comments that said they got things working by just changing the refresh rate of the monitor. Next time I get that bug, I'll just try changing monitor settings back and forth and see if it's some extreme edge case where something the GPU is caching needs to get force cleared out. I'll report back with the results.

I don't know if this is your problem, but I use a 1080 Ti in Hyper-V and sometimes the GPU won't go into a higher power state. Check nvidia-smi. You can run some simple 3D app to make it change power state.

DreamHajime commented 1 year ago

Hi everyone, I tried the methods above; they did work, though not ideally. Results as follows:

I usually set up my computer this way, for temperature and stability reasons: 5900x (4.5ghz all cores) + RTX 4090 Suprim X (default settings, usually running at 2805mhz) + 3600c16 8gx4

The following parameters were applied in stable diffusion: Euler a, 20 steps, e6e8e1fc, seed 100000000

Baseline: 5900x (4.5ghz all cores) + 4090 (default) + 3600c16
batch count 9, batch size 1, 512x512: 18.79 - 20.62 it/s
batch count 9, batch size 8, 512x512: 5.49 - 5.60 it/s
batch count 9, batch size 1, 1024x1024: 7.49 - 7.53 it/s
batch count 3, batch size 8, 1024x1024: 1.04 - 1.07 it/s (this should have been tested with 9 batches, but that costs too much time and the speed is stable enough)

Better single thread: 5900x (PBO enabled, 4.85ghz single core) + 4090 (default) + 3600c16
batch count 9, batch size 1, 512x512: 18.35 - 20.98 it/s
batch count 9, batch size 8, 512x512: 5.26 - 5.59 it/s
batch count 9, batch size 1, 1024x1024: 7.51 - 7.54 it/s
batch count 3, batch size 8, 1024x1024: 1.04 - 1.07 it/s

Better memory: 5900x (4.5ghz all cores) + 4090 (default) + 3800c16
batch count 9, batch size 1, 512x512: 18.75 - 20.22 it/s
batch count 9, batch size 8, 512x512: 5.56 - 5.59 it/s
batch count 9, batch size 1, 1024x1024: 7.50 - 7.54 it/s
batch count 3, batch size 8, 1024x1024: 1.04 - 1.11 it/s

Worse all core and memory: 5900x (4.0ghz all cores) + 4090 (default) + 2666c19 (at 4.0ghz the computer cannot boot with the XMP profile, so both parameters were changed in this configuration)
batch count 9, batch size 1, 512x512: 15.03 - 17.62 it/s
batch count 9, batch size 8, 512x512: 5.56 - 5.58 it/s
batch count 9, batch size 1, 1024x1024: 7.50 - 7.52 it/s
batch count 3, batch size 8, 1024x1024: 1.04 - 1.06 it/s

Worse memory: 5900x (4.5ghz all cores) + 4090 (default) + 2666c19
batch count 9, batch size 1, 512x512: 17.43 - 18.88 it/s
batch count 9, batch size 8, 512x512: 5.58 - 5.59 it/s
batch count 9, batch size 1, 1024x1024: 7.49 - 7.54 it/s
batch count 3, batch size 8, 1024x1024: 1.04 - 1.05 it/s

The speed of the first batch of all tests was discarded since it was always significantly lower than the average and not stable

In the case of batch size 1, 512x512, the graphics card is not running at full wattage (I also observed this in HWiNFO64: usually only about 200W~250W, while the card works above 400W under the other test conditions).

Extrapolating from the heavier tests, the speed should be 30~40 it/s in that case, and the test parameters (CPU performance and memory frequency or latency) clearly do affect the speed.

So I think the reason other users can't get ideal results after trying the various solutions above may be that the test parameters are not appropriate: low resolution and a small batch size are more a test of the CPU, memory, and other factors than of RTX 4090 performance, just like playing CSGO at 1080p.

It would be more accurate to discuss what speed our graphics cards should be achieving after appropriately increasing the resolution or batch size.
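The small-batch argument above is easy to see if the reported it/s is converted into finished images per second (a quick sketch; the helper function and the midpoint figures are mine, taken from the baseline 512x512 table):

```python
def images_per_second(its: float, batch_size: int, steps: int) -> float:
    """One reported iteration advances the whole batch by one sampler step,
    so each finished image costs `steps` iterations shared across the batch."""
    return its * batch_size / steps

STEPS = 20  # Euler a, 20 steps, as in the tests above

# Baseline 512x512 midpoints: batch 1 ~19.7 it/s, batch 8 ~5.55 it/s
single = images_per_second(19.7, 1, STEPS)    # ~1 image/s
batched = images_per_second(5.55, 8, STEPS)   # ~2.2 images/s

# Despite the lower it/s readout, batch 8 finishes images more than twice
# as fast, which is why small batches understate what the 4090 can do.
print(round(single, 2), round(batched, 2))
```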

YourFriendlyNeighborhoodMONKE commented 1 year ago

I have a sneaking suspicion that 4090 running on PCIe 3.0 mobo might be a big bottleneck...

Can someone confirm that they are running a PCIe 3.0 system with higher than 16 it/s on any GPU?

ChucklesTheBeard commented 1 year ago

I have a sneaking suspicion that 4090 running on PCIe 3.0 mobo might be a big bottleneck...

Can someone confirm that they are running a PCIe 3.0 system with higher than 16 it/s on any GPU?

20 steps, 512x512, RTX 3090 Ti (stock clocks), PCIe 3.0 mobo (x16 slot), commit 7ba3923d5b494b7756d0b12f33acb3716d830b9a

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:06<00:00,  2.92it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:06<00:00,  2.92it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:06<00:00,  2.92it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:06<00:00,  3.00it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:06<00:00,  2.99it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:06<00:00,  3.03it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:07<00:00,  2.72it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:07<00:00,  2.64it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:07<00:00,  2.63it/s]
Total progress: 100%|████████████████████████████████████████████████████████████████| 180/180 [01:21<00:00,  2.21it/s]

2.21*8=17.68it/s/img, not sure if that's what you had in mind. batch size 1 w/ same settings -> 11.4 it/s

I also get 3.13it/s at 100 steps batch size 8 = 25.04 it/s/img 100 steps batch size 1 = 16.03it/s, pushing it out to 500 steps batch size 1 = 16.52it/s
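The conversions in the comment above (reported it/s multiplied by the batch size) can be made explicit with a tiny helper (the function name is mine):

```python
def per_image_rate(reported_its: float, batch_size: int) -> float:
    """The progress bar reports iterations per second for the whole batch;
    each iteration advances batch_size images one step, so the effective
    per-image step rate is the product."""
    return reported_its * batch_size

# The figures quoted above:
print(round(per_image_rate(2.21, 8), 2))  # 17.68
print(round(per_image_rate(3.13, 8), 2))  # 25.04
```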

seggybop commented 1 year ago

I have a sneaking suspicion that 4090 running on PCIe 3.0 mobo might be a big bottleneck...

Can someone confirm that they are running a PCIe 3.0 system with higher than 16 it/s on any GPU?

I have 4090 attached via Thunderbolt 3 (< PCIe 3.0 4x). I have updated CUDA DLLs and theoretically new xformers (though I didn't notice any effect from that).

batch count 9, batch size 1, 512x512:

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  7.47it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  7.78it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:03<00:00,  6.25it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  7.91it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  8.27it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  7.87it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  8.63it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  7.82it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  7.85it/s]
Total progress: 100%|████████████████████████████████████████████████████████████████| 180/180 [00:27<00:00,  6.62it/s]

batch count 9, batch size 8, 512x512:

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.42it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.64it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.68it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.63it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.73it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.72it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.58it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.72it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.73it/s]
Total progress: 100%|████████████████████████████████████████████████████████████████| 180/180 [01:00<00:00,  2.97it/s]

batch count 9, batch size 16, 512x512:

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:14<00:00,  1.39it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00,  2.36it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00,  2.42it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00,  2.32it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00,  2.30it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00,  2.29it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:09<00:00,  2.17it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00,  2.32it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00,  2.39it/s]
Total progress: 100%|████████████████████████████████████████████████████████████████| 180/180 [02:10<00:00,  1.38it/s]

batch count 9, batch size 32, 512x512:

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:16<00:00,  1.24it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:17<00:00,  1.14it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:20<00:00,  1.02s/it]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:20<00:00,  1.02s/it]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00,  1.08it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00,  1.08it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:21<00:00,  1.08s/it]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:16<00:00,  1.21it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:19<00:00,  1.00it/s]
Total progress: 100%|████████████████████████████████████████████████████████████████| 180/180 [05:27<00:00,  1.22it/s] 

I think that there is a bottleneck during the initialization process, but not so much while it's running, as shown by better scaling with big batches. The it/s on a single 512x512 is very poor compared to previously posted results, but the large batch size catches up.

Cleroth commented 1 year ago

Went from 3.45 it/s to 28.8 it/s after updating cudnn from https://developer.download.nvidia.com/compute/redist/cudnn/v8.6.0/local_installers/11.8/ (replacing existing files in stable-diffusion-webui\venv\Lib\site-packages\torch\lib\*). It's a bit tough finding info about fixing this at the moment.

YourFriendlyNeighborhoodMONKE commented 1 year ago

Went from 3.45 it/s to 28.8 it/s after updating cudnn from https://developer.download.nvidia.com/compute/redist/cudnn/v8.6.0/local_installers/11.8/ (replacing existing files in stable-diffusion-webui\venv\Lib\site-packages\torch\lib\*). It's a bit tough finding info about fixing this at the moment.

That's the wildest jump I've seen anyone report from just the cudnn file replacement.

Did you replace with the contents of the bin folder of the cudnn package or some others too?

Cleroth commented 1 year ago

Went from 3.45 it/s to 28.8 it/s after updating cudnn from https://developer.download.nvidia.com/compute/redist/cudnn/v8.6.0/local_installers/11.8/ (replacing existing files in stable-diffusion-webui\venv\Lib\site-packages\torch\lib\*). It's a bit tough finding info about fixing this at the moment.

That's the wildest jump I've seen anyone report on just cudnn file replacement

Did you replace with the contents of the bin folder of the cudnn package or some others too?

Only that. I already had xformers on though. I did replace it with the new one stated here but it was only a minor improvement. As someone mentioned, maybe some users are getting bottlenecked by something else. I have an i9-13900K with 5800 MHz DDR5, which likely helps.

YourFriendlyNeighborhoodMONKE commented 1 year ago

Went from 3.45 it/s to 28.8 it/s after updating cudnn from https://developer.download.nvidia.com/compute/redist/cudnn/v8.6.0/local_installers/11.8/ (replacing existing files in stable-diffusion-webui\venv\Lib\site-packages\torch\lib\*). It's a bit tough finding info about fixing this at the moment.

That's the wildest jump I've seen anyone report on just cudnn file replacement Did you replace with the contents of the bin folder of the cudnn package or some others too?

Only that. I already had xformers on though. I did replace it with the new one stated here but it was only a minor improvement. As someone mentioned, maybe some users are getting bottlenecked by something else. I have an i9-13900K with 5800 MHz DDR5, which likely helps.

Yeah, I bet there's some kind of other hardware bottlenecking going on. I'm on an 8700K, a PCIe 3.0 system, with sub-19 it/s after optimizations, and I've seen quite a few people with similar setups stuck at the 15 it/s level.

However, I have seen a couple of examples where fresh Windows installs with the latest software suggest there could also be some kind of Python version conflict or something else going on. It's so hard to tell.

zencyon commented 1 year ago

Glad I found this thread, thanks for all the info and help <3 SD is 10x more fun now xD

I was getting slow speeds on the 4090, 2-3 it/s. I tried various clean installs in various combinations following these posts; around 8-15 it/s was the most I got, until my last install, which got me above 28 it/s, yay :D

I've noticed the card sometimes keeps spinning for a while after a render, but maybe other things are interfering with the speed (I've also noticed having to reboot sometimes to get the speed back).

EDIT: since I still see people linking to this post, I've updated it with the newest steps I'm using, which get the best results for me, from sa-shiro: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2449#issuecomment-1349459288 I'll leave the old steps at the end, but it's better to use the new method.

NEW METHOD (cu117 and cudnn8.7), 5 steps

1 git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git

2 edit launch.py: replace the torch_command line (the line number changes, so I removed it; it's between lines 160-180 approx) with: torch_command = os.environ.get('TORCH_COMMAND', "pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117")

replace the xformers_windows_package line with: xformers_windows_package = os.environ.get('XFORMERS_WINDOWS_PACKAGE', 'https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/torch13/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl')

launch.py should look like this: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2449#issuecomment-1352165362

3 download the cudnn 8.7 files: https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/cudnn-windows-x86_64-8.7.0.84_cuda11-archive.zip and copy the .dll files from the "bin" folder in that zip, replacing the ones in "stable-diffusion-main\venv\Lib\site-packages\torch\lib"

4 add --xformers to the webui-user.bat command arguments

5 add a model to \models\Stable-diffusion, run webui-user.bat, done!


OLD METHOD

1 git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git

2 edit launch.py: replace torch_command = os.environ.get('TORCH_COMMAND', "pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113") with torch_command = os.environ.get('TORCH_COMMAND', "pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116"), then run webui-user.bat

3 download the cudnn files from https://developer.download.nvidia.com/compute/redist/cudnn/v8.6.0/local_installers/11.8/ and copy the .dll files from the "bin" folder in that zip, replacing the ones in "stable-diffusion-main\venv\Lib\site-packages\torch\lib"

4 download the file locally from: https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/d/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl (EDIT: maybe download and use this instead: https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/torch13/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl). Copy the xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl file to the root SD folder, then run venv\Scripts\activate followed by pip install xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

5 add --xformers to the webui-user.bat command arguments

6 add a model, run webui-user.bat

7 other things (likely not needed): on previous attempts I used Firefox with hardware acceleration disabled in settings. I also tried --opt-channelslast --force-enable-xformers, but in this last run I got 28 it/s without them for some reason

Results, default settings, empty prompt:

batch of 8: best: 3.54it/s (28.32it/s), typical 3.45 (27.6it/s)

single image: best 22.60it/s average: 19.50it/s

system: RTX 4090, Ryzen 3950x, 64GB 3600Mhz, M.2 NVMe