AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

RTX 4090 performance #2449

Open Bachine opened 2 years ago

Bachine commented 2 years ago

Is the 4090 fully supported in SD?

I am getting the same performance with the 4090 that my 3070 was getting.

Thomas-MMJ commented 2 years ago

What parameters are you using? You might want to try a batch of 5+ images. Also, is xformers installed, and are you launching with --xformers? Are you using --opt-channelslast? Half precision? etc.

Kosinkadink commented 2 years ago

I'm having the same issue: my 4090 is generating slower than, or at around the same speed as, my 3090 used to on the same machine. I am using Windows 10, with --xformers as my only parameter. I will be setting up my 3090 in a different PC soon, so I will be able to provide a direct it/s comparison.

My "benchmark" is using just the prompt "chair" at 150 steps with all default settings (Euler a, 512x512, Scale 7, Clip Skip 1, ENSD 0).

Using --xformers doesn't seem to make a difference; either way I'm getting around 10.6 it/s.

C43H66N12O12S2 commented 2 years ago

Xformers currently lacks support for Lovelace (in fact, PyTorch also lacks it, I believe).

Your quoted 3090 numbers are too low BTW. I get around 16it/s with your settings on a 3080 12GB.

I'll perform some further testing when my 4090 arrives. (and attempt to build xformers for it)

Kosinkadink commented 2 years ago

Gotcha, guess my 4090 performance will be meh until PyTorch (and xformers) get Lovelace support.

Those numbers were for my 4090, my 3090 was not plugged in at the time, but it is now.

With those same settings, my 3090 gets around 15.7 it/s without --xformers. I don't have xformers set up yet on that machine (I'm running Ubuntu and will need to use a workaround to get xformers installed properly).

So the 4090 currently delivers only about two-thirds the performance of a non-xformers 3090.

cmp-nct commented 2 years ago

I had hit 15-18 with my 3090, but now it's 13. No command line parameters, still the same setup. Strange.

C43H66N12O12S2 commented 2 years ago

Preliminary testing: 4090 (or JIT PTX) really dislikes channels last. Halves performance. Without channels last, my 4090 is about 10 times slower than my 3080 with torch 1.12.1 + cu116.

C43H66N12O12S2 commented 2 years ago

Updating cuDNN did the trick. Getting 15it/s without xformers.

To support Lovelace in xformers, we need a CUDA 11.8 build of PyTorch (I think.)

C43H66N12O12S2 commented 2 years ago

Nope, managed to build it. Getting a 43% speed-up compared to my 3080. 23it/s

With batch size 8, my 4090 is twice as fast compared to the 3080. ~40it/s
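For anyone comparing the figures in this thread, the relative numbers can be reproduced with a few lines of arithmetic. This is only a sketch: `speedup_percent` and `relative_performance` are hypothetical helper names, and it assumes the 3080 baseline is the roughly 16 it/s figure quoted earlier, with 10.6 it/s and 15.7 it/s being the earlier 4090 and 3090 runs.

```python
def speedup_percent(new_its: float, old_its: float) -> float:
    """Percentage gain of new_its over old_its (both in it/s)."""
    if old_its <= 0:
        raise ValueError("old_its must be positive")
    return (new_its / old_its - 1.0) * 100.0


def relative_performance(its: float, baseline_its: float) -> float:
    """Fraction of the baseline's throughput that `its` achieves."""
    if baseline_its <= 0:
        raise ValueError("baseline_its must be positive")
    return its / baseline_its


# 4090 with the rebuilt xformers (23 it/s) vs. the assumed 3080 baseline (16 it/s)
print(f"{speedup_percent(23.0, 16.0):.2f}%")      # 43.75%

# 4090 before the fixes (10.6 it/s) vs. the non-xformers 3090 (15.7 it/s)
print(f"{relative_performance(10.6, 15.7):.3f}")  # 0.675
```

Both results line up with the "43%" and "two-thirds" figures mentioned in the comments, within rounding.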

C43H66N12O12S2 commented 2 years ago

@ilcane87 @comp-nect @Kosinkadink Could you please test this wheel? This works on my 4090, but I need to make sure there isn't a regression (broken on Ampere or Pascal or whatever) with --force-enable-xformers https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/d/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

ilcane87 commented 2 years ago

@C43H66N12O12S2 Doesn't work for me on GeForce GTX 1060 6GB:

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

C43H66N12O12S2 commented 2 years ago

@ilcane87 Please try this one. Much thanks for testing these for me :) https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/e/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

SSCBryce commented 2 years ago

@C43H66N12O12S2 I'd love to test this on my 4090 :D. Do I just activate venv and then pip install it while it sits in webui root dir?

C43H66N12O12S2 commented 2 years ago

Yep

SSCBryce commented 2 years ago

@C43H66N12O12S2 Oh, also says same version, should've seen that coming... assuming I also just --force-reinstall?

SSCBryce commented 2 years ago

Sorry if I'm not supposed to be pinging you, don't really ever develop software... seem to have borked it. It's the same issue I was having when I was trying to build this stuff myself. This is after it failed to load and I also added --skip-torch-cuda-test to make it load at least.

C43H66N12O12S2 commented 2 years ago

pip seems to have replaced your torch with a CPU-only build. Run pip uninstall torch and start the repo again.
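A quick way to confirm whether the installed torch is a CPU-only build is to inspect it from the venv's Python. The sketch below is an illustration, not part of the webui: `describe_torch` is a hypothetical helper name, and it relies only on `torch.__version__`, `torch.version.cuda` (which is `None` on CPU-only builds), and `torch.cuda.is_available()`.

```python
def describe_torch() -> dict:
    """Summarize the torch build visible to the current interpreter."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment at all
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        # CUDA toolkit version torch was built against; None means CPU-only
        "cuda_build": torch.version.cuda,
        # True only if a usable CUDA device and driver are present
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_torch())
```

If `cuda_build` comes back as `None`, the situation matches the one described above, and reinstalling torch is the likely fix.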

boyjunqiang commented 2 years ago

On Ubuntu with xformers, my 3090 gets 20it/s. It looks like the 4090 doesn't improve much; maybe we still need to wait for a driver update.

ilcane87 commented 2 years ago

> @ilcane87 Please try this one. Much thanks for testing these for me :) https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/e/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

Still getting the same error on this one too.
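The "no kernel image is available" error generally means the binary contains no kernel compiled for the GPU's compute capability (6.1 for a GTX 1060, 8.9 for an RTX 4090). The following is a coarse sketch of that compatibility rule, with assumptions stated plainly: `covers` is a hypothetical helper, the arch lists in the demo calls are made up for illustration, and `torch.cuda.get_arch_list()` reports what torch itself was built for, not what a separately built xformers wheel contains.

```python
from typing import Iterable, Optional, Tuple


def covers(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Coarse check: does any compiled arch in arch_list cover this device?"""
    major, minor = capability
    device = major * 10 + minor
    for arch in arch_list:
        kind, _, num = arch.partition("_")
        if not num.isdigit():
            continue  # entries such as "sm_90a" are not modeled in this sketch
        target = int(num)
        if kind == "sm" and target // 10 == major and target <= device:
            return True  # native binary for the same major architecture
        if kind == "compute" and target <= device:
            return True  # PTX can be JIT-compiled for newer devices
    return False


def torch_report() -> Optional[str]:
    """Describe the local GPU against torch's own arch list, if torch is present."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return "torch is installed, but CUDA is not available"
    cap = torch.cuda.get_device_capability()
    archs = torch.cuda.get_arch_list()
    return f"device sm_{cap[0]}{cap[1]}, torch built for {archs}, covered: {covers(cap, archs)}"


# Hypothetical arch lists, for illustration only
print(covers((6, 1), ["sm_80", "sm_86", "sm_89"]))  # False: nothing for Pascal
print(covers((6, 1), ["sm_61", "sm_86", "sm_89"]))  # True
print(torch_report())
```

A wheel built only for newer architectures would fail the first check on a Pascal card, which is consistent with the error reported here.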

SSCBryce commented 2 years ago

@C43H66N12O12S2 Is there any way to ensure it's using the new wheel? Getting about 11it/s on 512x512, 8 cfg, 40 steps, euler a on NAI (not sure which of these affect performance, so I'm noting most settings).

C43H66N12O12S2 commented 2 years ago

Well, it would error with an older wheel. You need to add --force-enable-xformers to your COMMANDLINE_ARGS, btw. (for now)

SSCBryce commented 2 years ago

Oh, interesting. It is in there, though, along with --no-half and --precision full for NAI. Not sure if the model is making the difference.

C43H66N12O12S2 commented 2 years ago

Can you test your performance without xformers? I get 23it/s and --no-half should be about half that speed, so 11it/s sounds about right, actually.

SSCBryce commented 2 years ago

Weird. On first load without forced xformers, it never got past Commit hash:blabla. Closing the cmd and starting it again though produced a regular loadup time. Without xformers, same generation settings and cmdargs, getting about 9.6it/s. No batching.

C43H66N12O12S2 commented 2 years ago

Yeah, xformers is working for you. Use larger batches for bigger gains. Also consider removing --no-half and --precision full; IME FP16 and FP32 have only minute differences, but FP16 is twice the speed.

C43H66N12O12S2 commented 2 years ago

@ilcane87 Please test this latest one: https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

SSCBryce commented 2 years ago

Was literally about to ask if I really need those, yeah, thanks for pointing it out! Wish more coders were as clearly spoken and thoughtful as you are, man. Two random questions while I have your attention :D Should this help training, too? That's what I'm currently into, and was seeing similar numbers. Should I batch that as well? And then, is it better to increase batch count or size? Same either way? I always see a "decrease" in it/s when I batch but maybe that's just the UI giving weird numbers.

C43H66N12O12S2 commented 2 years ago

Yes. Increasing batch size is equivalent to increasing batch count, but a larger batch size will also increase your speed on these large GPUs (practically anything faster than a 1070), where VRAM bandwidth is a large bottleneck.

You're seeing a decrease because it's generating multiple images at once. To calculate what it'd correspond to for a single image, multiply iterations per second by batch size.
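As a small worked example of that conversion, here is a sketch in Python. The function name `single_image_equivalent` and the 5.0 it/s input are illustrative assumptions, not values reported by the UI.

```python
def single_image_equivalent(its: float, batch_size: int) -> float:
    """Convert the reported it/s for a batch into a single-image equivalent."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Each iteration advances every image in the batch by one step
    return its * batch_size


# Illustrative: the UI reports 5.0 it/s while generating 8 images at once
print(single_image_equivalent(5.0, 8))  # 40.0
```

So a lower it/s readout with a larger batch can still mean higher overall throughput.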

SSCBryce commented 2 years ago

Ah, makes sense, very good to know. However... something is strange again, screenshot.

ilcane87 commented 2 years ago

> @ilcane87 Please test this latest one: https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

This one works!

C43H66N12O12S2 commented 2 years ago

@Farfie I assume you're referring to the slow speeds. Are you using the full-ema model? Prune it or just use one of the pruned ones.

SSCBryce commented 2 years ago

Yes that's what I mean, apologies. Using "animefull-final-pruned" from the leaked stableckpt, not sure if there was another version people have started tossing around. Though I did forget to put the vae back in after training, perhaps that has something to do with it.

C43H66N12O12S2 commented 2 years ago

Try deleting/moving your .yaml file. If you still get slow speeds, that needs a larger troubleshooting effort.

SSCBryce commented 2 years ago

Hmm, moving the yaml kept things the same. 3.5it/s for "100%" and 3it/s on "Total progress" bar, batch size 8. Do you monitor GPU power? I have a feeling my PSU just isn't outputting what it wants, never really goes above 87 or 90%. Not that this is the place for this, I understand. Thanks for everything, really appreciate it.

Kosinkadink commented 2 years ago

I wish I could test this right now, but I'm on a trip for the weekend. Normally I'd leave my PC running so I can remote into it and do some work, but having the 4090 for only a few days didn't give me enough confidence it wouldn't accidentally burn my home down while I'm gone.

Bachine commented 2 years ago

> @ilcane87 Please test this latest one: https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

Silly question, as I have no experience with this kind of stuff: where do I put this file, and what do I do to make sure it works?

C43H66N12O12S2 commented 2 years ago

Just activate venv (if you're using it) and do pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

Bachine commented 2 years ago

I'm sorry, I need it more simplified XD. In cmd, which directory do I cd into before pip installing the file?

SSCBryce commented 2 years ago

Are you in Windows? Run that command from the root directory of stable-diffusion-webui.

Bachine commented 2 years ago

ok so cd (my main stable diffusion folder)?

SSCBryce commented 2 years ago

Yep. Are you also using MINGW or just a command prompt?

Bachine commented 2 years ago

Just cmd. I'm still getting under 10it/s with that xformers wheel. Are there specific arguments I need to boost the speed for my 4090?

Kosinkadink commented 2 years ago

Be sure to activate the venv like they said. If you're in Windows, do this by running the venv\Scripts\activate.bat script (if I remember correctly). You'll know you're in the venv when your command prompt input has (venv) prepended to it.

SSCBryce commented 2 years ago

I'm assuming you set COMMANDLINE_ARGS=--force-enable-xformers?

C43H66N12O12S2 commented 2 years ago

@Farfie 123123

Bachine commented 2 years ago

when i click the activate.bat, it pops up for a millisecond but doesn't stay open

Kosinkadink commented 2 years ago

@Bachine activate.bat is the script that sets your current command prompt session to reference that particular virtual environment, so you need to run it from the command prompt after cd'ing to the stable-diffusion-webui directory. By executing that script, your current command prompt session will be set to reference the same virtual environment your webui-user.bat uses. If you don't do this step, your 'pip install' command will affect your global Python installation instead of the virtual environment your local stable-diffusion-webui uses.
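To double-check that the venv is actually active before running pip, the interpreter itself can be asked. This is a minimal sketch using only the standard library; `in_virtualenv` is a hypothetical helper name. It relies on the documented behavior that `sys.prefix` differs from `sys.base_prefix` inside a venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("interpreter:", sys.executable)
print("inside a venv:", in_virtualenv())
```

Running this with `python` from the same command prompt session should show an interpreter path inside the webui's venv folder if activation worked.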

SSCBryce commented 2 years ago

@C43H66N12O12S2 Only 335? Even that seems a bit low. I reset the values and timer after about a minute of embed training batch size 2 on NAI (3 doesn't increase performance and 4 will throw a CUDA out of memory error). The power really seems to fluctuate wildly, and constantly for me. GPU TWEAK constantly showing GPU power go from 75% down to even 30% sometimes, temp stays around high 50s, peaks at maybe 63C. Can hear the fans going up down up down...

C43H66N12O12S2 commented 2 years ago

Well, I've lowered the power limit down to 337W, so anything higher than that will be a transient spike.

That's to say, your issue - whatever it is - isn't related to your PSU.

Bachine commented 2 years ago

So while I'm in the venv, I pip install from the main stable diffusion directory?

SSCBryce commented 2 years ago

@C43H66N12O12S2 Oh, didn't take that limit literally. Foolish of me. I suppose now I'll try turning off all other software in an attempt to get it to behave normally... any chance it's CPU bound? It's only an i5 10400, but it seems to pull maybe 38% on average.