Bachine opened 2 years ago
What parameters are you using? You might want to try, say, a batch of 5+ images. Also, is xformers installed and are you using --xformers? Are you using --opt-channelslast? Half precision? etc.
I'm having the same issue: my 4090 is generating slower than (or around the same speed as) my 3090 used to on the same machine. I am using Windows 10, with --xformers as my only parameter. I will be setting up my 3090 in a different PC soon, so I will be able to provide a direct it/s comparison.
My "benchmark" is using just the prompt "chair" at 150 steps with all default settings (Euler a, 512x512, Scale 7, Clip Skip 1, ENSD 0).
Using --xformers doesn't seem to make a difference, either way I'm getting around 10.6 it/s.
Xformers currently lacks support for Lovelace (in fact, PyTorch also lacks it, I believe).
Your quoted 3090 numbers are too low BTW. I get around 16it/s with your settings on a 3080 12GB.
I'll perform some further testing when my 4090 arrives. (and attempt to build xformers for it)
Gotcha, guess my 4090 performance will be meh until PyTorch (and xformers) get Lovelace support.
Those numbers were for my 4090, my 3090 was not plugged in at the time, but it is now.
With those same settings, my 3090 gets around 15.7 it/s without --xformers. I don't have xformers set up yet on that machine (I'm running Ubuntu and will need to use a workaround to get xformers installed properly).
So the 4090 currently delivers only two-thirds the performance of a non-xformers 3090.
I had hit 15-18 it/s with my 3090, but now it's 13. No command line parameters, still the same setup. Strange.
Preliminary testing: the 4090 (or JIT PTX) really dislikes channels last; it halves performance. Without channels last, my 4090 is about 10% slower than my 3080 with torch 1.12.1 + cu116.
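(For context: --opt-channelslast presumably just switches the model to torch's channels-last memory format. A minimal sketch of that torch call, with a toy conv layer standing in for the SD UNet:)

```python
import torch
import torch.nn as nn

# Toy conv layer standing in for the SD UNet; purely illustrative.
model = nn.Conv2d(4, 4, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 4, 64, 64, device="cuda")

# Switch weights and activations to the channels-last (NHWC) memory layout.
model = model.to(memory_format=torch.channels_last)
x = x.contiguous(memory_format=torch.channels_last)

y = model(x)
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```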
Updating cuDNN did the trick. Getting 15it/s without xformers.
To support Lovelace in xformers, we need a CUDA 11.8 build of PyTorch (I think.)
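A quick way to check whether an installed PyTorch build even ships Lovelace (sm_89) kernels; a minimal sketch using standard torch introspection calls:

```python
import torch

# Compute capabilities this PyTorch build ships compiled kernels for;
# Lovelace (RTX 4090) is sm_89. Anything missing falls back to JIT PTX.
print(torch.cuda.get_arch_list())

# CUDA toolkit version the build was linked against, e.g. '11.6'.
print(torch.version.cuda)
```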
Nope, managed to build it. Getting a 43% speed-up compared to my 3080: 23it/s.
With batch size 8, my 4090 is twice as fast compared to the 3080. ~40it/s
@ilcane87 @comp-nect @Kosinkadink Could you please test this wheel? This works on my 4090, but I need to make sure there isn't a regression (broken on Ampere or Pascal or whatever) with --force-enable-xformers
https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/d/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
@C43H66N12O12S2 Doesn't work for me on GeForce GTX 1060 6GB:
```
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
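(That error typically means the wheel contains no kernels compiled for the GPU's compute capability. A rough check using standard torch calls, with capability values from NVIDIA's published specs:)

```python
import torch

# Compute capability of the running GPU: (6, 1) for a GTX 1060 (Pascal),
# (8, 6) for a 3090, (8, 9) for a 4090. "No kernel image" means the wheel
# wasn't built with kernels for this value.
print(torch.cuda.get_device_capability(0))
```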
@ilcane87 Please try this one. Much thanks for testing these for me :) https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/e/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
@C43H66N12O12S2 I'd love to test this on my 4090 :D. Do I just activate venv and then pip install it while it sits in webui root dir?
Yep
@C43H66N12O12S2 Oh, it also says the same version; should've seen that coming... assuming I also just add --force-reinstall?
Sorry if I'm not supposed to be pinging you; I don't really ever develop software... I seem to have borked it. It's the same issue I was having when I was trying to build this stuff myself. This is after it failed to load, and I also added --skip-torch-cuda-test to at least make it load.
pip seems to have replaced your torch with a CPU-only one. Do pip uninstall torch and start the repo again.
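(If you suspect the same thing happened to you, a minimal check for a CPU-only torch install; the version strings in the comments are illustrative:)

```python
import torch

# A CUDA build reports e.g. '1.12.1+cu116'; a CPU-only wheel '1.12.1+cpu'.
print(torch.__version__)
print(torch.version.cuda)         # None on CPU-only builds
print(torch.cuda.is_available())  # False if torch can't see the GPU
```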
On Ubuntu with xformers, my 3090 can get 20it/s. It looks like the 4090 doesn't improve much; maybe we still need to wait for some driver updates.
> @ilcane87 Please try this one. Much thanks for testing these for me :) https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/e/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
Still getting the same error on this one too.
@C43H66N12O12S2 Is there any way to ensure it's using the new wheel? I'm getting about 11it/s at 512x512, CFG 8, 40 steps, Euler a, on NAI (not sure which of these affect performance, so I'm noting most settings).
Well, it would error with an older wheel. You need to add --force-enable-xformers to your COMMANDLINE_ARGS, btw (for now).
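(One way to confirm which xformers build the venv is actually importing; a minimal check run from the activated venv:)

```python
import xformers
import xformers.ops

# Shows which xformers build is imported, and from which location on disk.
print(xformers.__version__)
print(xformers.__file__)
```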
Oh, interesting. It is in there, though, along with --no-half and --precision full for NAI. Not sure if the model is making the difference.
Can you test your performance without xformers? I get 23it/s and --no-half should be about half that speed, so 11it/s sounds about right, actually.
Weird. On first load without forced xformers, it never got past "Commit hash: blabla". Closing the cmd and starting it again produced a regular load-up time, though. Without xformers, with the same generation settings and cmdargs, I'm getting about 9.6it/s. No batching.
Yeah, xformers is working for you. Use larger batches for bigger gains. Also consider removing --no-half and --precision full; IME FP16 and FP32 have only minute differences, but FP16 is twice the speed.
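(To see the FP16/FP32 gap on your own card, a toy matmul benchmark is enough; this is only a rough sketch and not representative of full SD inference:)

```python
import time
import torch

def bench(dtype, n=4096, iters=20):
    # Time n x n matmuls on the GPU at the given precision.
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

print("fp32:", bench(torch.float32))
print("fp16:", bench(torch.float16))  # typically far faster on tensor cores
```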
@ilcane87 Please test this latest one: https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
Was literally about to ask if I really need those, yeah, thanks for pointing it out! Wish more coders were as clearly spoken and thoughtful as you are, man. Two random questions while I have your attention :D Should this help training, too? That's what I'm currently into, and I was seeing similar numbers. Should I batch that as well? And then, is it better to increase batch count or size? Same either way? I always see a "decrease" in it/s when I batch, but maybe that's just the UI giving weird numbers.
Yes. Increasing size is equivalent to increasing count, but increasing size will also increase your speed on these large GPUs (practically anything faster than a 1070), where VRAM bandwidth is a large bottleneck.
You're seeing a decrease because it's generating multiple images at once. To calculate what it corresponds to for a single image, multiply iterations per second by batch size.
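(For example, with hypothetical numbers:)

```python
# Hypothetical numbers: the UI shows 3.0 it/s at batch size 8.
displayed_its = 3.0
batch_size = 8

# Single-image-equivalent throughput.
print(displayed_its * batch_size)  # 24.0 it/s
```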
Ah, makes sense, very good to know. However... something is strange again (screenshot attached).
> @ilcane87 Please test this latest one: https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
This one works!
@Farfie I assume you're referring to the slow speeds. Are you using the full-EMA model? Prune it, or just use one of the pruned ones.
Yes, that's what I meant, apologies. I'm using "animefull-final-pruned" from the leaked stableckpt; not sure if there's another version people have started tossing around. Though I did forget to put the VAE back in after training; perhaps that has something to do with it.
Try deleting/moving your .yaml file. If you still get slow speeds, that needs a larger troubleshooting effort.
Hmm, moving the yaml kept things the same: 3.5it/s on the "100%" bar and 3it/s on the "Total progress" bar, batch size 8. Do you monitor GPU power? I have a feeling my PSU just isn't outputting what the card wants; it never really goes above 87 or 90%. Not that this is the place for this, I understand. Thanks for everything, really appreciate it.
I wish I could test this right now, but I'm on a trip for the weekend. Normally I'd leave my PC running so I can remote into it and do some work, but having the 4090 for only a few days didn't give me enough confidence it wouldn't accidentally burn my home down while I'm gone.
> @ilcane87 Please test this latest one: https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl
Silly question, as I have no experience with this kind of stuff: where do I put this file, and what do I do to make sure it works?
Just activate the venv (if you're using it) and do `pip install -U -I --no-deps https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/f/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl`
I'm sorry, I need it more simplified XD. In cmd, which directory do I cd into before pip installing the file?
Are you in Windows? Run that command from the root directory of stable-diffusion-webui.
ok so cd (my main stable diffusion folder)?
Yep. Are you also using MINGW or just a command prompt?
Just cmd. I'm still getting under 10it/s with that xformers wheel; are there specific arguments I need to boost the speed for my 4090?
Be sure to activate the venv like they said. If you're in Windows, do this by running the venv/Scripts/activate.bat script (if I remember correctly). You'll know you're in the venv when your command prompt has (venv) at the beginning.
I'm assuming you set COMMANDLINE_ARGS=--force-enable-xformers?
@Farfie When I click the activate.bat, it pops up for a millisecond but doesn't stay open.
@Bachine The activate.bat script sets your current command prompt session to use that particular virtual environment, so you need to run it from the command prompt after cd'ing to the stable-diffusion-webui directory. That points your session at the same virtual environment your webui-user.bat uses. If you skip this step, your pip install command will affect your global Python installation instead of the virtual environment your local stable-diffusion-webui uses.
@C43H66N12O12S2 Only 335? Even that seems a bit low. I reset the values and timer after about a minute of embedding training at batch size 2 on NAI (3 doesn't increase performance and 4 throws a CUDA out-of-memory error). The power really seems to fluctuate wildly and constantly for me: GPU TWEAK constantly shows GPU power going from 75% down to even 30% sometimes, while the temp stays around the high 50s and peaks at maybe 63C. I can hear the fans going up, down, up, down...
Well, I've lowered the power limit down to 337W, so anything higher than that will be a transient spike.
That is to say, your issue - whatever it is - isn't related to your PSU.
So while I'm in the venv, I run pip install from the main stable diffusion directory?
@C43H66N12O12S2 Oh, I didn't take that limit literally. Foolish of me. I suppose now I'll try turning off all other software in an attempt to get it to behave normally... any chance it's CPU-bound? It's only an i5 10400, but it seems to pull maybe 38% on average.
Is the 4090 fully supported in SD?
I am getting the same performance with the 4090 that my 3070 was getting.