[Bug]: Abysmal performance on RTX 4090

levicki commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues and checked the recent builds/commits

What happened?

I have seen a few issues discussing poor RTX 4090 performance, but none of them have any resolution.

I just did a clean install of the repository few minutes ago trying to fix it, and I am getting only ~7.5 it/s with default settings (Euler-a, 20 iterations, batch size 1, 512x512px, sd-v1-4-full-ema.ckpt).

I think there might have been some performance regression, and I just can't believe that nobody working on this project has a 40x0 series card to test before pushing a commit.

Steps to reproduce the problem

Do a fresh install
Test image generation on an RTX 4090 without xformers using default parameters mentioned above 4 times
Discard first image it/s result
Observe pathetic ~7.5 it/s for the next 3 images

What should have happened?

Iteration speed of at least 25 it/s without xformers.

Commit where the problem happens

602a1864b05075ca4283986e6f5c7d5bce864e11

What platforms do you use to access UI ?

Windows

What browsers do you use to access the UI ?

Brave

Command Line Arguments

No response

Additional information, context and logs

No response

BrentOzar commented 1 year ago

So there are existing issues without answers, and you decided the right thing to do was … add yet another issue?

levicki commented 1 year ago

So there are existing issues without answers, and you decided the right thing to do was … add yet another issue?

I did because the issue I found was a question, and this is a bug report as the performance has regressed to about half it was before (and sadly I can't tell you how much before, or I would have just rolled back to that commit already).

For example #2449 in particular has over 200 posts and a lot of people asking whether procedure described in one of those posts can be also applied to Windows install (I am running this on Windows).

Linked from there is a pull request #7056, which suggests increasing torch and CUDA version, but it seems to break Dreambooth which is unacceptable. Despite it seemingly not being approved by @AUTOMATIC1111 I see that launch.py change in the commit I linked here in the bug report and which gives me the abysmal speed noted above.

Given the difficulty Windows users have with building Torch (and its dependencies) it would be very nice if 4000 series got some attention -- upgrading to Torch 2.0 and CUDA 12.0 (or at least 11.8) would be a reasonable first step towards allowing those cards to reach their full potential.

levicki commented 1 year ago

Let me add that just replacing the CUDNN binaries in venv/lib/site-packages/torch/lib (1.13.1+cu117) with the latest ones (v8.7.0 for CUDA 11.x) brings the speed up from ~7.5 it/s to ~11 it/s for me but this is still slower than before and slower than 3000 series cards.

XeonG commented 1 year ago

Never knew it could be faster only started trying this stuff with the 4090 in the last week.. hopefully it does get back to what the hardware is capable of soon.

will try the 8.7.0 files

levicki commented 1 year ago

Of course it can be faster — 4090 should have double the performance of the 3090 Ti.

ChinatsuHS commented 1 year ago

Few things you can do to benchmark better is

roll back the commits to before the slowdowns happened with current nvidia drivers.. are the slowdowns still the same then it is the nvidia drivers most likely
rollback nvidia drivers to earlier versions ... are the slowdowns still happening .. try using nvtop to see if anything else is eating away at the vram while generating.
try using studio drivers instead of game ready drivers.

seoeaa commented 1 year ago

I have a speed on 4090 7.5 it / s, I managed to double it, but the training broke

javsezlol1 commented 1 year ago

@seoeaa howd you manage to double it?

levicki commented 1 year ago

@ChinatsuHS

roll back the commits to before the slowdowns happened with current nvidia drivers.. are the slowdowns still the same then it is the nvidia drivers most likely

As I said, if I knew how far I have to roll back I would have already done it. I only noticed the slowdown with this latest commit but when I started using this repository approx. 2 months ago I had at least 15 it/s. Who knows which of the dozens of commits made since then could have caused it.

If you could give me a couple of commit hashes which introduced changes that are likely to cause slowdowns, then I could rollback to those specific commits, but I am not going to rollback through dozens of them one by one to find it — there must be a better way.

rollback nvidia drivers to earlier versions ... are the slowdowns still happening .. try using nvtop to see if anything else is eating away at the vram while generating.

Rolling back NVIDIA drivers is not an option — I am on 528.02, and the lowest I could go is 526.98 without breaking other things which I use (Daz Studio and Iray). Furthermore, I am sure nothing else is using VRAM when I am running SD.

try using studio drivers instead of game ready drivers.

Already using Studio Driver because of aforementioned Daz Studio and Iray. No slowdowns there so I highly doubt it is the driver causing it. I can upgrade the driver to the freshly released 528.24 and retest, but frankly I don't expect any changes in performance since it is mostly a bugfix for Adobe creative applications.

@seoeaa @javsezlol1 Gentlemen please take your discussion on possible workarounds to the issue #2449 where those were already being discussed (or create a new issue if you wish) — I would appreciate if the level of noise here is kept to the minimum so that developers can focus on my bug report and that I can focus on providing them with additional info without all of us having to wade through dozens of offtopic posts. Thanks in advance for your understanding.

levicki commented 1 year ago

@AUTOMATIC1111

OK, here is what and how I tested:

Rolled back to NVIDIA Studio Driver version 526.98 (November 16th, 2022).
Reinstalled the repository by deleting all files, cloning, and checking out the commit 7ba9bc2fdbfae8115294962510492faafeb48573 (December 19th, 2022).

That combination gives me ~9.25 it/s instead of ~7.5 it/s.

Iteration speed I am reporting is the average of 3 consecutive runs with default settings, identical prompt and seed. The result from the first warmup run is discarded.

Updating the repository (git pull origin master) to as of this writing latest commit 9beb794e0b0dc1a0f9e89d8e38bd789a8c608397 with requisite COMMANDLINE_ARGS=--reinstall-torch yields ~7.51 it/s.

Reverting back to NVIDIA Studio Driver 522.25 (October 12, 2022) which is the first available driver for 4000 series cards while staying on the latest commit yields ~7.80 it/s.

I have also tried randomly picked older commit (47a44c7e421b98ca07e92dbf88769b04c9e28f86), both without and with downgrading torch and torchvision — torch==1.12.1+cu113 torchvision==0.13.1+cu113 definitely results in extra 2 iterations per second compared to the torch==1.13.1+cu117 torchvision==0.14.1+cu117 on my 4090 RTX irregardless of the NVIDIA driver used.

After that I reinstalled again and reverted to the latest commit before the torch upgrade (commit 59146621e256269b85feb536edeb745da20daf68) — with torch==1.12.1+cu113 torchvision==0.13.1+cu113 and with replacing the CUDNN binaries in venv/lib/site-packages/torch/lib with the latest ones (v8.7.0 for CUDA 11.x) I am getting ~12.64 it/s, and I know that this card should be capable of at least double that number.

Please consider upgrading to torch 2.0 or at least revert the change to cu117 because it degrades performance. Also, please consider upgrading CUDNN binaries that are coming with torch if possible.

Juginchi commented 1 year ago

Not sure if this solution still applies but these people are all talking about 20+ it/s with their 4090s last month: https://www.reddit.com/r/StableDiffusion/comments/y71q5k/4090_cudnn_performancespeed_fix_automatic1111/

levicki commented 1 year ago

@Juginchi They talk about replacing CUDNN binaries which brings you to ~12 it/s, which, if you read my post above, I already did.

I didn't mess with xformers because their output was non-deterministic (i.e. they did not produce same image with the same prompt and seed). I am not sure if that has changed in the meantime, but my main problem is that current torch version is making SD run slower on 4000 series cards.

EDIT: I just tested (still on commit 59146621e256269b85feb536edeb745da20daf68), and I can get ~15 it/s with xformers enabled, but the images from the same seed and prompt are different. Maybe that is fixed in later commits, but they degrade performance for me as explained above.

levicki commented 1 year ago

Further testing:

I installed torch 2.0+cu118 and torchvision 0.15.0+cu118, replaced CUDNN libraries with the latest ones, and now I am getting 14 it/s without xformers.

For those wanting to try this, you can edit launch.py in the repo root to look like this:

    torch_command = os.environ.get('TORCH_COMMAND', "pip install torch==2.0.0.dev20230128+cu118 torchvision==0.15.0.dev20230128+cu118 --index-url https://download.pytorch.org/whl/nightly/cu118")

Before launching, activate the environment and manually reinstall torch first:

venv\scripts\activate
pip install --force-reinstall torch==2.0.0.dev20230128+cu118 torchvision==0.15.0.dev20230128+cu118 --index-url https://download.pytorch.org/whl/nightly/cu118

Then overwrite CUDNN binaries in venv/lib/site-packages/torch/lib with the latest ones (v8.7.0 for CUDA 11.x from NVIDIA.

Disclaimer: I did not test whether this change affects reproducibility of the old seeds and prompts, try it at your own risk!

Creative-Ataraxia commented 1 year ago

I'm only getting 1.2 it/s when training on the 4090, this can't be normal can it?

ChinatsuHS commented 1 year ago

apparently there is a discord bug that causes Nvidia gpu's to not reach the correct clockspeeds..

maybe that explains the sudden reduction of it/s with generating/training

(is getting fixed by discord)

so guess you have to have the discord app/site closed if you want full performance

levicki commented 1 year ago

@ChinatsuHS I am not running Discord so no that doesn't explain it.

Also, in my opinion, this project is moving too fast for its own good — they should slow down with adding new features and focus on fixing bugs and performance regressions for a bit.

MproductionMaddue commented 1 year ago

I'm only getting 1.2 it/s when training on the 4090, this can't be normal can it?

It is exactly the same for me and I feel like I am missing out on a fix but I cant find anything useful out there. I found this comment on reddit but I have no idea what to do on step 4 /5/6. the directory in step 4 venv\Scrips\activate pip doenst exist for me. it ends in the Scripts folder. Maybe I am too much of a noob to know what a pip is...

"1 git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git

2 edit launch.py: replace torch_command = os.environ.get('TORCH_COMMAND', "pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113") with torch_command = os.environ.get('TORCH_COMMAND', "pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116") run web-user.bat

3 download cuda files from https://developer.download.nvidia.com/compute/redist/cudnn/v8.6.0/local_installers/11.8/ copy .dll files from the "bin" folder in that zip file, replace the ones in "stable-diffusion-main\venv\Lib\site-packages\torch\lib"

4 download file locally from: https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases/download/d/xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl copy xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl file to root SD folder venv\Scrips\activate pip install xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl

5 add --xformers to web-user.bat command arguments

6 add model run webui-user.bat

7 other things: used firefox with hardware acceleration disabled in settings on previous attempts I also tried --opt-channelslast --force-enable-xformers but in this last run i got 28it/s without them for some reason

Results, default settings, empty prompt:

batch of 8: best: 3.54it/s (28.32it/s), typical 3.45 (27.6it/s)

single image: best 22.60it/s average: 19.50it/s

system: RTX 4090, Ryzen 3950x, 64GB 3600Mhz, M2 NVME"

clapbr commented 1 year ago

for me it's like this on my Ryzen 5 2600 (soon to be upgraded) + RTX 3090 32GB RAM @ 3000Mhz:

Linux with xformers = 18it/s
Windows 11 with xformers and Hardware Accelerated GPU Scheduling ON = 10it/s
Windows 11 with xformers and Hardware Accelerated GPU Scheduling OFF = 14it/s

That big offset only happens with batch size of 1, increasing it on Windows makes it perform closer. GPU usage also stays low with batch size 1.

Latest drivers on both OS.

levicki commented 1 year ago

I am closing this issue because:

Apparently devs won't even consider looking at it as they seem to be chasing the next shiny feature to add to the GUI even if the final result ends up being unusable due to the overall slowness of both inference and training on the latest hardware.
It attracts random inexperienced users who post their results and workarounds without understanding what they are doing and without reproducible testing methodology which decreases signal to noise ratio and makes this issue even less likely to be addressed by devs -- previous issue similar to this one has over 200 comments and it didn't get any attention.

aristotaloss commented 1 year ago

@levicki 2.5 months later and your foreboding message became the truth: this project has become the ashes of its own heat. Absolutely unusable, as seems to be the case with anything written in Python.

levicki commented 1 year ago

@Velocity- From my experience, Python is mostly used by scientists who understand domain specific stuff, but lack the proper software engineering discipline.

Anyway, you can't really claim that "anything written in Python is unusable" -- there is Ren'py which is quite usable and a shining example of a good program written in Python.

Language is never the problem because every language is good for something (Python is good for AI/ML because of torch).

I got my share of fun out of this project for free so I am thankful to authors and contributors who made this repo. I just wish they took a more structured approach and kept steadily improving it instead of always chasing the latest gimmick and breaking other stuff that worked in the process.

aristotaloss commented 1 year ago

@levicki The hyperbole "anything written in Python is unusable" should be read as "almost anything written in Python is unusable". I have 4 Python processes running on my desktop -- none of which have gotten in my way. networkd-dispatcher and others.

Yet almost everything that uses Python falls victim to Python. Python is awful language design and the AI community would have been further ahead were it not for the chosen foundation. Every project I bump into is filled to the brim with issues, crashes, and people having absolutely no idea what's going on. Python is a good tool for something quick: parsers, scrapers, cli apps, etc. But its current use is far beyond what it can handle.

Give any newcomer to programming the advice "try Python" and you'll find them struggling hopelessly with indentation the days or weeks that follow. Is it "easy" because it is the closest to pseudocode?

So I respectfully disagree, language is often the problem, and PHP would love to chime in if it wasn't for the shackles around his legs.

levicki commented 1 year ago

@Velocity- I wouldn't want to turn this into a discussion about languages, but here is what I think.

Python as a language is perfectly fine.

Yes, I don't like the indentation rules and consider them archaic (compiler should ignore whitespace unless it is part of a string literal).

However, Python does support object oriented programming and I have seen quite a lot of neat and clean Python code. On top of that, Python has important widely used libraries for number crunching and ML (numpy, torch, etc) so it is a tool of choice for scientific projects.

PHP is the language I wrote my website and its minimal CMS in from scratch when I decided to have a web presence back in 2007 even though I never used PHP before that (I have ASM/C/C++ background). PHP also supports object oriented programming and I have seen a lot of neat and clean PHP code as well.

You can say that C# is a better designed language, but I have seen inexperienced developers write equally bad code in C# as others do in Python and PHP.

I stand by my assesment that the reason any language suffers is because of the demographics that is using them.

Just look at how much natural languages such as English have been distorted, and how many words have changed their meaning as of late because of so many people using them incorrectly -- programming languages are no different in that regard.

So no, it's never the fault of the language but of the user and their capacity (and desire) for learning and understanding how to use it properly.

AUTOMATIC1111 / stable-diffusion-webui