Open ghost opened 2 years ago
I hope to have the new Dell Alienware R15 soon with the Intel Raptor Lake. Does the torch or cuda for CPU make use of all the hardware threads/cores?
No; one of my machines has a Ryzen 5825U (8C/16T) and runs SD on the CPU at about 50% usage and 6-7 GB of RAM.
I think it's because SD is designed with Nvidia's CUDA in mind, so it isn't optimized in any way to make better use of the CPU. OpenVINO and several other forks have optimizations that speed up SD when running on a CPU; hopefully these can be implemented here so they can be used with the user-friendly web UI.
CPUs in general aren't as good at the kind of math machine learning uses as GPUs are, which is why these tools are all built for CUDA. That being said, I wouldn't mind pushing my machine harder to use both the CPU and GPU for batches if it were possible, and to offload generation of sample images during training to the CPU, where speed doesn't matter so much.
I just wanted to add that PLMS sampler isn't working for me in cpu mode.
Also, for older cards like my GT 730, I found the following settings are required:
set COMMANDLINE_ARGS=--precision full --no-half --use-cpu all --skip-torch-cuda-test
set CUDA_VISIBLE_DEVICES=-1
I just wanted to add that PLMS sampler isn't working for me in cpu mode.
DDIM doesn't work either. We should share our findings so a CPU section can be added to the wiki; that way everyone knows what limitations running on the CPU has, and maybe it can be optimized or fixed later on.
@tkalayci71 where did you put those settings? Is it in the webui-user.sh file?
webui-user.bat for windows. I assume for linux it would be webui-user.sh:
export COMMANDLINE_ARGS="--precision full --no-half --use-cpu all --skip-torch-cuda-test"
export CUDA_VISIBLE_DEVICES="-1"
(not sure about quotes here)
@ghost573 running SD on the CPU with 50% usage
My CPU gets 100% utilized by setting the number of threads torch uses to the actual available number; iteration time was reduced by 4 s. I just added the following two lines to /modules/devices.py, below cpu = torch.device("cpu"):
torch.set_num_threads(<number of threads>)
torch.set_num_interop_threads(<number of threads>)
where <number of threads> is the logical processor count. It would be more convenient if this could be set via a launch parameter.
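The patch above can be sketched as a small helper so the count is configurable instead of hard-coded. Note this is only a sketch: the helper and the SD_NUM_THREADS environment variable name are hypothetical, not part of the web UI.

```python
import os

def thread_count_from_env(default=None):
    """Thread count to pass to torch.set_num_threads(): an (assumed)
    SD_NUM_THREADS environment variable if set, otherwise the logical
    processor count."""
    value = os.environ.get("SD_NUM_THREADS")  # hypothetical variable name
    if value is not None and value.isdigit():
        return int(value)
    return default if default is not None else (os.cpu_count() or 1)

# In modules/devices.py, just below `cpu = torch.device("cpu")`:
#   torch.set_num_threads(thread_count_from_env())
#   torch.set_num_interop_threads(thread_count_from_env())
# Note: torch.set_num_interop_threads() may only be called once, before
# any parallel work has started.
```
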
@RNG42 Been testing this and maybe it's an AMD issue, but I get worse performance when I add that. It goes from 13 s/it to 15 s/it; my CPU utilization goes up, but for some reason my performance decreases.
@RNG42 I tried it with the same seed, prompts and settings, I get worse performance strangely. Might just be an AMD issue.
@ghost573 There are no AMD CPU issues on my end; everything works as expected.
@RNG42 Strange; on my 5825U performance degrades with that tweak, and I found it's slightly faster without it. Just curious: if you're using Linux, are you using p-state or acpi?
Using Windows with the high-performance power plan; no other special settings. To what number did you set the threads?
16 but that might explain the differences. Using linux, not windows.
You might need to set these environment variables on Linux to get it working properly. Copy the following lines into webui-user.sh:
export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16
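If you'd rather set these from Python, they have to be exported before torch (and its OpenMP/MKL backends) is first imported, because the runtimes read them once at load time. A minimal sketch:

```python
import os

# Must run before `import torch`: the OpenMP/MKL runtimes read these
# variables once, when the libraries are loaded, and ignore later changes.
os.environ.setdefault("OMP_NUM_THREADS", "16")
os.environ.setdefault("MKL_NUM_THREADS", "16")
```
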
Well, I tried it and performance really is worse. Goes from 12-13s/it to 14-15s/it
It goes from 13 s/it to 15 s/it
@ghost573 It might simply be a throttling issue; check the frequencies of the cores.
torch.set_num_threads(<number of threads>)
torch.set_num_interop_threads(<number of threads>)
I tested different thread counts with my i5-11400:
2 threads - 5.77 s/it
4 threads - 4.46 s/it
6 threads - 4.12 s/it
8 threads - 4.65 s/it
12 threads - 4.83 s/it
20 Steps, 384x384, DPM++ 2M Karras
It seems like setting it to half the total threads gives the best performance?
I've come to that conclusion; it just seems to be a generic fact. torch, onnx, and openvino all use half your CPU threads by default, and that seems to be optimal, rather than using more threads and getting only slightly faster, sometimes slower, speeds.
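That heuristic, half the logical processors (roughly one thread per physical core), can be written as a tiny helper; it's only a rule of thumb from the benchmarks above, so measure on your own machine before relying on it:

```python
import os

def default_torch_threads(logical=None):
    """Half the logical processor count, minimum 1 — the 'half your
    threads' heuristic from the benchmarks above."""
    if logical is None:
        logical = os.cpu_count() or 2  # fall back if the count is unknown
    return max(1, logical // 2)
```

For example, `default_torch_threads(12)` gives 6, matching the best i5-11400 result in the table above.
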
Hey! Thank you! This solved my problem, and I can confirm that on my Intel Core i9 the optimal number of threads is 10, which is half the available threads.
I was trying to use a GPU (AMD Radeon RX 580 8 GB) but something in macOS Ventura makes the GPU crash (black screen and fans ramp up to 100%).
So, disabling the GPU and using those 2 lines of code saved me.
I'm using Colab (CPU); sometimes I can't use a GPU, I don't know why. So I added this to /content/stable-diffusion-webui/webui-user.sh:
export COMMANDLINE_ARGS="--precision full --no-half --use-cpu all --skip-torch-cuda-test"
export CUDA_VISIBLE_DEVICES="-1"
but I got an error:
Commit hash: 4af3ca5393151d61363c30eef4965e694eeac15e
Installing requirements for Web UI
Launching Web UI with arguments: --share --xformers --enable-insecure-extension-access
Warning: caught exception 'No CUDA GPUs are available', memory monitor disabled
LatentDiffusion: Running in eps-prediction mode
Traceback (most recent call last):
File "launch.py", line 299, in
torch.set_num_threads(<number of threads>)
torch.set_num_interop_threads(<number of threads>)
Core i7 Mac Mini (2018). 512x512, DPM++ SDE Karras, 11 steps. Setting these parameters to 6 reduces generation time from ~11 minutes to 4-5 minutes.
Is there an existing issue for this?
What would your feature do?
As per the #3300 discussion, I think some optimizations for running SD on the CPU are possible. They don't have to be major; even minor improvements will benefit those who have a powerful CPU but an old GPU that isn't capable of running SD.
Quoting the discussion to show some optimizations are possible.
Proposed workflow
Add --skip-torch-cuda-test --use-cpu all to webui-user and launch the webui.
Additional information
N/A