Open ghost opened 2 years ago
I hope to have the new Dell Alienware R15 soon with the Intel Raptor Lake. Does the torch or cuda for CPU make use of all the hardware threads/cores?
No; one of my machines has a Ryzen 5825U (8C/16T) and runs SD on the CPU at about 50% usage and 6-7 GB of RAM.
I think it's because SD is designed with Nvidia's CUDA in mind, so it isn't optimized in any way to make better use of the CPU. OpenVINO and several other forks have optimizations that speed up SD when running on a CPU; hopefully these can be implemented here so they can be used with the user-friendly web UI.
CPUs in general aren't as good at the kind of math machine learning uses as GPUs are, which is why these tools are all built for CUDA. That being said, I wouldn't mind pushing my machine harder to use both the CPU and GPU for batches if it were possible, and to offload generation of sample images during training to the CPU, where speed doesn't matter so much.
I just wanted to add that PLMS sampler isn't working for me in cpu mode.
Also, for older cards like my GT 730, I found the following settings are required:
set COMMANDLINE_ARGS=--precision full --no-half --use-cpu all --skip-torch-cuda-test
set CUDA_VISIBLE_DEVICES=-1
I just wanted to add that PLMS sampler isn't working for me in cpu mode.
DDIM doesn't work either. We should share our findings so a CPU section can be added to the wiki; that way everyone knows what limitations running on the CPU has, and maybe it can be optimized or fixed later on.
@tkalayci71 where did you put those settings? Is it in the webui-user.sh file?
webui-user.bat for windows. I assume for linux it would be webui-user.sh:
export COMMANDLINE_ARGS="--precision full --no-half --use-cpu all --skip-torch-cuda-test"
export CUDA_VISIBLE_DEVICES="-1"
(not sure about quotes here)
@ghost573 running SD on the CPU with 50% usage
My CPU gets 100% utilized by setting the number of threads torch uses to the actual available number; iteration time was reduced by 4 s. I just added the following two lines to /modules/devices.py, below cpu = torch.device("cpu"):
torch.set_num_threads(<number of threads>)
torch.set_num_interop_threads(<number of threads>)
where <number of threads> is the logical processor count. It would be more convenient if this could be set via a launch parameter.
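The patch above can be sketched as a small helper so the count is configurable instead of hard-coded. Note this is only a sketch: the helper and the SD_NUM_THREADS environment variable name are hypothetical, not part of the web UI.

```python
import os

def thread_count_from_env(default=None):
    """Thread count to pass to torch.set_num_threads(): an (assumed)
    SD_NUM_THREADS environment variable if set, otherwise the logical
    processor count."""
    value = os.environ.get("SD_NUM_THREADS")  # hypothetical variable name
    if value is not None and value.isdigit():
        return int(value)
    return default if default is not None else (os.cpu_count() or 1)

# In modules/devices.py, just below `cpu = torch.device("cpu")`:
#   torch.set_num_threads(thread_count_from_env())
#   torch.set_num_interop_threads(thread_count_from_env())
# Note: torch.set_num_interop_threads() may only be called once, before
# any parallel work has started.
```
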
@RNG42 Been testing this and maybe it's an AMD issue, but I get worse performance when I add that. It goes from 13 s/it to 15 s/it; my CPU utilization goes up, but for some reason my performance decreases.
@RNG42 I tried it with the same seed, prompts and settings, I get worse performance strangely. Might just be an AMD issue.
@ghost573 There are no AMD CPU issues on my end; everything works as expected.
@RNG42 Strange; on my 5825U performance degrades with that tweak, and I found it's slightly faster without it. Just curious: if you're using Linux, are you using p-state or acpi?
Using Windows with the high-performance power plan; no other special settings. To what number did you set the threads?
16 but that might explain the differences. Using linux, not windows.
You might need to set these environment variables on Linux to get it working properly. Copy the following lines into webui-user.sh:
export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16
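If you'd rather set these from Python, they have to be exported before torch (and its OpenMP/MKL backends) is first imported, because the runtimes read them once at load time. A minimal sketch:

```python
import os

# Must run before `import torch`: the OpenMP/MKL runtimes read these
# variables once, when the libraries are loaded, and ignore later changes.
os.environ.setdefault("OMP_NUM_THREADS", "16")
os.environ.setdefault("MKL_NUM_THREADS", "16")
```
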
Well, I tried it and performance really is worse. Goes from 12-13s/it to 14-15s/it
It goes from 13 s/it to 15 s/it
@ghost573 It might simply be a throttling issue; check the frequencies of the cores.
torch.set_num_threads(<number of threads>)
torch.set_num_interop_threads(<number of threads>)
I tested different thread counts with my i5-11400:
2 threads - 5.77 s/it
4 threads - 4.46 s/it
6 threads - 4.12 s/it
8 threads - 4.65 s/it
12 threads - 4.83 s/it
20 Steps, 384x384, DPM++ 2M Karras
It seems like setting it to half the total threads gives the best performance?
I've come to that conclusion; it just seems to be a generic fact. torch, onnx, and openvino all use half your CPU threads by default, and that seems to be optimal, rather than using more threads and getting only slightly faster, sometimes slower, speeds.
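That heuristic, half the logical processors (roughly one thread per physical core), can be written as a tiny helper; it's only a rule of thumb from the benchmarks above, so measure on your own machine before relying on it:

```python
import os

def default_torch_threads(logical=None):
    """Half the logical processor count, minimum 1 — the 'half your
    threads' heuristic from the benchmarks above."""
    if logical is None:
        logical = os.cpu_count() or 2  # fall back if the count is unknown
    return max(1, logical // 2)
```

For example, `default_torch_threads(12)` gives 6, matching the best i5-11400 result in the table above.
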
Hey! Thank you! This solved my problem, and I can confirm that on my Intel Core i9 the optimal number of threads is 10, which is half the available threads.
I was trying to use a GPU (AMD Radeon RX 580 8 GB) but something in macOS Ventura makes the GPU crash (black screen and fans ramp up to 100%).
So, disabling the GPU and using those 2 lines of code saved me.
I'm using Colab (CPU); sometimes I can't use a GPU, I don't know why. So I added this to /content/stable-diffusion-webui/webui-user.sh:
export COMMANDLINE_ARGS="--precision full --no-half --use-cpu all --skip-torch-cuda-test"
export CUDA_VISIBLE_DEVICES="-1"
but I got an error:
Commit hash: 4af3ca5393151d61363c30eef4965e694eeac15e
Installing requirements for Web UI
Launching Web UI with arguments: --share --xformers --enable-insecure-extension-access
Warning: caught exception 'No CUDA GPUs are available', memory monitor disabled
LatentDiffusion: Running in eps-prediction mode
Traceback (most recent call last):
File "launch.py", line 299, in
torch.set_num_threads(<number of threads>)
torch.set_num_interop_threads(<number of threads>)
Core i7 Mac Mini (2018). 512x512, DPM++ SDE Karras, 11 steps. Setting these parameters to 6 reduces generation time from ~11 minutes to 4-5 minutes.
Is there an existing issue for this?
What would your feature do?
As per the #3300 discussion, I think some optimizations for running SD on the CPU are possible. They don't have to be major; even minor improvements will benefit those who have a powerful CPU but an old GPU that isn't capable of running SD.
Quoting the discussion to show some optimizations are possible.
Proposed workflow
Add --skip-torch-cuda-test --use-cpu all to webui-user and launch the webui.
Additional information
N/A