expenses opened this issue 2 years ago

Hi! This is a really cool piece of work, it seems to run approx 2x faster than a native Torch CPU implementation. I did notice that it only uses 4 of the 8 threads on my machine though. I'm new to OpenVINO; is there a way to configure how many threads are used?
@expenses hey, did you try to set the OMP_NUM_THREADS variable? Something like

```sh
export OMP_NUM_THREADS=8
python stable_diffusion.py ...
```
> @expenses hey, did you try to set the OMP_NUM_THREADS variable?

Hmm, doing that doesn't help either. Perhaps there's a hardware reason why I can't use 8 cores for this? I am on a laptop (with an 11th Gen Intel i7-1165G7).
@expenses you can also try the CPU_THREADS_NUM or CPU_THROUGHPUT_STREAMS variables 🤔 I'm not sure this is a hardware problem; in my opinion it's a problem on the OpenVINO side. You could also create an issue in the OpenVINO repo: https://github.com/openvinotoolkit/openvino/issues
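Something like this, if your OpenVINO version still accepts the legacy Inference Engine config keys (I haven't verified this on recent releases; both key names are from the old IE API):

```python
from openvino.runtime import Core

core = Core()
# Legacy Inference Engine config keys; values are passed as strings in the old API
core.set_property("CPU", {"CPU_THREADS_NUM": "8"})         # number of CPU threads
core.set_property("CPU", {"CPU_THROUGHPUT_STREAMS": "1"})  # number of inference streams
```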
Modify the .py file: add this after `self.core = Core()`:

```python
self.core.set_property("CPU", {"INFERENCE_NUM_THREADS": 8})
```

You can change `8` to match the number of cores in your system.
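If you'd rather not hard-code the count, an untested variant that detects it at runtime (same `set_property` call, standalone here; inside stable_diffusion_engine.py the `Core` instance is `self.core`):

```python
import multiprocessing

from openvino.runtime import Core

core = Core()
# Use the number of logical CPUs reported by the OS instead of a hard-coded 8
core.set_property("CPU", {"INFERENCE_NUM_THREADS": multiprocessing.cpu_count()})
```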
> You can change `8` to match the number of cores in your system.
This works in that all eight of my CPU cores go to 100% rather than just four of them, but it doesn't reduce my seconds per iteration at all.
Yeah, you're right, this maybe doesn't do exactly what I thought it did, but it looks like the most likely parameter I could find in the OpenVINO docs (docs which, I must add, are pretty hard to read and which I am not at all familiar with).
On my Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz (8C8T), using the same prompt and seed, taking a reading after 3 iterations:
INFERENCE_NUM_THREADS: 1 = 14.50 s/it
INFERENCE_NUM_THREADS: 2 = 7.43 s/it
INFERENCE_NUM_THREADS: 3 = 5.28 s/it
INFERENCE_NUM_THREADS: 4 = 4.34 s/it
INFERENCE_NUM_THREADS: 5 = 3.89 s/it
INFERENCE_NUM_THREADS: 6 = 3.58 s/it
INFERENCE_NUM_THREADS: 7 = 3.42 s/it
INFERENCE_NUM_THREADS: 8 = 3.31 s/it
I can see with each increment that an additional core is being used, but this is clearly not scaling linearly.
If I omit the config completely, it uses all 8 cores at 3.36 s/it.
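To put numbers on "not scaling linearly", here is the speedup and parallel efficiency implied by the i7-9700K readings above (efficiency = speedup divided by thread count):

```python
# Timings from the i7-9700K list above, in s/it
t = {1: 14.50, 2: 7.43, 3: 5.28, 4: 4.34, 5: 3.89, 6: 3.58, 7: 3.42, 8: 3.31}
for n, s in t.items():
    speedup = t[1] / s
    print(f"{n} threads: {speedup:.2f}x speedup, {speedup / n:.0%} efficiency")
# 2 threads are ~98% efficient, but 8 threads give only ~4.4x speedup (~55%)
```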
Same test on my laptop, Intel(R) Core(TM) i7-11800H @ 2.30GHz (8C16T):
INFERENCE_NUM_THREADS: 1 = 15.11 s/it
INFERENCE_NUM_THREADS: 2 = 7.99 s/it
INFERENCE_NUM_THREADS: 3 = 5.84 s/it
INFERENCE_NUM_THREADS: 4 = 4.43 s/it
INFERENCE_NUM_THREADS: 5 = 3.79 s/it
INFERENCE_NUM_THREADS: 6 = 3.40 s/it
INFERENCE_NUM_THREADS: 7 = 3.12 s/it
INFERENCE_NUM_THREADS: 8 = 2.84 s/it
INFERENCE_NUM_THREADS: 9 = 4.13 s/it
INFERENCE_NUM_THREADS: 10 = 3.87 s/it
INFERENCE_NUM_THREADS: 11 = 3.68 s/it
INFERENCE_NUM_THREADS: 12 = 3.29 s/it
INFERENCE_NUM_THREADS: 13 = 3.20 s/it
INFERENCE_NUM_THREADS: 14 = 3.12 s/it
INFERENCE_NUM_THREADS: 15 = 3.07 s/it
INFERENCE_NUM_THREADS: 16 = 2.89 s/it
Google Colab:
default = 32 s/it
INFERENCE_NUM_THREADS: 0 = 30 s/it
INFERENCE_NUM_THREADS: 1 = 37 s/it
INFERENCE_NUM_THREADS: 20 = 52 s/it
{"INFERENCE_NUM_THREADS": 16}
gave me half a second of speed boost per iteration on AMD Ryzen 7 3700X
, running at 4 s/it @ 4 GHz.
This is quite a lot slower than an Intel apparently, but I guess this is to be expected...
https://www.kaggle.com/code/lostgoldplayer/cpu-stable-diffusion (it takes 2 minutes for one image)
> Modify the .py file: add this after `self.core = Core()`:
>
> ```python
> self.core.set_property("CPU", {"INFERENCE_NUM_THREADS": 8})
> ```
>
> You can change `8` to match the number of cores in your system.
How do I do this within demo.py? Can you post an example of how to set this up?
@Neowam no need to post the entire script; the place to add it is line 29 of stable_diffusion_engine.py in the latest version.
Around 3-3.5s/it on a 3800X. Almost as fast as running on a 5700XT with DirectML!
Ryzen 5600X:
default = 4.33 s/it
INFERENCE_NUM_THREADS: 12 = 3.7 s/it
INFERENCE_NUM_THREADS: 10 = 3.8 s/it
INFERENCE_NUM_THREADS: 8 = 4.22 s/it
INFERENCE_NUM_THREADS: 6 = 4.16 s/it
Love being able to run this on CPU! Intel i5-12600 (Linux). I am using:

```python
self.core.set_property("CPU", {"CPU_BIND_THREAD": "NUMA"})
```

It works with all compatible CPUs.
I kind of assume that OpenVINO uses CPU features / instructions that are only available once per core.
Also, keep in mind that half of the threads are "just" hyperthreading, which leverages the fact that CPUs spend most of their time waiting for I/O. NN inference, in contrast, is mostly CPU-bound: it maxes a core out, leaving no gaps in which to squeeze extra instructions while the CPU waits.
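You can see the physical/logical split directly, e.g. with psutil (assuming you have it installed):

```python
import psutil

physical = psutil.cpu_count(logical=False)  # real cores
logical = psutil.cpu_count(logical=True)    # hardware threads, incl. hyperthreading
print(f"{physical} physical cores, {logical} hardware threads")
```

On the i7-11800H above that would print 8 and 16, which lines up with the performance cliff between 8 and 9 threads in that list.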
Exactly. Using `self.core.set_property("CPU", {"CPU_BIND_THREAD": "NUMA"})` uses all the physical cores but not the hyperthreaded ones, and after some tests with other threading options it gives me the maximum speed, around 3.30 s/it. For some tasks hyperthreading just isn't useful. Maybe it should be added to the code.
AMD Ryzen 5 2400GE (4C8T, 3200 MHz):
4 threads: 21.16 s/it
8 threads: 14.66 s/it
Intel Xeon 2670 (v3):
1: 23.75 s/it
2: 7.22 s/it
4: 6.57 s/it
...
10: 5.02 s/it
but 12: 5.66 s/it
14: 5.76 s/it
...
24: 6.95 s/it
NUMA: 6.1 s/it

Under 50% utilization in every test. Why?

Btw, I bought one more memory stick, so now there are two of them, and dual-channel is great (3.2 s/it). So the problem was memory I/O.
Hmm, it seems using only 12 threads on a Ryzen 5900X already maxes out performance? Upping from 12 to 24 threads does push my CPU usage to 100%, but the speed barely increases at all. I guess hyperthreading on AMD isn't useful at all for this workload?

Ryzen 5900X:
INFERENCE_NUM_THREADS: 12 = 2.42 s/it
INFERENCE_NUM_THREADS: 24 = 2.39 s/it

Anyone know what the bottleneck is here? Running on an NVMe SSD and 2666 MHz RAM.
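Given the dual-channel finding a few comments up, memory bandwidth is a plausible suspect. A back-of-envelope peak for dual-channel DDR4-2666 (assuming standard 64-bit channels):

```python
mt_per_s = 2666e6        # transfers per second, per channel
bytes_per_transfer = 8   # 64-bit channel width
channels = 2
print(f"{mt_per_s * bytes_per_transfer * channels / 1e9:.1f} GB/s")  # ~42.7 GB/s peak
```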
It seems that hyperthreading isn't enabled by default. You have to enable it using the property:

```python
from openvino.runtime import properties

compiled_model = core.compile_model(
    model=model,
    device_name=device_name,
    # Schedule inference threads on hyperthreaded logical cores too
    config={properties.hint.enable_hyper_threading(): True},
)
```
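If you want to check that it took effect (property getters may vary between OpenVINO versions), you should be able to read it back from the compiled model:

```python
# Hedged: assumes CompiledModel.get_property accepts the same property key
print(compiled_model.get_property(properties.hint.enable_hyper_threading()))
```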