bes-dev / stable_diffusion.openvino

Apache License 2.0

Only 4 threads seem to be used on an 8 thread machine. #10

Open expenses opened 2 years ago

expenses commented 2 years ago

Hi! This is a really cool piece of work; it seems to run approx 2x faster than a native Torch CPU implementation. I did notice that it only uses 4 of the 8 threads on my machine, though. I'm new to OpenVINO; is there a way to configure how many threads are used?

bes-dev commented 2 years ago

@expenses hey, did you try setting the OMP_NUM_THREADS variable? Something like:

export OMP_NUM_THREADS=8
python stable_diffusion.py ...

expenses commented 2 years ago

@expenses hey, did you try setting the OMP_NUM_THREADS variable?

Hmm, doing that doesn't help either. Perhaps there's a hardware reason why I can't use 8 cores for this? I am on a laptop (with an 11th Gen Intel i7-1165G7).

bes-dev commented 2 years ago

@expenses you can also try the CPU_THREADS_NUM or CPU_THROUGHPUT_STREAMS variables 🤔 I'm not sure this is a hardware problem; in my opinion, it's a problem on the OpenVINO side. You could also open an issue in the OpenVINO repo: https://github.com/openvinotoolkit/openvino/issues
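
For reference, a minimal sketch of how those keys could be passed through the Python API. CPU_THREADS_NUM and CPU_THROUGHPUT_STREAMS are legacy CPU plugin config keys, and whether your OpenVINO build still accepts them (possibly with a deprecation warning) is an assumption on my part:

from openvino.runtime import Core

core = Core()
# Legacy CPU plugin keys, passed as strings; newer releases expose the
# same knobs as INFERENCE_NUM_THREADS and NUM_STREAMS instead.
core.set_property("CPU", {"CPU_THREADS_NUM": "8"})
core.set_property("CPU", {"CPU_THROUGHPUT_STREAMS": "1"})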

LouDou commented 2 years ago

Modify the .py file and add this after self.core = Core():

self.core.set_property("CPU", {"INFERENCE_NUM_THREADS": 8})

You can change 8 to match the number of cores in your system.
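
If you would rather not hard-code the count, here is a standalone sketch that derives it from the machine; os.cpu_count() and the property read-back are my additions, not something from this thread:

import os
from openvino.runtime import Core

core = Core()
# os.cpu_count() returns logical CPUs (threads), not physical cores;
# halve it or hard-code a value if you want one thread per core.
core.set_property("CPU", {"INFERENCE_NUM_THREADS": os.cpu_count()})
print(core.get_property("CPU", "INFERENCE_NUM_THREADS"))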

benplumley commented 2 years ago

You can change 8 to match the number of cores in your system.

This works in that all eight of my CPU cores go to 100% rather than just four of them, but it doesn't reduce my seconds per iteration at all.

LouDou commented 2 years ago

Yeah, you're right, this may not do exactly what I thought it did, but it looked like the most likely parameter in the OpenVINO docs that I could find (docs which, I must add, are pretty hard to read and which I'm not at all familiar with).

On my Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz (8C/8T), using the same prompt and seed and taking a reading after 3 iterations:

INFERENCE_NUM_THREADS: 1 = 14.50 s/it
INFERENCE_NUM_THREADS: 2 = 7.43 s/it
INFERENCE_NUM_THREADS: 3 = 5.28 s/it
INFERENCE_NUM_THREADS: 4 = 4.34 s/it
INFERENCE_NUM_THREADS: 5 = 3.89 s/it
INFERENCE_NUM_THREADS: 6 = 3.58 s/it
INFERENCE_NUM_THREADS: 7 = 3.42 s/it
INFERENCE_NUM_THREADS: 8 = 3.31 s/it

I can see with each increment that an additional core is being used, but this is clearly not scaling linearly.

If I omit the config completely, it uses all 8 cores at 3.36 s/it.

LouDou commented 2 years ago

Same test on my laptop, an Intel(R) Core(TM) i7-11800H @ 2.30GHz (8C/16T):

INFERENCE_NUM_THREADS: 1 = 15.11 s/it
INFERENCE_NUM_THREADS: 2 = 7.99 s/it
INFERENCE_NUM_THREADS: 3 = 5.84 s/it
INFERENCE_NUM_THREADS: 4 = 4.43 s/it
INFERENCE_NUM_THREADS: 5 = 3.79 s/it
INFERENCE_NUM_THREADS: 6 = 3.40 s/it
INFERENCE_NUM_THREADS: 7 = 3.12 s/it
INFERENCE_NUM_THREADS: 8 = 2.84 s/it
INFERENCE_NUM_THREADS: 9 = 4.13 s/it
INFERENCE_NUM_THREADS: 10 = 3.87 s/it
INFERENCE_NUM_THREADS: 11 = 3.68 s/it
INFERENCE_NUM_THREADS: 12 = 3.29 s/it
INFERENCE_NUM_THREADS: 13 = 3.20 s/it
INFERENCE_NUM_THREADS: 14 = 3.12 s/it
INFERENCE_NUM_THREADS: 15 = 3.07 s/it
INFERENCE_NUM_THREADS: 16 = 2.89 s/it

breadbrowser commented 2 years ago

Google Colab:

default = 32 s/it
INFERENCE_NUM_THREADS: 0 = 30 s/it
INFERENCE_NUM_THREADS: 1 = 37 s/it
INFERENCE_NUM_THREADS: 20 = 52 s/it

panki27 commented 2 years ago

{"INFERENCE_NUM_THREADS": 16} gave me half a second of speed boost per iteration on AMD Ryzen 7 3700X, running at 4 s/it @ 4 GHz. This is quite a lot slower than an Intel apparently, but I guess this is to be expected...

breadbrowser commented 2 years ago

https://www.kaggle.com/code/lostgoldplayer/cpu-stable-diffusion takes 2 minutes for one image.

Neowam commented 2 years ago

Modify the .py file and add this after self.core = Core():

self.core.set_property("CPU", {"INFERENCE_NUM_THREADS": 8})

You can change 8 to match the number of cores in your system.

How do I do this within demo.py? Can you post an example of how to set this up?

panki27 commented 2 years ago

@Neowam no need to post the entire script...

It goes right after self.core = Core(), which is line 29 in stable_diffusion_engine.py in the latest version.
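
For anyone else looking, a sketch of what the edit looks like in place (the class name and constructor shape here are assumptions, not copied from the repo):

from openvino.runtime import Core

class StableDiffusionEngine:
    def __init__(self, num_threads=8):
        self.core = Core()
        # Right after creating the Core, cap the CPU plugin's thread pool:
        self.core.set_property("CPU", {"INFERENCE_NUM_THREADS": num_threads})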

jdluzen commented 2 years ago

Around 3-3.5s/it on a 3800X. Almost as fast as running on a 5700XT with DirectML!

trash-cant commented 2 years ago

Ryzen 5600X: default = 4.33 s/it

INFERENCE_NUM_THREADS: 12 = 3.7 s/it
INFERENCE_NUM_THREADS: 10 = 3.8 s/it
INFERENCE_NUM_THREADS: 8 = 4.22 s/it
INFERENCE_NUM_THREADS: 6 = 4.16 s/it

Love being able to run this on CPU!

rncar commented 2 years ago

Intel i5-12600 (Linux)

I am using: self.core.set_property("CPU", {"CPU_BIND_THREAD": "NUMA"})

It works with all compatible CPUs.
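
If a newer OpenVINO build rejects that string key, the same knob appears to be exposed as the affinity property in the 2.0 API; this sketch is my reading of the docs, not something tested in this thread:

from openvino.runtime import Core, properties

core = Core()
# Bind inference threads per NUMA node (physical cores only), the
# 2.0-API counterpart of the CPU_BIND_THREAD = "NUMA" config key.
core.set_property("CPU", {properties.affinity(): properties.Affinity.NUMA})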

fhaust commented 2 years ago

I kind of assume that OpenVINO uses CPU features / instructions that are only available once per core.

Also, keep in mind that half of the threads are "just" hyperthreading, which leverages the fact that CPUs spend much of their time waiting for I/O. NN inference, in contrast, is mostly CPU bound, so it maxes a core out without leaving any gap to squeeze in extra instructions while the CPU waits.

rncar commented 2 years ago

Exactly. Using self.core.set_property("CPU", {"CPU_BIND_THREAD": "NUMA"}) uses all the cores but not the hyperthreaded ones, and after some tests with other threading options it gives me the maximum speed, around 3.30 s/it.

For some tasks hyperthreading is not useful.

Maybe it should be added to the code.

dcz-self commented 2 years ago

AMD Ryzen 5 2400GE (4C/8T, 3200 MHz):

4 threads: 21.16 s/it
8 threads: 14.66 s/it

Sogvehz commented 2 years ago

Intel Xeon 2670(3):

1: 23.75 s/it
2: 7.22 s/it
4: 6.57 s/it
...
10: 5.02 s/it
but 12: 5.66 s/it
14: 5.76 s/it
...
24: 6.95 s/it
NUMA: 6.1 s/it

Under 50% utilization in every test. Why?

Sogvehz commented 2 years ago

Btw, I bought one more memory stick, so now there are two of them, and dual-channel is great (3.2 s/it). So the problem was memory I/O.

Seegee commented 1 year ago

Hmm, it seems using only 12 threads on a Ryzen 5900X already maxes out performance? Going from 12 to 24 threads does push my CPU usage up to 100%, but the speed barely increases at all. I guess hyperthreading on AMD isn't useful at all for this workload?

Ryzen 5900X:
INFERENCE_NUM_THREADS: 12 = 2.42 s/it
INFERENCE_NUM_THREADS: 24 = 2.39 s/it

Anyone know what the bottleneck is here? Running on an NVMe SSD and 2666 MHz RAM.

dbalabka commented 1 year ago

It seems that hyperthreading isn't enabled by default. You have to enable it using the property:

from openvino.runtime import properties

compiled_model = core.compile_model(
    model=model,
    device_name=device_name,
    # Let the plugin schedule onto hyperthreaded (logical) cores as well.
    config={properties.hint.enable_hyper_threading(): True},
)

https://docs.openvino.ai/2023.1/openvino_docs_OV_UG_supported_plugins_CPU.html#multi-threading-optimization
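
For completeness, a small read-back sketch to confirm what the plugin will actually use; the getters here are my assumption from the same docs page:

from openvino.runtime import Core, properties

core = Core()
# Query the CPU plugin's effective settings before compiling anything.
print(core.get_property("CPU", properties.hint.enable_hyper_threading()))
print(core.get_property("CPU", properties.inference_num_threads()))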