alexheretic / glyph-brush

Fast GPU cached text rendering
Apache License 2.0

Optimise lower workloads by not multithreading #127

Closed alexheretic closed 3 years ago

alexheretic commented 3 years ago

This PR aims to eliminate cases where the default (multithread=true) performs worse than setting multithread=off. Instead, we'll only use the multithreaded code paths when they look likely to bring significant gains.

Investigation

The st_vs_mt benchmark runs through a set of scenarios of 16-1500 Unicode glyphs at scales of 12px-150px.

Two areas can use multithreading: outlining and rasterizing. An early test of mt outlining alone against single-threaded (st) showed that mt outlining provides little value: at best st is about 15% slower (bench_1500_chars_12px), while for the smallest workload mt outlining is 2.17x slower (bench_16_chars_12px).

```
group                     mt-outline-only       st
-----                     ---------------       --
bench_1500_chars_150px    1.02     45.5±0.17ms  1.00     44.5±0.23ms
bench_1500_chars_75px     1.00     17.1±0.17ms  1.00     17.2±0.04ms
bench_1500_chars_30px     1.00      5.8±0.03ms  1.08      6.2±0.04ms
bench_1500_chars_12px     1.00      2.6±0.02ms  1.15      2.9±0.02ms
bench_300_chars_150px     1.03     14.4±0.14ms  1.00     14.0±0.18ms
bench_300_chars_75px      1.04      4.6±0.05ms  1.00      4.5±0.01ms
bench_300_chars_30px      1.00  1494.0±27.63µs  1.00  1497.0±16.47µs
bench_300_chars_12px      1.00   727.8±11.16µs  1.05    766.0±5.30µs
bench_50_chars_150px      1.06      2.7±0.03ms  1.00      2.5±0.00ms
bench_50_chars_75px       1.15   903.1±20.58µs  1.00    788.0±6.46µs
bench_50_chars_30px       1.27    321.0±6.58µs  1.00    252.2±1.05µs
bench_50_chars_12px       1.42    182.5±2.24µs  1.00    128.9±0.74µs
bench_16_chars_150px      1.16   895.2±19.25µs  1.00    768.6±7.30µs
bench_16_chars_75px       1.35    322.9±4.27µs  1.00    239.6±0.92µs
bench_16_chars_30px       1.82    141.0±2.72µs  1.00     77.7±0.31µs
bench_16_chars_12px       2.17     88.6±2.55µs  1.00     40.9±0.17µs
```

However, mt drawing is worth it, at least in some cases.

group                     mt                    st
-----                     --                    --
bench_1500_chars_150px    1.00     17.0±0.22ms  2.62     44.5±0.23ms
bench_1500_chars_75px     1.00      6.6±0.07ms  2.60     17.2±0.04ms
bench_1500_chars_30px     1.00      3.2±0.09ms  1.96      6.2±0.04ms
bench_1500_chars_12px     1.00      2.3±0.07ms  1.26      2.9±0.02ms
bench_300_chars_150px     1.00      5.2±0.15ms  2.67     14.0±0.18ms
bench_300_chars_75px      1.00  1824.5±23.38µs  2.44      4.5±0.01ms
bench_300_chars_30px      1.00   881.4±48.28µs  1.70  1497.0±16.47µs
bench_300_chars_12px      1.00   731.5±68.70µs  1.05    766.0±5.30µs
bench_50_chars_150px      1.00  1105.0±40.98µs  2.30      2.5±0.00ms
bench_50_chars_75px       1.00   455.9±32.55µs  1.73    788.0±6.46µs
bench_50_chars_30px       1.00   250.3±29.70µs  1.01    252.2±1.05µs
bench_50_chars_12px       1.67   214.6±32.25µs  1.00    128.9±0.74µs
bench_16_chars_150px      1.00   458.8±37.79µs  1.68    768.6±7.30µs
bench_16_chars_75px       1.00   226.3±48.51µs  1.06    239.6±0.92µs
bench_16_chars_30px       1.80   139.9±30.99µs  1.00     77.7±0.31µs
bench_16_chars_12px       2.90   118.5±19.80µs  1.00     40.9±0.17µs

mt code can be faster, but it can also be slower for small workloads. It's also worth noting that when performance is similar we can assume the single-threaded version will be more power efficient.

Using the tallest glyph height & multiplying by the number of glyphs I calculated a "work magnitude". This can be used to target cases where we expect a decent speedup. I plucked out min magnitude 8742:

group                     mt(>=mag:8742)        st
-----                     --------------        --
bench_1500_chars_150px    1.00     17.1±0.26ms  2.61     44.5±0.23ms
bench_1500_chars_75px     1.00      6.7±0.17ms  2.58     17.2±0.04ms
bench_1500_chars_30px     1.00      3.2±0.08ms  1.94      6.2±0.04ms
bench_1500_chars_12px     1.00      2.4±0.09ms  1.24      2.9±0.02ms
bench_300_chars_150px     1.00      5.2±0.07ms  2.70     14.0±0.18ms
bench_300_chars_75px      1.00  1850.1±43.15µs  2.41      4.5±0.01ms
bench_300_chars_30px      1.00   891.4±60.93µs  1.68  1497.0±16.47µs
bench_300_chars_12px      1.01   772.7±17.17µs  1.00    766.0±5.30µs
bench_50_chars_150px      1.01      2.6±0.01ms  1.00      2.5±0.00ms
bench_50_chars_75px       1.02   801.7±21.66µs  1.00    788.0±6.46µs
bench_50_chars_30px       1.00    248.0±0.92µs  1.02    252.2±1.05µs
bench_50_chars_12px       1.00    128.7±0.63µs  1.00    128.9±0.74µs
bench_16_chars_150px      1.00    765.9±6.20µs  1.00    768.6±7.30µs
bench_16_chars_75px       1.01    241.2±0.78µs  1.00    239.6±0.92µs
bench_16_chars_30px       1.00     77.3±0.38µs  1.00     77.7±0.31µs
bench_16_chars_12px       1.00     38.6±0.20µs  1.06     40.9±0.17µs

Now only bench_300_chars_30px and larger-magnitude work uses multithreading. While not an exact science, the work estimate is good enough that the mt benches are at most about 2% slower than st in these runs. So this should resolve #125.
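As a rough illustration of the "work magnitude" check described above, here is a minimal sketch. The function names and the constant name are hypothetical and not taken from the glyph-brush source; only the formula (glyph count times tallest glyph height) and the 8742 threshold come from this description. The heights passed in `main` are nominal pixel scales used as stand-ins for the measured tallest glyph height.

```rust
/// Threshold quoted in the PR text ("min magnitude 8742").
/// The constant name is hypothetical.
const MIN_MT_MAGNITUDE: f32 = 8742.0;

/// Linear work estimate: glyph count times the tallest glyph height in pixels.
fn work_magnitude(glyph_count: usize, tallest_h: f32) -> f32 {
    glyph_count as f32 * tallest_h
}

/// Returns true when the estimated work is large enough that the
/// multithreaded path is expected to pay off.
fn use_multithreading(glyph_count: usize, tallest_h: f32) -> bool {
    work_magnitude(glyph_count, tallest_h) >= MIN_MT_MAGNITUDE
}

fn main() {
    // (glyph count, tallest height) pairs; heights are illustrative
    // stand-ins, not measured values.
    let cases: [(usize, f32); 4] = [(16, 12.0), (50, 150.0), (300, 30.0), (1500, 12.0)];
    for (count, h) in cases {
        println!(
            "{} glyphs, tallest {}px -> magnitude {}, mt: {}",
            count,
            h,
            work_magnitude(count, h),
            use_multithreading(count, h)
        );
    }
}
```

Under these stand-in heights, the 300-glyph 30px case is the smallest of the four that crosses the threshold, which matches the table above where bench_300_chars_30px is the smallest workload still showing an mt speedup.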

alexheretic commented 3 years ago

Using magnitude glyph_count * tallest_h * tallest_h should scale a bit more realistically. This moves bench_1500_chars_12px onto the single-thread path, since st was only 24% slower than mt there (I'm looking at st being roughly 70% slower as the point to select mt).

```
group                     mt                    st
-----                     --                    --
bench_1500_chars_150px    1.00     17.0±0.24ms  2.62     44.5±0.23ms
bench_1500_chars_75px     1.00      6.6±0.07ms  2.62     17.2±0.04ms
bench_1500_chars_30px     1.00      3.1±0.09ms  1.97      6.2±0.04ms
bench_1500_chars_12px     1.02      3.0±0.00ms  1.00      2.9±0.02ms
bench_300_chars_150px     1.00      5.1±0.06ms  2.72     14.0±0.18ms
bench_300_chars_75px      1.00  1818.4±26.07µs  2.45      4.5±0.01ms
bench_300_chars_30px      1.00   873.6±46.22µs  1.71  1497.0±16.47µs
bench_300_chars_12px      1.00    765.3±2.30µs  1.00    766.0±5.30µs
bench_50_chars_150px      1.00  1098.7±47.05µs  2.31      2.5±0.00ms
bench_50_chars_75px       1.02    801.3±5.33µs  1.00    788.0±6.46µs
bench_50_chars_30px       1.01    255.4±0.85µs  1.00    252.2±1.05µs
bench_50_chars_12px       1.00    124.2±0.53µs  1.04    128.9±0.74µs
bench_16_chars_150px      1.00    770.4±4.48µs  1.00    768.6±7.30µs
bench_16_chars_75px       1.00    238.7±0.71µs  1.00    239.6±0.92µs
bench_16_chars_30px       1.00     76.7±0.19µs  1.01     77.7±0.31µs
bench_16_chars_12px       1.00     38.4±0.19µs  1.06     40.9±0.17µs
```
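To show why squaring the height changes which benches get picked, here is a small sketch comparing the two estimates. The function names are hypothetical, and the heights are nominal pixel scales standing in for the real tallest glyph height. The cutoff used with the squared magnitude isn't stated in this thread, so none is assumed here; the sketch only compares relative ordering.

```rust
/// Linear estimate: glyph_count * tallest_h.
fn linear_magnitude(glyph_count: usize, tallest_h: f32) -> f32 {
    glyph_count as f32 * tallest_h
}

/// Squared estimate: glyph_count * tallest_h * tallest_h.
fn squared_magnitude(glyph_count: usize, tallest_h: f32) -> f32 {
    glyph_count as f32 * tallest_h * tallest_h
}

fn main() {
    // (label, glyph count, stand-in tallest height). Heights are nominal
    // scales used for illustration, not measured glyph heights.
    let cases: [(&str, usize, f32); 3] = [
        ("1500 chars @ 12px", 1500, 12.0),
        ("300 chars @ 30px", 300, 30.0),
        ("50 chars @ 150px", 50, 150.0),
    ];
    for (label, count, h) in cases {
        println!(
            "{}: linear = {}, squared = {}",
            label,
            linear_magnitude(count, h),
            squared_magnitude(count, h)
        );
    }
}
```

With the linear estimate, many small glyphs (1500 at 12px) rank above 300 glyphs at 30px, and 50 glyphs at 150px rank below. With the squared estimate the order flips in both cases. This is consistent with the table above, where bench_1500_chars_12px now runs at st speed and bench_50_chars_150px picks up the mt speedup.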