perf(pipeline::lowp::div255)

RazrFalcon / tiny-skia

A tiny Skia subset ported to Rust

BSD 3-Clause "New" or "Revised" License

perf(pipeline::lowp::div255) #67

Closed mostafa-khaled775 closed 1 year ago

mostafa-khaled775 commented 1 year ago

This significantly improves performance of clipping (on my machine, clipping benchmarks are about 4 times faster).

RazrFalcon commented 1 year ago

Interesting. Thanks!

So in (v + u16x16::splat(255)) / u16x16::splat(256), the v + u16x16::splat(255) part is vrsraq_n_u16(v, v, 8) and v / u16x16::splat(256) is vrshrq_n_u16(v, 8). I hadn't noticed that.

I will benchmark it as well later. On what arch have you benchmarked it, x86 or ARM?
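
For readers following the arithmetic in the comment above, here is a minimal scalar sketch of the two formulas being compared. It is not code from tiny-skia or from this patch: the function names (div255_approx, div255_neon_model, div255_exact, rounding_shr8) are hypothetical, and the NEON model assumes the documented semantics of the two intrinsics (vrshrq_n_u16 is a rounding right shift, vrsraq_n_u16 is a rounding right shift followed by an add). Under those assumptions, the two formulas agree to within 1 for inputs in 0..=255*255, but they are not identical.

```rust
/// Approximates `v / 255` as `(v + 255) / 256`.
/// Dividing by 256 is just a right shift by 8, so no real division is needed.
/// Intended for `v` in `0..=255 * 255` (the product of two 8-bit values);
/// larger inputs would overflow the `u16` addition.
pub fn div255_approx(v: u16) -> u16 {
    (v + 255) / 256
}

/// Rounding right shift by 8: adds half of 256 before shifting.
/// Computed in `u32` so the `+ 128` cannot overflow.
fn rounding_shr8(x: u16) -> u16 {
    ((x as u32 + 128) >> 8) as u16
}

/// Scalar model (an assumption, not the real intrinsics) of
/// `vrshrq_n_u16(vrsraq_n_u16(v, v, 8), 8)` applied to a single lane.
pub fn div255_neon_model(v: u16) -> u16 {
    // vrsraq_n_u16(v, v, 8): accumulate v + rounding_shift_right(v, 8)
    let acc = v + rounding_shr8(v);
    // vrshrq_n_u16(acc, 8): rounding shift right by 8
    rounding_shr8(acc)
}

/// Reference: `v / 255` rounded to the nearest integer.
pub fn div255_exact(v: u16) -> u16 {
    ((v as u32 + 127) / 255) as u16
}
```

For example, div255_approx(1) returns 1 while div255_neon_model(1) returns 0, so the simpler formula trades a possible off-by-one for cheaper arithmetic.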

mostafa-khaled775 commented 1 year ago

I benchmarked it on x86 (i5-8300H).

RazrFalcon commented 1 year ago

It is about 2.5x faster on Apple M1 Pro. Pretty impressive for such a simple change. Thanks!

RazrFalcon commented 1 year ago

Very strange. The new results, at least on ARM, are the same as they were 5 months ago, which was the last time I ran the benchmarks. I guess this change doesn't necessarily improve performance, but rather fixes a regression in LLVM. Weird.

Anyway, thanks for the patch.