Closed mostafa-khaled775 closed 1 year ago
Interesting. Thanks!
So in (v + u16x16::splat(255)) / u16x16::splat(256), the v + u16x16::splat(255) part is vrsraq_n_u16(v, v, 8) and v / u16x16::splat(256) is vrshrq_n_u16(v, 8).
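For reference, here is a minimal scalar sketch of the per-lane arithmetic being discussed. The function names are hypothetical and not taken from the patch. The NEON models follow the documented rounding-shift semantics, which add half the divisor before shifting. It assumes v <= 65280 so that v + 255 fits in a u16. Composing the two intrinsics as in `neon_pair` is my reading of the comment above, not something confirmed by the patch.

```rust
/// One lane of `(v + u16x16::splat(255)) / u16x16::splat(256)`.
/// Assumption: `v <= 65280`, so the addition cannot overflow a u16.
fn add255_div256(v: u16) -> u16 {
    (v + 255) / 256
}

/// The same computation with the division written as a shift;
/// for unsigned integers, dividing by 256 equals shifting right by 8.
fn add255_shr8(v: u16) -> u16 {
    (v + 255) >> 8
}

/// Scalar model of one lane of `vrshrq_n_u16(v, 8)`:
/// rounding shift right, i.e. add 128, then shift by 8.
/// Computed in u32 so the rounding addend cannot overflow.
fn rounding_shr8(v: u16) -> u16 {
    ((v as u32 + 128) >> 8) as u16
}

/// Scalar model of one lane of `vrsraq_n_u16(a, b, 8)`:
/// rounding shift right of `b`, accumulated into `a` (wrapping).
fn rounding_shr8_accumulate(a: u16, b: u16) -> u16 {
    a.wrapping_add(rounding_shr8(b))
}

/// Hypothetical composition of the two intrinsics named in the thread.
fn neon_pair(v: u16) -> u16 {
    rounding_shr8(rounding_shr8_accumulate(v, v))
}
```

Under these models both forms map 65025 (255 * 255) to 255, but they are not identical for every input: for v = 1 the first form yields 1 and the rounding pair yields 0, so the mapping between the two is approximate rather than exact.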
I hadn't noticed that.
I will benchmark it as well later. On what arch have you benchmarked it, x86 or ARM?
I benchmarked it on x86 (i5-8300H).
It is about 2.5x faster on Apple M1 Pro. Pretty impressive for such a simple change. Thanks!
Very strange. The new results, at least on ARM, are the same as they were 5 months ago, which was the last time I ran the benchmarks. I guess this change doesn't necessarily improve performance, but rather fixes a regression in LLVM. Weird.
Anyway, thanks for the patch.
This significantly improves performance of clipping (on my machine, clipping benchmarks are about 4 times faster).