Ogeon / palette

A Rust library for linear color calculations and conversion
Apache License 2.0

Implement SIMD support and add `wide` integration #278

Closed Ogeon closed 2 years ago

Ogeon commented 2 years ago

This adds initial support for SIMD types in most places. An exception is the Luv related types, where the conversion logic needs extra attention. Some of the conversions aren't necessarily optimal, but the focus was on making it work at all.

Integration with the wide crate has been added behind a feature flag, as a first example. More SIMD crates can be added in the future.

Breaking Change

Some functions that used to return bool now return a mask type. This mask type is still bool for regular floats and ints, so this change will mostly affect generic code. GetHue was also changed to no longer return Option<T>, for SIMD friendliness.
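
To illustrate why this mostly affects generic code, here is a minimal sketch of the mask idea. All trait and type names below (`HasMask`, `LaneLe`, `MaskAnd`, `F32x4`, `is_within`) are hypothetical and written only for this example; they are not palette's or wide's actual definitions.

```rust
// Hypothetical traits for illustration only; not palette's real API.

/// Associates a numeric type with the mask type its comparisons produce.
trait HasMask {
    type Mask;
}

/// Lane-wise `<=` that yields a mask instead of a plain `bool`.
trait LaneLe: HasMask {
    fn lane_le(self, other: Self) -> Self::Mask;
}

/// Lane-wise logical AND of two masks.
trait MaskAnd {
    fn and(self, other: Self) -> Self;
}

// Scalar case: the mask is plain `bool`, so non-generic callers see no change.
impl HasMask for f32 {
    type Mask = bool;
}
impl LaneLe for f32 {
    fn lane_le(self, other: Self) -> bool {
        self <= other
    }
}
impl MaskAnd for bool {
    fn and(self, other: Self) -> Self {
        self && other
    }
}

/// Stand-in for a four-lane SIMD vector (a real one would come from a SIMD crate).
#[derive(Clone, Copy)]
struct F32x4([f32; 4]);

// Vector case: one boolean per lane.
impl HasMask for F32x4 {
    type Mask = [bool; 4];
}
impl LaneLe for F32x4 {
    fn lane_le(self, other: Self) -> [bool; 4] {
        let mut mask = [false; 4];
        for i in 0..4 {
            mask[i] = self.0[i] <= other.0[i];
        }
        mask
    }
}
impl MaskAnd for [bool; 4] {
    fn and(self, other: Self) -> Self {
        let mut mask = [false; 4];
        for i in 0..4 {
            mask[i] = self[i] && other[i];
        }
        mask
    }
}

/// Generic bounds check: the return type is `T::Mask`, not `bool`.
fn is_within<T>(value: T, min: T, max: T) -> T::Mask
where
    T: LaneLe + Copy,
    T::Mask: MaskAnd,
{
    min.lane_le(value).and(value.lane_le(max))
}
```

With these definitions, `is_within(0.5f32, 0.0, 1.0)` still evaluates to a plain `bool`, while the same call on `F32x4` values returns one `bool` per lane. Generic callers that previously wrote `if is_within(...)` would need to handle the mask instead.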

github-actions[bot] commented 2 years ago

Benchmark for 780844c

| Test | Base | PR | % |
|------|--------------|------------------|---|
| Cie family/lab to lch | 2.9±0.07µs | 2.9±0.08µs | 0.00% |
| Cie family/lab to xyz | 733.0±15.20ns | 732.5±15.26ns | -0.07% |
| Cie family/lch to lab | 2.1±0.05µs | 2.1±0.05µs | 0.00% |
| Cie family/linsrgb to xyz | 3.3±0.06µs | **3.2±0.07µs** | **-3.03%** |
| Cie family/xyz to lab | 16.4±0.32µs | 16.4±0.47µs | 0.00% |
| Cie family/xyz to yxy | 554.9±14.91ns | **473.2±9.12ns** | **-14.72%** |
| Cie family/yxy to xyz | 473.3±16.92ns | **446.1±8.45ns** | **-5.75%** |
| Matrix functions/matrix_inverse | 9.6±0.33ns | **9.3±0.19ns** | **-3.12%** |
| Matrix functions/multiply_3x3 | 12.8±0.26ns | 12.8±0.32ns | 0.00% |
| Matrix functions/multiply_rgb_to_xyz | 5.9±0.14ns | 5.9±0.24ns | 0.00% |
| Matrix functions/multiply_xyz | 5.9±0.25ns | 5.9±0.20ns | 0.00% |
| Matrix functions/multiply_xyz_to_rgb | 5.9±0.15ns | 5.9±0.17ns | 0.00% |
| Matrix functions/rgb_to_xyz_matrix | 20.1±0.38ns | 20.2±0.77ns | +0.50% |
| Rgb family/hsl to hsv | 556.0±17.99ns | 556.6±20.13ns | +0.11% |
| Rgb family/hsl to linear hsl | **8.8±0.17µs** | 10.4±0.20µs | **+18.18%** |
| Rgb family/hsl to rgb | **2.0±0.05µs** | 2.1±0.04µs | **+5.00%** |
| Rgb family/hsv to hsl | **936.2±19.63ns** | 1261.8±24.21ns | **+34.78%** |
| Rgb family/hsv to hwb | 205.4±3.92ns | 205.8±4.61ns | +0.19% |
| Rgb family/hsv to linear hsv | **8.8±0.20µs** | 9.9±0.37µs | **+12.50%** |
| Rgb family/hsv to rgb | 1996.5±52.13ns | 2.0±0.05µs | +0.18% |
| Rgb family/hwb to hsv | 425.7±8.34ns | 425.8±9.23ns | +0.02% |
| Rgb family/hwb to linear hwb | **9.9±0.29µs** | 10.4±0.42µs | **+5.05%** |
| Rgb family/linear hsl to hsl | **10.0±0.40µs** | 11.6±0.25µs | **+16.00%** |
| Rgb family/linear hsv to hsv | **9.0±0.20µs** | 11.0±0.32µs | **+22.22%** |
| Rgb family/linear hwb to hwb | **10.0±0.23µs** | 11.6±0.46µs | **+16.00%** |
| Rgb family/linsrgb to rgb | 5.5±0.13µs | 5.5±0.12µs | 0.00% |
| Rgb family/linsrgb_f32 to rgb_u8 | 6.1±0.13µs | 6.1±0.19µs | 0.00% |
| Rgb family/rgb to hsl | **746.6±13.20ns** | 1216.8±33.13ns | **+62.98%** |
| Rgb family/rgb to hsv | **603.3±14.15ns** | 1152.6±30.72ns | **+91.05%** |
| Rgb family/rgb to linsrgb | 5.2±0.12µs | 5.2±0.12µs | 0.00% |
| Rgb family/rgb_u8 to linsrgb_f32 | 5.7±0.12µs | 5.7±0.25µs | 0.00% |
| Rgb family/xyz to linsrgb | 5.0±0.10µs | 5.0±0.23µs | 0.00% |

github-actions[bot] commented 2 years ago

Benchmark for 7787441

| Test | Base | PR | % |
|------|--------------|------------------|---|
| Cie family/lab to lch | 3.3±0.09µs | 3.3±0.05µs | 0.00% |
| Cie family/lab to xyz | 829.1±12.54ns | 829.8±11.34ns | +0.08% |
| Cie family/lch to lab | 2.4±0.04µs | 2.4±0.04µs | 0.00% |
| Cie family/linsrgb to xyz | 3.7±0.06µs | 3.7±0.07µs | 0.00% |
| Cie family/xyz to lab | 18.6±0.41µs | 18.6±0.53µs | 0.00% |
| Cie family/xyz to yxy | 632.6±21.42ns | **534.1±9.35ns** | **-15.57%** |
| Cie family/yxy to xyz | 532.5±8.47ns | **504.6±7.63ns** | **-5.24%** |
| Matrix functions/matrix_inverse | 10.5±0.18ns | 10.5±0.14ns | 0.00% |
| Matrix functions/multiply_3x3 | 14.5±0.37ns | 14.5±0.20ns | 0.00% |
| Matrix functions/multiply_rgb_to_xyz | 6.6±0.12ns | 6.6±0.15ns | 0.00% |
| Matrix functions/multiply_xyz | 6.6±0.11ns | 6.6±0.11ns | 0.00% |
| Matrix functions/multiply_xyz_to_rgb | 6.6±0.12ns | 6.6±0.08ns | 0.00% |
| Matrix functions/rgb_to_xyz_matrix | 22.8±0.42ns | 23.0±1.43ns | +0.88% |
| Rgb family/hsl to hsv | 624.6±8.23ns | **587.1±8.71ns** | **-6.00%** |
| Rgb family/hsl to linear hsl | **10.0±0.15µs** | 11.6±0.23µs | **+16.00%** |
| Rgb family/hsl to rgb | **2.3±0.03µs** | 2.4±0.06µs | **+4.35%** |
| Rgb family/hsv to hsl | **1045.7±19.40ns** | 1340.4±31.16ns | **+28.18%** |
| Rgb family/hsv to hwb | 232.9±5.72ns | 232.4±3.39ns | -0.21% |
| Rgb family/hsv to linear hsv | **10.0±0.20µs** | 11.0±0.26µs | **+10.00%** |
| Rgb family/hsv to rgb | 2.3±0.04µs | 2.3±0.04µs | 0.00% |
| Rgb family/hwb to hsv | 482.8±8.75ns | 482.8±8.13ns | 0.00% |
| Rgb family/hwb to linear hwb | **11.2±0.22µs** | 11.5±0.15µs | **+2.68%** |
| Rgb family/linear hsl to hsl | **11.3±0.20µs** | 13.1±0.21µs | **+15.93%** |
| Rgb family/linear hsv to hsv | **10.2±0.17µs** | 12.3±0.58µs | **+20.59%** |
| Rgb family/linear hwb to hwb | **11.3±0.17µs** | 12.9±0.24µs | **+14.16%** |
| Rgb family/linsrgb to rgb | 6.2±0.08µs | 6.2±0.29µs | 0.00% |
| Rgb family/linsrgb_f32 to rgb_u8 | 6.9±0.10µs | 6.9±0.12µs | 0.00% |
| Rgb family/rgb to hsl | **835.5±17.86ns** | 1246.2±17.09ns | **+49.16%** |
| Rgb family/rgb to hsv | **687.4±14.40ns** | 1234.3±23.89ns | **+79.56%** |
| Rgb family/rgb to linsrgb | 6.0±0.14µs | 6.0±0.13µs | 0.00% |
| Rgb family/rgb_u8 to linsrgb_f32 | 6.4±0.09µs | 6.4±0.12µs | 0.00% |
| Rgb family/xyz to linsrgb | 5.6±0.07µs | 5.6±0.08µs | 0.00% |

github-actions[bot] commented 2 years ago

Benchmark for 50c6381

| Test | Base | PR | % |
|------|--------------|------------------|---|
| Cie family/lab to lch | 4.0±0.22µs | 3.9±0.20µs | -2.50% |
| Cie family/lab to xyz | 1015.0±38.42ns | 1008.9±45.37ns | -0.60% |
| Cie family/lch to lab | 2.9±0.28µs | 2.9±0.12µs | 0.00% |
| Cie family/linsrgb to xyz | 4.4±0.13µs | 4.5±0.17µs | +2.27% |
| Cie family/xyz to lab | 22.5±0.72µs | 22.9±1.11µs | +1.78% |
| Cie family/xyz to yxy | 783.3±37.36ns | **652.0±25.90ns** | **-16.76%** |
| Cie family/yxy to xyz | 646.9±19.93ns | **618.8±36.49ns** | **-4.34%** |
| Matrix functions/matrix_inverse | 12.9±0.49ns | 12.9±0.42ns | 0.00% |
| Matrix functions/multiply_3x3 | 17.8±1.08ns | 17.6±0.60ns | -1.12% |
| Matrix functions/multiply_rgb_to_xyz | 8.1±0.30ns | 8.1±0.37ns | 0.00% |
| Matrix functions/multiply_xyz | 8.1±0.49ns | 8.0±0.39ns | -1.23% |
| Matrix functions/multiply_xyz_to_rgb | 8.1±0.34ns | 8.0±0.29ns | -1.23% |
| Matrix functions/rgb_to_xyz_matrix | 27.7±1.32ns | 27.5±1.00ns | -0.72% |
| Rgb family/hsl to hsv | 760.1±30.94ns | 761.8±30.63ns | +0.22% |
| Rgb family/hsl to linear hsl | **12.4±1.19µs** | 14.2±0.72µs | **+14.52%** |
| Rgb family/hsl to rgb | 2.8±0.11µs | 2.9±0.33µs | +3.57% |
| Rgb family/hsv to hsl | **1274.7±48.60ns** | 1458.4±60.08ns | **+14.41%** |
| Rgb family/hsv to hwb | 284.5±14.22ns | 283.2±8.62ns | -0.46% |
| Rgb family/hsv to linear hsv | **12.2±0.50µs** | 13.2±0.58µs | **+8.20%** |
| Rgb family/hsv to rgb | 2.8±0.14µs | 2.7±0.10µs | -3.57% |
| Rgb family/hwb to hsv | **587.8±29.31ns** | 763.5±30.92ns | **+29.89%** |
| Rgb family/hwb to linear hwb | **13.6±0.56µs** | 14.2±0.60µs | **+4.41%** |
| Rgb family/linear hsl to hsl | **13.9±0.53µs** | 15.3±1.18µs | **+10.07%** |
| Rgb family/linear hsv to hsv | **12.6±0.66µs** | 15.6±0.66µs | **+23.81%** |
| Rgb family/linear hwb to hwb | **14.2±0.63µs** | 16.4±0.75µs | **+15.49%** |
| Rgb family/linsrgb to rgb | 7.5±0.25µs | 7.6±0.37µs | +1.33% |
| Rgb family/linsrgb_f32 to rgb_u8 | 8.3±0.23µs | 8.3±0.31µs | 0.00% |
| Rgb family/rgb to hsl | **1037.9±47.12ns** | 1528.1±58.44ns | **+47.23%** |
| Rgb family/rgb to hsv | **830.8±30.53ns** | 1523.1±164.19ns | **+83.33%** |
| Rgb family/rgb to linsrgb | 7.3±0.42µs | 7.3±0.41µs | 0.00% |
| Rgb family/rgb_u8 to linsrgb_f32 | 7.8±0.39µs | 8.0±0.61µs | +2.56% |
| Rgb family/xyz to linsrgb | **6.9±0.32µs** | 7.5±0.40µs | **+8.70%** |

Ogeon commented 2 years ago

It's a bummer that the RGB to HSL and RGB to HSV conversions are so much slower. I'll try putting the old implementation behind type ID checks (i.e. Great Value Specialization) for now and see if it works better. I should also see if I can add benchmarks for the SIMD versions before merging this.
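
For readers unfamiliar with the trick, the following is a minimal sketch of dispatching on type ID inside a generic function. It is not the code from this PR: `double_scalar`, `double_generic` and `double` are hypothetical stand-ins for an older scalar routine, a SIMD-friendly generic routine, and the dispatching wrapper. The returned label exists only to make the chosen path visible.

```rust
use std::any::{Any, TypeId};
use std::ops::Add;

/// Hypothetical scalar-only routine, standing in for an older `f32` code path.
fn double_scalar(x: f32) -> f32 {
    x * 2.0
}

/// Hypothetical generic routine, standing in for a SIMD-friendly code path.
fn double_generic<T: Copy + Add<Output = T>>(x: T) -> T {
    x + x
}

/// Picks a code path by comparing type IDs. `T` is fixed per instantiation,
/// so the comparison has the same outcome on every call for a given `T`.
fn double<T>(x: T) -> (T, &'static str)
where
    T: Copy + Add<Output = T> + 'static,
{
    if TypeId::of::<T>() == TypeId::of::<f32>() {
        // `T` is `f32` in this branch, so both downcasts succeed.
        let input: f32 = *(&x as &dyn Any)
            .downcast_ref::<f32>()
            .expect("T was checked to be f32");
        let result: f32 = double_scalar(input);
        let output: T = *(&result as &dyn Any)
            .downcast_ref::<T>()
            .expect("T was checked to be f32");
        (output, "scalar")
    } else {
        (double_generic(x), "generic")
    }
}
```

Calling `double(1.5f32)` takes the scalar branch, while `double(2.0f64)` or a SIMD vector type would fall through to the generic one. The `'static` bound is required because `TypeId` and `Any` only work with types that contain no borrowed data.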

github-actions[bot] commented 2 years ago

Benchmark for 48c254f

| Test | Base | PR | % |
|------|--------------|------------------|---|
| Cie family/lab to lch | 3.2±0.17µs | 3.1±0.21µs | -3.13% |
| Cie family/lab to xyz | 799.5±45.43ns | 780.7±46.96ns | -2.35% |
| Cie family/lch to lab | 2.3±0.13µs | 2.2±0.13µs | -4.35% |
| Cie family/linsrgb to xyz | 3.5±0.31µs | 3.5±0.25µs | 0.00% |
| Cie family/xyz to lab | **16.9±0.93µs** | 18.3±2.05µs | **+8.28%** |
| Cie family/xyz to yxy | 608.9±34.74ns | **524.6±114.27ns** | **-13.84%** |
| Cie family/yxy to xyz | 511.8±30.11ns | **481.6±28.70ns** | **-5.90%** |
| Matrix functions/matrix_inverse | 9.8±0.63ns | 9.7±0.55ns | -1.02% |
| Matrix functions/multiply_3x3 | 13.5±0.87ns | 13.5±1.42ns | 0.00% |
| Matrix functions/multiply_rgb_to_xyz | 6.3±0.41ns | 6.3±0.40ns | 0.00% |
| Matrix functions/multiply_xyz | 6.1±0.36ns | 5.9±0.27ns | -3.28% |
| Matrix functions/multiply_xyz_to_rgb | 6.3±0.37ns | 6.2±0.38ns | -1.59% |
| Matrix functions/rgb_to_xyz_matrix | 21.1±2.25ns | 21.4±1.78ns | +1.42% |
| Rgb family/hsl to hsv | **580.8±39.71ns** | 632.4±41.40ns | **+8.88%** |
| Rgb family/hsl to linear hsl | **9.3±0.59µs** | 10.4±1.43µs | **+11.83%** |
| Rgb family/hsl to rgb | 2.2±0.43µs | 2.3±0.12µs | +4.55% |
| Rgb family/hsv to hsl | **1005.7±65.28ns** | 1218.5±65.26ns | **+21.16%** |
| Rgb family/hsv to hwb | 218.0±13.57ns | 218.8±25.70ns | +0.37% |
| Rgb family/hsv to linear hsv | 9.8±2.69µs | 9.5±0.87µs | -3.06% |
| Rgb family/hsv to rgb | 2.1±0.13µs | 2.1±0.13µs | 0.00% |
| Rgb family/hwb to hsv | **450.9±31.24ns** | 548.0±32.73ns | **+21.53%** |
| Rgb family/hwb to linear hwb | 10.3±0.63µs | 10.3±0.59µs | 0.00% |
| Rgb family/linear hsl to hsl | 10.7±0.67µs | 10.8±0.94µs | +0.93% |
| Rgb family/linear hsv to hsv | 9.6±0.59µs | 9.8±0.53µs | +2.08% |
| Rgb family/linear hwb to hwb | 10.8±0.65µs | 10.6±1.00µs | -1.85% |
| Rgb family/linsrgb to rgb | 5.8±0.37µs | 5.7±0.35µs | -1.72% |
| Rgb family/linsrgb_f32 to rgb_u8 | 6.4±0.40µs | 6.4±0.41µs | 0.00% |
| Rgb family/rgb to hsl | 820.3±54.94ns | 842.7±57.08ns | +2.73% |
| Rgb family/rgb to hsv | 646.8±37.43ns | 681.2±155.10ns | +5.32% |
| Rgb family/rgb to linsrgb | 5.5±0.30µs | 5.7±0.38µs | +3.64% |
| Rgb family/rgb_u8 to linsrgb_f32 | 5.9±0.35µs | 6.0±0.37µs | +1.69% |
| Rgb family/xyz to linsrgb | 5.3±0.45µs | 5.4±0.32µs | +1.89% |

Ogeon commented 2 years ago

Looks like the performance gain varies from nothing to several times faster, depending on the work. Converting sRGB to linear RGB is even a bit slower on my machine (possibly due to the powf implementation), converting RGB to HSV or HSL is slightly faster if I use f32x8 but almost equal with f32x4, and converting between XYZ and RGB scales pretty well with the number of lanes. My CPU is not particularly new, though, so YMMV, as always with performance.

I don't think I will go through and optimize everything now. I just wanted to make sure there's any improvement at all.

Ogeon commented 2 years ago

The benchmark fails because the wide feature isn't on master, but the logs show similar results. And it's pretty cool that it's still feasible to run these benchmarks here!

Ogeon commented 2 years ago

bors r+

bors[bot] commented 2 years ago

Build succeeded: