[Closed] KYovchevski closed this 7 months ago
I did some testing with what affects the performance for the test case we have (3 channels, 2048x2048 -> 512x512), and have made some interesting observations:

- In `sample_3_channels` and `clean_and_write_3_channels`, a `float<4>` was interpreted as a `float<3>` to use a single write. Skipping this reinterpretation and giving the 3-channel and 4-channel versions different signatures causes a significant speed-up. However, this means we need both a `float<3>` and a `float<4>` in the function that we can write to in the branch (see the sketch after this list).
- Making `resample_with_cache` `inline` causes a performance decrease of about 10% for our test case. This includes the usage of ISPC's `assume` hints, which would make sure that the branches are removed.
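For illustration, a minimal sketch of what the split could look like on the Rust side; the entry-point names and signatures are hypothetical, not the crate's actual API:

```rust
// Hypothetical ISPC entry points with distinct signatures, so the 3-channel
// path writes three values directly instead of reinterpreting a float<4>.
extern "C" {
    fn downsample_3_channels(src: *const u8, src_w: u32, src_h: u32,
                             dst: *mut u8, dst_w: u32, dst_h: u32);
    fn downsample_4_channels(src: *const u8, src_w: u32, src_h: u32,
                             dst: *mut u8, dst_w: u32, dst_h: u32);
}

// Branch on the channel count once on the Rust side, instead of relying on
// function pointers inside the ISPC kernel.
fn downsample(src: &[u8], dst: &mut [u8], src_w: u32, src_h: u32,
              dst_w: u32, dst_h: u32, channels: u32) {
    match channels {
        3 => unsafe { downsample_3_channels(src.as_ptr(), src_w, src_h,
                                            dst.as_mut_ptr(), dst_w, dst_h) },
        4 => unsafe { downsample_4_channels(src.as_ptr(), src_w, src_h,
                                            dst.as_mut_ptr(), dst_w, dst_h) },
        n => panic!("unsupported channel count: {n}"),
    }
}
```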
Since the gist of this PR is improving quality - and, if I understand correctly, using weight "caching" to get there without outlandish times - how does it fare against `resize` from a quality perspective? `main` is already slower than `resize` (45ms vs 37ms) and this PR bumps us to 61ms:
```
Downsample `square_test.png` using ispc_downsampler
                        time:   [61.593 ms 61.710 ms 61.838 ms]
                        change: [+36.286% +36.552% +36.809%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Downsample `square_test.png` using resize
                        time:   [37.770 ms 37.803 ms 37.839 ms]
                        change: [-0.2602% -0.1511% -0.0328%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 19 outliers among 100 measurements (19.00%)
  7 (7.00%) high mild
  12 (12.00%) high severe
```
EDIT: The win is mostly in debug/dev profiles:
```
$ cargo bench --profile dev
...
Downsample `square_test.png` using ispc_downsampler
                        time:   [108.09 ms 108.13 ms 108.17 ms]
                        change: [+74.846% +75.217% +75.549%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

Benchmarking Downsample `square_test.png` using resize: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 120.5s, or reduce sample count to 10.
Downsample `square_test.png` using resize
                        time:   [1.1973 s 1.1994 s 1.2013 s]
                        change: [+3066.5% +3072.6% +3078.3%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 23 outliers among 100 measurements (23.00%)
  8 (8.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe
```
While preparing the talk for UU, I noticed that the quality of images produced by our downsampler is very low compared to other downsamplers. I took the time to research why that is, and it turned out that we aren't taking nearly enough samples when sampling down. For example, when sampling from 2048x2048 down to 512x512, we would always use a 6x6 kernel, while other samplers would use 12x12, and adapt that number further depending on the ratio between the source and target dimensions.
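As a rough sketch of that adaptive sizing (the 1.5-texel filter radius and the function name are assumptions for illustration, not what the implementation actually uses):

```rust
// Minimal sketch of ratio-adaptive kernel sizing. FILTER_RADIUS is an
// assumed radius in destination texels; the real filter may differ.
fn kernel_size(src_dim: u32, dst_dim: u32) -> u32 {
    const FILTER_RADIUS: f32 = 1.5;
    let scale = src_dim as f32 / dst_dim as f32; // e.g. 2048 / 512 = 4.0
    // The filter footprint in source texels grows with the scale factor:
    // 2 * 1.5 * 4.0 = 12, i.e. a 12x12 kernel for this test case, versus
    // the fixed 6x6 the old implementation used regardless of the ratio.
    (2.0 * FILTER_RADIUS * scale).ceil() as u32
}
```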
I also took inspiration from how other downsamplers handle working with large numbers of samples by caching some of the math for reuse.
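A minimal sketch of that caching idea, assuming the weights depend only on the position along one axis; the tent filter and the names here are placeholders, not the actual implementation:

```rust
// Precompute the filter weights once per destination column; since they are
// identical for every row, this amortizes the filter math across the image.
// Normalization is omitted for brevity.
fn precompute_weights(src_w: u32, dst_w: u32, radius: f32) -> Vec<Vec<f32>> {
    let scale = src_w as f32 / dst_w as f32;
    let support = radius * scale; // filter footprint in source texels
    (0..dst_w)
        .map(|dx| {
            // Center of destination texel dx, in source coordinates.
            let center = (dx as f32 + 0.5) * scale;
            let start = (center - support).floor() as i32;
            let end = (center + support).ceil() as i32;
            (start..end)
                .map(|sx| tent((sx as f32 + 0.5 - center) / support))
                .collect()
        })
        .collect()
}

// A simple tent (triangle) filter stands in for the real kernel.
fn tent(t: f32) -> f32 {
    (1.0 - t.abs()).max(0.0)
}
```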
The result is a new implementation which preserves image quality much better, but is about twice as slow. The performance can probably be improved by splitting the ISPC kernel into two - one for 3 channels and one for 4 channels - and doing the branch in Rust instead of relying on function pointers in ISPC. We might be able to squeeze out more performance with cache optimizations, but that needs further investigation.
The old implementation is kept in both ISPC and Rust, and can be invoked using `downsample_fast`.