[Closed] KYovchevski closed this 7 months ago
I did some testing with what affects the performance for the test case we have (3 channels, 2048x2048 -> 512x512), and have made some interesting observations:

- In `sample_3_channels` and `clean_and_write_3_channels`, a `float<4>` was interpreted as a `float<3>` to use a single write. Skipping this reinterpretation and giving the 3-channel and 4-channel versions different signatures causes a significant speed-up. However, this means we need both a `float<3>` and a `float<4>` in the function that we can write to in the branch (see the sketch after this list).
- Making `resample_with_cache` `inline` causes a performance decrease of about 10% for our test case. This includes the usage of ISPC's `assume` hints, which would make sure that the branches are removed.
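For illustration, a minimal sketch of what the split could look like on the Rust side; the entry-point names and signatures are hypothetical, not the crate's actual API:

```rust
// Hypothetical ISPC entry points with distinct signatures, so the 3-channel
// path writes three values directly instead of reinterpreting a float<4>.
extern "C" {
    fn downsample_3_channels(src: *const u8, src_w: u32, src_h: u32,
                             dst: *mut u8, dst_w: u32, dst_h: u32);
    fn downsample_4_channels(src: *const u8, src_w: u32, src_h: u32,
                             dst: *mut u8, dst_w: u32, dst_h: u32);
}

// Branch on the channel count once on the Rust side, instead of relying on
// function pointers inside the ISPC kernel.
fn downsample(src: &[u8], dst: &mut [u8], src_w: u32, src_h: u32,
              dst_w: u32, dst_h: u32, channels: u32) {
    match channels {
        3 => unsafe { downsample_3_channels(src.as_ptr(), src_w, src_h,
                                            dst.as_mut_ptr(), dst_w, dst_h) },
        4 => unsafe { downsample_4_channels(src.as_ptr(), src_w, src_h,
                                            dst.as_mut_ptr(), dst_w, dst_h) },
        n => panic!("unsupported channel count: {n}"),
    }
}
```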
Since the gist of this PR is improving quality - and, if I understand correctly, using weight "caching" to get there without outlandish times - how does it fare against `resize` from a quality perspective? `main` is already slower than `resize` (45ms vs 37ms) and this PR bumps us to 61ms:
```
Downsample `square_test.png` using ispc_downsampler
                        time:   [61.593 ms 61.710 ms 61.838 ms]
                        change: [+36.286% +36.552% +36.809%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Downsample `square_test.png` using resize
                        time:   [37.770 ms 37.803 ms 37.839 ms]
                        change: [-0.2602% -0.1511% -0.0328%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 19 outliers among 100 measurements (19.00%)
  7 (7.00%) high mild
  12 (12.00%) high severe
```
EDIT: The win is mostly in debug/dev profiles:
```
$ cargo bench --profile dev
...
Downsample `square_test.png` using ispc_downsampler
                        time:   [108.09 ms 108.13 ms 108.17 ms]
                        change: [+74.846% +75.217% +75.549%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

Benchmarking Downsample `square_test.png` using resize: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 120.5s, or reduce sample count to 10.
Downsample `square_test.png` using resize
                        time:   [1.1973 s 1.1994 s 1.2013 s]
                        change: [+3066.5% +3072.6% +3078.3%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 23 outliers among 100 measurements (23.00%)
  8 (8.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe
```
While preparing the talk for UU, I noticed that the quality of images produced by our downsampler is very low compared to other downsamplers. I took the time to research why that is, and it turned out that we aren't taking nearly enough samples when sampling down. For example, when sampling from 2048x2048 down to 512x512, we would always use a 6x6 kernel, while other samplers would use 12x12, and adapt that number further depending on the ratio between the source and target dimensions.
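As a rough sketch of that adaptive sizing (the 1.5-texel filter radius and the function name are assumptions for illustration, not what the implementation actually uses):

```rust
// Minimal sketch of ratio-adaptive kernel sizing. FILTER_RADIUS is an
// assumed radius in destination texels; the real filter may differ.
fn kernel_size(src_dim: u32, dst_dim: u32) -> u32 {
    const FILTER_RADIUS: f32 = 1.5;
    let scale = src_dim as f32 / dst_dim as f32; // e.g. 2048 / 512 = 4.0
    // The filter footprint in source texels grows with the scale factor:
    // 2 * 1.5 * 4.0 = 12, i.e. a 12x12 kernel for this test case, versus
    // the fixed 6x6 the old implementation used regardless of the ratio.
    (2.0 * FILTER_RADIUS * scale).ceil() as u32
}
```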
I also took inspiration from how other downsamplers handle working with large numbers of samples by caching some of the math for reuse.
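A minimal sketch of that caching idea, assuming the weights depend only on the position along one axis; the tent filter and the names here are placeholders, not the actual implementation:

```rust
// Precompute the filter weights once per destination column; since they are
// identical for every row, this amortizes the filter math across the image.
// Normalization is omitted for brevity.
fn precompute_weights(src_w: u32, dst_w: u32, radius: f32) -> Vec<Vec<f32>> {
    let scale = src_w as f32 / dst_w as f32;
    let support = radius * scale; // filter footprint in source texels
    (0..dst_w)
        .map(|dx| {
            // Center of destination texel dx, in source coordinates.
            let center = (dx as f32 + 0.5) * scale;
            let start = (center - support).floor() as i32;
            let end = (center + support).ceil() as i32;
            (start..end)
                .map(|sx| tent((sx as f32 + 0.5 - center) / support))
                .collect()
        })
        .collect()
}

// A simple tent (triangle) filter stands in for the real kernel.
fn tent(t: f32) -> f32 {
    (1.0 - t.abs()).max(0.0)
}
```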
The result is a new implementation which preserves image quality much better, but is about twice as slow. The performance can probably be improved by splitting the ISPC kernel into two - one for 3 channels and one for 4 channels - and doing the branch in Rust instead of relying on function pointers in ISPC. We might be able to squeeze out more performance with cache optimizations, but that needs further investigation.
The old implementation is kept in both ISPC and Rust, and can be invoked using `downsample_fast`.