Cheaper compute_lowest_and_highest_weight

rygorous commented 2 weeks ago

The actual weights array is constant, so we can compute the min/max weights in a single pass ahead of time. The remapping we apply to the weights depending on rcp_stepsize and offset is monotonic; therefore we can figure out minidx and maxidx in advance from the min/max weights we determined previously.

This in turn means that we don't need to do the "reset" handling in the loop where we zero the accumulator whenever the current min/max changes. The amount of code in the inner loop nest is considerably reduced as a result, and the extra preprocessing outside that loop is generally less than the work it replaces.

The exact impact depends on the block size and max_angular_steps. Testing with RGB textures and -thorough quality, I observe approximately 0.7% coding time reduction with 4x4 blocks; 6x6 and larger see a 1.2% reduction.

At -fast quality, there is not much gain, but neither is it any slower.

In my tests, this change does not alter encoding results (nor should it); the output is bitwise identical to before.

solidpixel commented 2 weeks ago

That's a nice one.

Intel i5 AVX2 performance:

4x4: 1.91 MT/s increases to 1.93 MT/s
6x6: 1.84 MT/s increases to 1.89 MT/s

solidpixel commented 2 weeks ago

Two follow on thoughts that this has triggered, which I'll investigate ...

Can we cache the min/max when we first write the weights array, and avoid the preamble search here?
Failing that, can we pad out the weights array to avoid the preamble loop tail? (I've been trying to avoid loop tails, on the hope I can get around to adding SVE predicated loops at some point).

rygorous commented 2 weeks ago

For 1., haven't looked at where the weights come from yet, possibly, however: 2. yes, quite cheaply. The idea would be to load1 the final weight and then store it back where we read it from. This needs nlanes-1 extra padding after the end of the weights array. Padding with extra copies of one of the weights changes neither the min nor max.

Or writing this down now, an alternative would be to read the entire final aligned group, select() against a load1-ed version of the final weight to put the cloned weights past the end of the real weights, and do an aligned store back. A few more instructions but everything aligned and no data layout changes. I can try that later.

solidpixel commented 2 weeks ago

Hmm, the padding at source isn't totally trivial, at least not without adding more work than it saves.

The storage is already padded and filled with zeros in compute_ideal_colors_and_weights_*().

We can easily change that to fill the weight tail with the last weight value, but we also need to replicate the last weight_error_scale value so that the weight in the tail moves the same as the last real weight under refinement. However, we also rely on that weight_error_scale for the tail being zero so it is not generating values in error accumulation.

Adding masking to either zero or replicate the last value are more expensive than just having the scalar tail as in the PR.

rygorous commented 2 weeks ago

Why? We only need the non-zero-padded weights right around the min/max loop. We can put back the original zero padding immediately after.

solidpixel commented 2 weeks ago

True. I was trying to do it once at source and avoid the two-step later.

Given that the min/max scan loop is only run once, it actually seems fine to just use a predicated scan. Testing here and this doesn't seem measurably slower on an i5.

    vint lane_id = vint::lane_id();
    for (unsigned int i = 0; i < weight_count; i += ASTCENC_SIMD_WIDTH)
    {
        vmask active = lane_id < vint(weight_count);
        lane_id += vint(ASTCENC_SIMD_WIDTH);

        vfloat weights = loada(dec_weight_ideal_value + i);
        min_weight = min(min_weight, select(min_weight, weights, active));
        max_weight = max(max_weight, select(max_weight, weights, active));
    }

ARM-software / astc-encoder

Cheaper compute_lowest_and_highest_weight #516