Netflix / vmaf

Perceptual video quality assessment based on multi-method fusion.
Other
4.49k stars 748 forks source link

Possible optimization in integer vif computation #940

Open foosoftsrlold opened 2 years ago

foosoftsrlold commented 2 years ago

In this code:

            const double eps = 65536 * 1.0e-10;
            double g = sigma12 / (sigma1_sq + eps);

adding eps(ilon) seems not that useful, as g is used only when sigma1_sq > 2

The division by zero is easily avoided moving g and sv_sq computation inside the if (sigma1_sq >= sigma_nsq) branch

igv commented 2 years ago

Even better solution:

            sigma1_sq = MAX(sigma1_sq, eps);
            sigma2_sq = MAX(sigma2_sq, eps);
            sigma12 = MAX(sigma12, eps);

It also makes these blocks redundant. See my adjusted version.

kylophone commented 2 years ago

How does this change the numerical result? What is the savings?

foosoftsrlold commented 2 years ago

Hi Kile,

That epsilon is numerically irrelevant. So irrelevant that I really doubt that any retraining is needed.

g is used only when sigma1_sq is > 2, epsilon is 6.5e-4.... it is a really tiny fraction.

       double g = sigma12 / (sigma1_sq + eps);

In normal cases... sigma1_sq is definitely larger (it is the filtered variance of reference frames). On a simple test sequence, VMAF simply was not impacted. I would be surprised if VMAF change were over 1e-6 for non degenerate cases (i.e. all black)

This is the best I came up with, trying to avoid floating point as much as possible

            sigma2_sq = MAX(sigma2_sq, 0);
            if (sigma1_sq >= sigma_nsq) {
                uint32_t log_den_stage1 = (uint32_t)(sigma_nsq + sigma1_sq);
                int x;
                uint16_t log_den1 = get_best16_from32(log_den_stage1, &x);

                /**
                * log values are taken from the look-up table generated by
                * log_generate() function which is called in integer_combo_threadfunc
                * den_val in float is log2(1 + sigma1_sq/2)
                * here it is converted to equivalent of log2(2+sigma1_sq) - log2(2) i.e log2(2*65536+sigma1_sq) - 17
                * multiplied by 2048 as log_value = log2(i)*2048 i=16384 to 65535 generated using log_value
                * x because best 16 bits are taken
                */
                accum_x += x + 17;
                den_val = log2_table[log_den1];

                // this check can go away, things will work anyway
                if (sigma12 > 0) {
                    // num_val = log2f(1.0f + (g * g * sigma1_sq) / (sv_sq + sigma_nsq));
                    /**
                    * In floating-point numerator = log2((1.0f + (g * g * sigma1_sq)/(sv_sq + sigma_nsq))
                    *
                    * In Fixed-point the above is converted to
                    * numerator = log2((sv_sq + sigma_nsq)+(g * g * sigma1_sq))- log2(sv_sq + sigma_nsq)
                    */

                    const double eps = 65536 * 1.0e-10;
                    double g = sigma12 / (sigma1_sq + eps); // this epsilon can go away
                    int32_t sv_sq = sigma2_sq - g * sigma12;

                    if (sigma2_sq == 0) { // this was... sigma2_sq < eps
                        g = 0.0;
                    }

                    sv_sq = (uint32_t)(MAX(sv_sq, 0));

                    g = MIN(g, vif_enhn_gain_limit);

                    int x1, x2;
                    uint32_t numer1 = (sv_sq + sigma_nsq);
                    int64_t numer1_tmp = (int64_t)((g * g * sigma1_sq)) + numer1; //numerator
                    uint16_t numlog = get_best16_from64((uint64_t)numer1_tmp, &x1);
                    uint16_t denlog = get_best16_from64((uint64_t)numer1, &x2);
                    accum_x2 += (x2 - x1);
                    num_val = log2_table[numlog] - log2_table[denlog];
                    accum_num_log += num_val;
                    accum_den_log += den_val;

                }
                else {
                    //accum_num_log += 0;
                    accum_den_log += den_val;
                }
            }
            else {
                accum_num_non_log += sigma2_sq;
                accum_den_non_log++;
            }

Regarding igv proposal, I'm sure that it's numerically ok, and it's possibly more consistent with float vif. But I humbly believe that the refactoring I propose is in the end more effective, especially if SIMD parallelization is desired (for Intel... I think that AVX512 is required)