ffvvc / FFmpeg

VVC Decoder for ffmpeg

enable deblock luma strong filter #256

Closed nuomi2021 closed 1 month ago

nuomi2021 commented 1 month ago

@stone-d-chen , vvc_v_loop_filter_luma_10_strong_avx still fails, but you can use it as a reference for how to load maxlen{q, p} in chroma.

thank you

stone-d-chen commented 1 month ago

Thanks I'll take a look!

nuomi2021 commented 1 month ago

@stone-d-chen I checked the deblock code.

It always operates on 8 lines at a time for luma, with each set of 8 lines containing 2 deblock units. When one deblock unit is weak and the other is strong, it still requires the strong filter and a mask. Additionally, most filter-type decisions are not vectorizable, making this approach inefficient.

How about we hook into loop_filter_luma_large, loop_filter_luma_strong, and loop_filter_luma_weak and let the C code handle the filter decision?

It would simplify our asm code.

stone-d-chen commented 1 month ago

How about we hook into loop_filter_luma_large, loop_filter_luma_strong, and loop_filter_luma_weak and let the C code handle the filter decision?

It would simplify our asm code.

I'm okay with it on the basis of simplifying

It always operates on 8 lines at a time for luma, with each set of 8 lines containing 2 deblock units. When one deblock unit is weak and the other is strong, it still requires the strong filter and a mask.

I've only skimmed the luma code, so there might be a complexity that makes this infeasible, but I think my concern re: performance is that the C code computes the decision for the first deblock unit, then computes the appropriate filter, then the second decision, and then the second filter.

https://github.com/ffvvc/FFmpeg/blob/2aab4e4cc0b4a666c7e5a752b25a337710b20bdb/libavcodec/vvc/filter_template.c#L543

If we directly hook each filter, we can only process 4 pixels at a time, or just half of an XMM register. I think this leaves a lot of performance on the table, since my assumption is that the statistically most common case is that both deblock units have the same filter (e.g. weak/weak or strong/strong) for most videos, particularly 4K/8K (the assumption being that 8x8-pixel blocks in high-res video should be very similar).

I.e. I think the ability to compute 8 pixels at a time should always be worth it: the overhead of dealing with the mixed masking would need to be at least 2x the computation time to break even on the benefit of loading 8 pixels. This is under the assumption that mixed deblock units are evenly distributed among non-mixed blocks, which I don't think is the case.

Re: the filter decision type, I think there's still enough benefit to loading 8 pixels where it's probably not a net-negative. At a minimum each deblock unit performs the same computation on the (0,3),(0,3) lines for the two deblock units so there's something like 2x speed up there. This could however get negated by the complexity of computing the masks.
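To make the mixed case concrete, here is a minimal scalar sketch of selecting per-unit filter results with a mask; the names and data layout are hypothetical, not the actual ffvvc code:

```c
#include <stdint.h>

/* Hypothetical sketch: one row of 8 luma pixels spans two 4-pixel deblock
 * units. Both the strong and weak filter results have been computed for the
 * whole row; bit i of unit_mask selects the strong result for unit i. */
static void blend_deblock_row(uint16_t *dst, const uint16_t *strong,
                              const uint16_t *weak, unsigned unit_mask)
{
    for (int unit = 0; unit < 2; unit++) {
        const uint16_t *src = ((unit_mask >> unit) & 1) ? strong : weak;
        for (int i = 0; i < 4; i++)
            dst[unit * 4 + i] = src[unit * 4 + i];
    }
}
```

In SIMD this per-unit selection collapses into a single blend (e.g. vpblendvb) against a precomputed mask, which is exactly the overhead being weighed here.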

I think another way would be to modify the C code so that we defer the filter computation until the decision for each deblock unit is calculated; conceptually:

int deblock_queue[2];
for (int i = 0; i < 2; i++) {
    if (luma_large) {
        deblock_queue[i] = 0;
    } else if (strong) {
        deblock_queue[i] = 1;
    } else if (weak) {
        deblock_queue[i] = 2;
    }
}

process_deblock(deblock_queue);

// filters can process an arbitrary number of deblock units at a time

I'm not sure how the CTUs work but if they're mostly contiguous, we could also load 4 deblock units at a time and fill an entire YMM register.
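One way the queue could be consumed, sketched under the assumption (not taken from the actual code) that consecutive units of the same type are batched into one vectorized filter call:

```c
/* Hypothetical filter types matching the queue values above. */
enum deblock_type { DEBLOCK_LARGE = 0, DEBLOCK_STRONG = 1, DEBLOCK_WEAK = 2 };

/* Length of the leading run of equal-typed units in the queue. */
static int longest_same_run(const int *queue, int n)
{
    int run = 1;
    while (run < n && queue[run] == queue[0])
        run++;
    return run;
}

/* Hypothetical dispatcher: hand each run of same-typed units to one
 * (vectorized) filter call instead of filtering unit by unit. */
static void process_deblock(const int *queue, int n)
{
    for (int i = 0; i < n; ) {
        int run = longest_same_run(queue + i, n - i);
        /* dispatch `run` units of type queue[i] to a single filter call */
        i += run;
    }
}
```

With 4 contiguous units of the same type, this would yield one call over a full YMM register's worth of pixels.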

nuomi2021 commented 1 month ago

How about we hook into loop_filter_luma_large, loop_filter_luma_strong, and loop_filter_luma_weak and let the C code handle the filter decision? It would simplify our asm code.

I'm okay with it on the basis of simplifying

It always operates on 8 lines at a time for luma, with each set of 8 lines containing 2 deblock units. When one deblock unit is weak and the other is strong, it still requires the strong filter and a mask.

I've only skimmed the luma code, so there might be a complexity that makes this infeasible, but I think my concern re: performance is that the C code computes the decision for the first deblock unit, then computes the appropriate filter, then the second decision, and then the second filter.

https://github.com/ffvvc/FFmpeg/blob/2aab4e4cc0b4a666c7e5a752b25a337710b20bdb/libavcodec/vvc/filter_template.c#L543

If we directly hook each filter, we can only process 4 pixels at a time, or just half of an XMM register. I think this leaves a lot of performance on the table, since my assumption is that the statistically most common case is that both deblock units have the same filter (e.g. weak/weak or strong/strong) for most videos, particularly 4K/8K (the assumption being that 8x8-pixel blocks in high-res video should be very similar).

Yes, this is my concern too. If processing 8 pixels at a time is important, maybe we can feed an entire CTU line to the asm code, so AVX2 can use a 16-pixel approach. We need to build masks for all deblock types anyway, and a longer width gives us more performance.

I.e. I think the ability to compute 8 pixels at a time should always be worth it: the overhead of dealing with the mixed masking would need to be at least 2x the computation time to break even on the benefit of loading 8 pixels. This is under the assumption that mixed deblock units are evenly distributed among non-mixed blocks, which I don't think is the case.

We can't save that 2x compute time: if a pair is mixed, you need to run both the weak and the strong filter (2x) anyway.

Re: the filter decision type, I think there's still enough benefit to loading 8 pixels where it's probably not a net-negative. At a minimum each deblock unit performs the same computation on the (0,3),(0,3) lines for the two deblock units so there's something like 2x speed up there. This could however get negated by the complexity of computing the masks.

I think another way would be to modify the C code so that we defer the filter computation until the decision for each deblock unit is calculated; conceptually:

int deblock_queue[2];
for (int i = 0; i < 2; i++) {
    if (luma_large) {
        deblock_queue[i] = 0;
    } else if (strong) {
        deblock_queue[i] = 1;
    } else if (weak) {
        deblock_queue[i] = 2;
    }
}

process_deblock(deblock_queue);

// filters can process an arbitrary number of deblock units at a time

I'm not sure how the CTUs work but if they're mostly contiguous, we could also load 4 deblock units at a time and fill an entire YMM register.

You can think of a CTU as a 32x32, 64x64, or 128x128 memory block. It may be shorter than 32/64/128 in the last row or column. I can make the change to pass all pixels in a CTU row/col to the asm code if you agree: https://github.com/ffvvc/FFmpeg/blob/deblock_asm/libavcodec/vvc/filter.c#L813
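A minimal sketch of that last-row/column clipping, with hypothetical names:

```c
/* Hypothetical helper: width/height of the CTU starting at `pos`,
 * clipped at the picture edge so the last row/column can be short. */
static int ctu_extent(int pos, int ctu_size, int pic_size)
{
    int remain = pic_size - pos;
    return remain < ctu_size ? remain : ctu_size;
}
```

The asm path would then receive the clipped extent rather than assuming a full 32/64/128 block.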

BTW: AVX2 shipped in 2013, so most PCs already have it. How about we support AVX2 only? It would make our code simpler.
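As a standalone sketch of the runtime gate (inside FFmpeg this would be av_get_cpu_flags() & AV_CPU_FLAG_AVX2; here the GCC/Clang builtin is used instead):

```c
/* Returns 1 if the host CPU supports AVX2, 0 otherwise.
 * Uses the GCC/Clang x86 builtin; non-x86 or other compilers get 0. */
static int have_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return !!__builtin_cpu_supports("avx2");
#else
    return 0;
#endif
}
```

Supporting AVX2 only would still need this check at init so pre-2013 CPUs fall back to the C path.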

stone-d-chen commented 1 month ago

Sorry for the late reply, work has been crazy

If we directly hook each filter, we can only process 4 pixels at a time, or just half of an XMM register. I think this leaves a lot of performance on the table, since my assumption is that the statistically most common case is that both deblock units have the same filter (e.g. weak/weak or strong/strong) for most videos, particularly 4K/8K (the assumption being that 8x8-pixel blocks in high-res video should be very similar).

Yes, this is my concern too. If processing 8 pixels at a time is important, maybe we can feed an entire CTU line to the asm code, so AVX2 can use a 16-pixel approach. We need to build masks for all deblock types anyway, and a longer width gives us more performance.

I.e. I think the ability to compute 8 pixels at a time should always be worth it: the overhead of dealing with the mixed masking would need to be at least 2x the computation time to break even on the benefit of loading 8 pixels. This is under the assumption that mixed deblock units are evenly distributed among non-mixed blocks, which I don't think is the case.

We can't save that 2x compute time: if a pair is mixed, you need to run both the weak and the strong filter (2x) anyway.

Yeah, my numbers are off. I meant that it depends on both the overhead of computing the masks and on how many units are mixed vs. not mixed, since, as you said, we do both weak and strong for mixed pairs.

I'm not sure how the CTUs work but if they're mostly contiguous, we could also load 4 deblock units at a time and fill an entire YMM register.

You can think of a CTU as a 32x32, 64x64, or 128x128 memory block. It may be shorter than 32/64/128 in the last row or column. I can make the change to pass all pixels in a CTU row/col to the asm code if you agree: https://github.com/ffvvc/FFmpeg/blob/deblock_asm/libavcodec/vvc/filter.c#L813

I'm not super familiar with the code, but that works for me.

BTW: AVX2 shipped in 2013, so most PCs already have it. How about we support AVX2 only? It would make our code simpler.

That makes sense to me, maybe it'd be good to think about future avx512 support as well, though not necessarily actually doing it.

nuomi2021 commented 1 month ago

Sorry for the late reply, work has been crazy

It's a good thing—it means your job is secure!

That makes sense to me, maybe it'd be good to think about future avx512 support as well, though not necessarily actually doing it.

AVX-512 is not widely adopted. The latest Intel consumer platforms do not support it: https://en.wikipedia.org/wiki/AVX-512#endnote_adl-avx512-note

stone-d-chen commented 1 month ago

Sorry for the late reply, work has been crazy

It's a good thing—it means your job is secure!

True! 😂

That makes sense to me, maybe it'd be good to think about future avx512 support as well, though not necessarily actually doing it.

AVX-512 is not widely adopted. The latest Intel consumer platforms do not support it: https://en.wikipedia.org/wiki/AVX-512#endnote_adl-avx512-note

Oh interesting, I didn't know that. I've only been using AMD CPUs lately.