AVX2 implementation of DMVR SAD for VVC

stone-d-chen commented 2 months ago

Adds AVX2 assembly for SAD used in DMVR (decoder-side motion vector refinement). The main difference is that in VVC, SAD is only calculated on even rows of the PU to reduce complexity. Implements SAD via min/max/sub for 16bit values.

DMVR is restricted to PUs whose width >= 8, height >=8 and width * height >= 128 (ie 8x8 is not a valid size).

Based on https://github.com/ffvvc/FFmpeg/pull/213#issue-2246959529 but on top of ffmpeg/master

nuomi2021 commented 2 months ago

@QSXW please help review. thank you

nuomi2021 commented 2 months ago

better split to 3 commits

refact the current code.
add avx2
provide check asm and benchmark data. thank you

stone-d-chen commented 2 months ago

Hi @nuomi2021 and @QSXW

Thanks for the feedback! I've forced pushed my changes addressing the feedback.

better split to 3 commits

refact the current code. add avx2 provide check asm and benchmark data. thank you

I split into the 3 commits suggested and attached the benchmark data to the checkasm commit message.

@QSXW I've addressed all the formatting changes and commented above with the new lines where I've addressed the macro changes. I've additionally added some %define's like vvc_mc.asm to clean to improve the clarity of the "magic" numbers.

I've done some changes vvc_sad.c to remove mixed declaration and code warnings (no warning anymore).

nuomi2021 commented 2 months ago

@stone-d-chen , please also cherry pick https://github.com/ffvvc/FFmpeg/tree/workflow to make sure all conformance test passed like https://github.com/ffvvc/FFmpeg/actions/runs/8766181658/job/24058026459?pr=216

stone-d-chen commented 2 months ago

please also cherry pick https://github.com/ffvvc/FFmpeg/tree/workflow to make sure all conformance test passed like https://github.com/ffvvc/FFmpeg/actions/runs/8766181658/job/24058026459?pr=216

I've cherry-picked and pushed the commit

nuomi2021 commented 2 months ago

Before: BQTerrace_1920x1080_60_10_420_22_RA.vvc | 80.7 | Chimera_8bit_1080P_1000_frames.vvc | 158.0 | NovosobornayaSquare_1920x1080.bin | 159.7 | RitualDance_1920x1080_60_10_420_37_RA.266 | 146.3 |

After: BQTerrace_1920x1080_60_10_420_22_RA.vvc | 82.7 | Chimera_8bit_1080P_1000_frames.vvc | 167.0 | NovosobornayaSquare_1920x1080.bin | 166.3 | RitualDance_1920x1080_60_10_420_37_RA.266 | 154.0 |

👍, please move this to the second commit(the asm code one)

stone-d-chen commented 2 months ago

Hi @nuomi2021,

I've pushed the suggested change.

I've also rebased on the latest ffvvc/up and tested conformance on the latest tests and the failing test passes now locally.

    dpb_max_num_reorder_pics.266
    sintel_120.266
    tiles_720p5994_stockholm_ter.266

total = 315, passed = 315, failed = 0, skipped = 0

stone-d-chen commented 1 month ago

Hi @QSXW

I received a question on the ML about why not switching on block width inside the asm. I pushed a recent change that adds vvc_sad that merges together vvc_sad_8, vvc_sad_16, vvc_sad_32_128 and jumps based on the block width.

Let me know what you think and I will clean up this branch and then resubmit to the ML.

QSXW commented 1 month ago

Hi @QSXW

I received a question on the ML about why not switching on block width inside the asm. I pushed a recent change that adds vvc_sad that merges together vvc_sad_8, vvc_sad_16, vvc_sad_32_128 and jumps based on the block width.

Let me know what you think and I will clean up this branch and then resubmit to the ML.

Sure. That's what dav1d did. It can be changed into one function and jump based on the block width.

stone-d-chen commented 1 month ago

I've pushed the new version with a single function and merging 16_128 into one code path within. I'll send the updated version to the ML soon-ish

QSXW commented 1 month ago

Should we also add . to the label vvc_sad_8 and vvc_sad_16 like .loop_width?

stone-d-chen commented 1 month ago

Should we also add . to the label vvc_sad_8 and vvc_sad_16 like .loop_width?

The . is syntax for a local label so it feels cleaner to have vvc_sad_8, etc not have the . in front, they will be global labels and .loop_width is the local label within.

https://www.nasm.us/doc/nasmdoc3.html#section-3.9

QSXW commented 1 month ago

Should we also add . to the label vvc_sad_8 and vvc_sad_16 like .loop_width?

The . is syntax for a local label so it feels cleaner to have vvc_sad_8, etc not have the . in front, they will be global labels and .loop_width is the local label within.

https://www.nasm.us/doc/nasmdoc3.html#section-3.9

Great. Please send the v2 patch to the maillist.

stone-d-chen commented 1 month ago

Should we also add . to the label vvc_sad_8 and vvc_sad_16 like .loop_width?

The . is syntax for a local label so it feels cleaner to have vvc_sad_8, etc not have the . in front, they will be global labels and .loop_width is the local label within. https://www.nasm.us/doc/nasmdoc3.html#section-3.9

Great. Please send the v2 patch to the maillist.

Sounds good, just sent.

QSXW commented 1 month ago

Should we also add . to the label vvc_sad_8 and vvc_sad_16 like .loop_width?

The . is syntax for a local label so it feels cleaner to have vvc_sad_8, etc not have the . in front, they will be global labels and .loop_width is the local label within.

https://www.nasm.us/doc/nasmdoc3.html#section-3.9

I just realized that if we define a function in C like void vvc_sad_8, will it conflict with our label? Can you help confirm that?

stone-d-chen commented 1 month ago

Should we also add . to the label vvc_sad_8 and vvc_sad_16 like .loop_width?

The . is syntax for a local label so it feels cleaner to have vvc_sad_8, etc not have the . in front, they will be global labels and .loop_width is the local label within. https://www.nasm.us/doc/nasmdoc3.html#section-3.9

I just realized that if we define a function in C like void vvc_sad_8, will it conflict with our label? Can you help confirm that?

Seems okay, the labels seems to be stripped from the symbol table and only the function entry point is defined

vvc_sad.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 libavcodec/x86/vvc/vvc_sad.asm
0000000000000000 l    d  .note.GNU-stack        0000000000000000 .note.GNU-stack
0000000000000000 l    d  .note.gnu.property     0000000000000000 .note.gnu.property
0000000000000000 l    d  .text  0000000000000000 .text
0000000000000000 g     F .text  0000000000000000 .hidden ff_vvc_sad_avx2

QSXW commented 1 month ago

Should we also add . to the label vvc_sad_8 and vvc_sad_16 like .loop_width?

The . is syntax for a local label so it feels cleaner to have vvc_sad_8, etc not have the . in front, they will be global labels and .loop_width is the local label within. https://www.nasm.us/doc/nasmdoc3.html#section-3.9

I just realized that if we define a function in C like void vvc_sad_8, will it conflict with our label? Can you help confirm that?

Seems okay, the labels seems to be stripped from the symbol table and only the function entry point is defined
vvc_sad.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 libavcodec/x86/vvc/vvc_sad.asm
0000000000000000 l    d  .note.GNU-stack        0000000000000000 .note.GNU-stack
0000000000000000 l    d  .note.gnu.property     0000000000000000 .note.gnu.property
0000000000000000 l    d  .text  0000000000000000 .text
0000000000000000 g     F .text  0000000000000000 .hidden ff_vvc_sad_avx2

Okay. Seemed that the compiler would help us strip them. How about the ./configure --disable-stripping --disable-optimizations --enable-debug?

stone-d-chen commented 1 month ago

Should we also add . to the label vvc_sad_8 and vvc_sad_16 like .loop_width?

The . is syntax for a local label so it feels cleaner to have vvc_sad_8, etc not have the . in front, they will be global labels and .loop_width is the local label within. https://www.nasm.us/doc/nasmdoc3.html#section-3.9

I just realized that if we define a function in C like void vvc_sad_8, will it conflict with our label? Can you help confirm that?

Seems okay, the labels seems to be stripped from the symbol table and only the function entry point is defined
vvc_sad.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 libavcodec/x86/vvc/vvc_sad.asm
0000000000000000 l    d  .note.GNU-stack        0000000000000000 .note.GNU-stack
0000000000000000 l    d  .note.gnu.property     0000000000000000 .note.gnu.property
0000000000000000 l    d  .text  0000000000000000 .text
0000000000000000 g     F .text  0000000000000000 .hidden ff_vvc_sad_avx2
Okay. Seemed that the compiler would help us strip them. How about the ./configure --disable-stripping --disable-optimizations --enable-debug?

Same objdump

stone@fedora:~/repos/ffvvc_stone$ objdump --syms libavcodec/x86/vvc/vvc_sad.o

libavcodec/x86/vvc/vvc_sad.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 libavcodec/x86/vvc/vvc_sad.asm
0000000000000000 l    d  .note.GNU-stack        0000000000000000 .note.GNU-stack
0000000000000000 l    d  .note.gnu.property     0000000000000000 .note.gnu.property
0000000000000000 l    d  .text  0000000000000000 .text
0000000000000000 g     F .text  0000000000000000 .hidden ff_vvc_sad_avx2

QSXW commented 1 month ago

Should we also add . to the label vvc_sad_8 and vvc_sad_16 like .loop_width?

The . is syntax for a local label so it feels cleaner to have vvc_sad_8, etc not have the . in front, they will be global labels and .loop_width is the local label within. https://www.nasm.us/doc/nasmdoc3.html#section-3.9

I just realized that if we define a function in C like void vvc_sad_8, will it conflict with our label? Can you help confirm that?

Seems okay, the labels seems to be stripped from the symbol table and only the function entry point is defined
vvc_sad.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 libavcodec/x86/vvc/vvc_sad.asm
0000000000000000 l    d  .note.GNU-stack        0000000000000000 .note.GNU-stack
0000000000000000 l    d  .note.gnu.property     0000000000000000 .note.gnu.property
0000000000000000 l    d  .text  0000000000000000 .text
0000000000000000 g     F .text  0000000000000000 .hidden ff_vvc_sad_avx2
Okay. Seemed that the compiler would help us strip them. How about the ./configure --disable-stripping --disable-optimizations --enable-debug?
Same objdump
stone@fedora:~/repos/ffvvc_stone$ objdump --syms libavcodec/x86/vvc/vvc_sad.o

libavcodec/x86/vvc/vvc_sad.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 libavcodec/x86/vvc/vvc_sad.asm
0000000000000000 l    d  .note.GNU-stack        0000000000000000 .note.GNU-stack
0000000000000000 l    d  .note.gnu.property     0000000000000000 .note.gnu.property
0000000000000000 l    d  .text  0000000000000000 .text
0000000000000000 g     F .text  0000000000000000 .hidden ff_vvc_sad_avx2

Great. That should be no issue here. Thanks for the confirmation.

stone-d-chen commented 1 month ago

hmm saw the checkasm segfaults on windows .. will need to investigate

QSXW commented 1 month ago

hmm saw the checkasm segfaults on windows .. will need to investigate

You can use the VS to debug the checkasm. If the function crashes, it will navigate you to the assembler to check which instruction triggers the exception.

stone-d-chen commented 1 month ago

hmm saw the checkasm segfaults on windows .. will need to investigate

You can use the VS to debug the checkasm. If the function crashes, it will navigate you to the assembler to check which instruction triggers the exception.

So the segfault only seems to occur when doing "call_new" calling the function manually doesn't result in a segfault... someone on IRC suggested

it could be something like using a 64bit register where the function prototype only has int32. calling manually may zero-extend or sign-extend the input, whereas call_new() intentionally puts garbage in any registers that aren't guaranteed to be meaningful.

so I'll investigate along these lines

stone-d-chen commented 1 month ago

hmm saw the checkasm segfaults on windows .. will need to investigate

You can use the VS to debug the checkasm. If the function crashes, it will navigate you to the assembler to check which instruction triggers the exception.

So the segfault only seems to occur when doing "call_new" calling the function manually doesn't result in a segfault... someone on IRC suggested

it could be something like using a 64bit register where the function prototype only has int32. calling manually may zero-extend or sign-extend the input, whereas call_new() intentionally puts garbage in any registers that aren't guaranteed to be meaningful.

so I'll investigate along these lines

Turns out this is the case, it seems like the offset calculation requires a movsxd since the dx/dy are ints but I'm treating them as quads to do the pointer calculation

stone-d-chen commented 1 month ago

Pushed the changes, confirmed working on windows, will send to ML soon.

nuomi2021 commented 1 month ago

@stone-d-chen , we have merged the alf asm. remember rebase before you send to upstream.

thank you

stone-d-chen commented 1 month ago

@stone-d-chen , we have merged the alf asm. remember rebase before you send to upstream.

thank you

Hi @nuomi2021, yeah the current version here and on ML is rebased against the latest ffvvc:up so it contains the alf asm.

stone-d-chen commented 1 month ago

Hi all,

Pushed another version with Ronald's suggested change here and sent to the ML. I might have messed up the send-email, let me know if I need to re-send it.

Thanks, Stone

QSXW commented 1 month ago

Hi all,

Pushed another version with Ronald's suggested change here and sent to the ML. I might have messed up the send-email, let me know if I need to re-send it.

Thanks, Stone

No worries. It's very common to resend patches again and again for different comments.

nuomi2021 commented 1 month ago

Hi all,

Pushed another version with Ronald's suggested change here and sent to the ML. I might have messed up the send-email, let me know if I need to re-send it.

Thanks, Stone

No worries. This is one reason why open source code quality is often better than closed source.

stone-d-chen commented 1 month ago

Latest version (v5) with Ronald's additional feedback pushed and sent to ML. One thing of note was changing the function signature to have intptr_t dx, dy which removed the need for movsxd. I confirmed this passes checkasm on windows as well.

nuomi2021 commented 1 month ago

@stone-d-chen , don't get frustrated. Every new version helps to polish your skills.

stone-d-chen commented 1 month ago

Oh yeah, I don't mind :) definitely learning a lot with each revision! Will send a new version with latest feedback soon-ish

stone-d-chen commented 1 month ago

https://patchwork.ffmpeg.org/project/ffmpeg/list/?series=11863

ffvvc / FFmpeg

AVX2 implementation of DMVR SAD for VVC #215