Open nuomi2021 opened 1 year ago
Can you explain this issue, please? I want to work on this issue as a qualification task for GSOC 2023
@Anant-2005, thank you for your interest. If you check lmcs_filter_luma, it's very simple. Using the VPGATHERDD instruction, we can gather 8 pixels at a time, which may speed up the process.
@nuomi2021 I was asking whether I have to create an asm file or embed inline assembly. My second question is which assembler we use, because I found a lot of assemblers available when I researched it.
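For readers who have not opened the function, the loop being discussed is essentially a per-pixel table lookup. The following is a minimal scalar sketch of that idea, not the FFmpeg source: the name `lmcs_luma_scalar`, its signature, and the `lut_size` bounds check are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: remap every luma sample through a lookup
 * table. Names and signature are illustrative, not FFmpeg's. */
void lmcs_luma_scalar(uint16_t *dst, ptrdiff_t stride_px,
                      int width, int height,
                      const uint16_t *lut, size_t lut_size)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* Each sample is used as an index into the table. */
            assert((size_t)dst[x] < lut_size);
            dst[x] = lut[dst[x]];
        }
        dst += stride_px; /* stride counted in pixels, not bytes */
    }
}
```

Because every output depends only on one input sample and one table entry, the inner loop is a natural candidate for a gather instruction that performs several lookups at once.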
Hiya,
Is anyone currently working on this? If not could I try it?
Hi @stone-d-chen, thank you for your interest. Is this for GSoC, or do you just want to help with the project? If it's for GSoC, please start work on inter or SAO for ARM; you can check the upstream C and x86 asm code to see how to do it on ARM. If you want to help the project, please go ahead and do what interests you most.
thank you
Hi @nuomi2021
Oh has the non-arm vvc project been taken? I've already met the qualification requirement (patch accepted for ffmpeg) but I wanted to try this as well.
I will attempt this regardless, more asking so I can plan my time re: setting up an arm dev environment, etc.
Thanks!
> I've already met the qualification requirement (patch accepted for ffmpeg) but I wanted to try this as well.
Good to know.
> Oh has the non-arm vvc project been taken?
No, but maybe you can choose a tough one (one that will be used by many phones/Macs).
> I will attempt this regardless
👍
> No, but maybe you can choose a tough one
Fair enough 😂 I'll give it an attempt
Quick question re: VPGATHERDD: it only operates on int32, while the arrays (at least with the example video) are 16-bit. So I was thinking one way to do it would be to load the pixels, use punpcklwd with a register of 0s to pad them out to 32 bits, gather, and then shift off the garbage bits.
Rough outline:
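mova m1, [srcq] ; load 16-bit pixels
punpcklwd m1, m0 ; m0 zeroed beforehand (pxor m0, m0): zero-extend words to dwords
vpgatherdd m2, [lutq + m1 * 2], m4 ; m4 = gather mask; each lane loads 32 bits from the 16-bit LUT
vpslld m3, m2, 16 ; shift the neighbouring LUT entry (upper 16 bits) out
vpsrld m3, 16 ; shift back, leaving the wanted entry zero-extended
; final pack and write back
packssdw m0, m3, m4
mova [srcq], m0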
Mainly wondering if I'm missing a int16 version of vpgather
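To make the data flow of the outline above easier to follow, here is a portable C model of a single gather lane. It is a sketch under stated assumptions, not the draft implementation: the function names are hypothetical, the dword is assembled as a little-endian load would see it, and the table is assumed to carry one padding element, since a 32-bit load at the last 16-bit entry reads two bytes past it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One emulated gather lane (hypothetical helper). A dword gather with scale 2
 * loads lut[idx] in the low half and lut[idx + 1] in the high half on a
 * little-endian machine; that layout is modelled explicitly here. */
uint16_t gather_lane_u16(const uint16_t *lut_padded, uint16_t idx)
{
    uint32_t wide_idx = idx; /* zero-extend, like punpcklwd with zeros */
    uint32_t dword = (uint32_t)lut_padded[wide_idx]
                   | ((uint32_t)lut_padded[wide_idx + 1] << 16);
    dword <<= 16; /* push the neighbouring entry out of the register */
    dword >>= 16; /* logical shift back: only lut[idx] remains */
    return (uint16_t)dword;
}

/* Remap one row in place. lut_size is the number of valid entries; the
 * array itself must hold lut_size + 1 elements (assumed padding). */
void lmcs_row_gather_emulated(uint16_t *px, int width,
                              const uint16_t *lut_padded, size_t lut_size)
{
    for (int x = 0; x < width; x++) {
        assert((size_t)px[x] < lut_size);
        px[x] = gather_lane_u16(lut_padded, px[x]);
    }
}
```

The shift pair could equally be a mask with 0xFFFF or a byte shuffle; the point is only that the upper half of each gathered dword holds an unrelated table entry and has to be discarded before packing.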
@stone-d-chen Could you use PBLENDW?
Ignore me, I didn't realise lut was signalled rather than a constant. I don't think there's an equivalent to VPGATHERDD which acts on words in AVX2.
vvdec has an implementation, you can refer to it :)
> vvdec has an implementation, you can refer to it :)
Ah, it took me a bit to realize that vvdec is a different repo, haha. They use a shuffle instead of shifting.
Results from my rough draft show a speedup. I've only used 2 vpgatherdd instructions per loop so far to keep things simple. Adding another set would probably help, since according to Agner Fog's instruction tables vpgatherdd has a latency of 24 and a CPI of 5.
Next, I think I should probably generalize this. I assumed the width was 128, a pixel was 2 bytes, etc. I've been looking more into how the macro system works.
I was also wondering whether there is a more official way of comparing outputs. I've just eyeballed it so far, since any errors were very obvious. I saw there are some conformance tests.
Before
+ 4.48% 4.48% ffmpeg_g ffmpeg_g [.] lmcs_filter_luma_10
0.16% 0.16% ffmpeg_g ffmpeg_g [.] lmcs_scale_chroma_10
0.00% 0.00% ffmpeg_g ffmpeg_g [.] run_lmcs
0.00% 0.00% ffmpeg_g ffmpeg_g [.] ff_vvc_lmcs_filter
After
+ 1.26% 1.26% ffmpeg_g ffmpeg_g [.] lmcs_filter_luma_10
0.15% 0.15% ffmpeg_g ffmpeg_g [.] lmcs_scale_chroma_10
0.01% 0.01% ffmpeg_g ffmpeg_g [.] run_lmcs
0.01% 0.01% ffmpeg_g ffmpeg_g [.] ff_vvc_lmcs_filter
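On the question of comparing outputs more formally than by eye: the usual approach is to feed the same input to the reference and the optimized function and compare the results byte for byte. FFmpeg's checkasm does this far more thoroughly; the sketch below only illustrates the principle. All names here (`remap_fn`, `outputs_match`, the three remap functions) are hypothetical and not part of FFmpeg or checkasm.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef void (*remap_fn)(uint16_t *px, int n, const uint16_t *lut);

/* Plain reference loop. */
void remap_ref(uint16_t *px, int n, const uint16_t *lut)
{
    for (int i = 0; i < n; i++)
        px[i] = lut[px[i]];
}

/* Stand-in for an optimized version: two lookups per iteration (n even). */
void remap_by_two(uint16_t *px, int n, const uint16_t *lut)
{
    for (int i = 0; i + 1 < n; i += 2) {
        px[i]     = lut[px[i]];
        px[i + 1] = lut[px[i + 1]];
    }
}

/* Deliberately wrong candidate, to show that the check catches errors. */
void remap_off_by_one(uint16_t *px, int n, const uint16_t *lut)
{
    for (int i = 0; i < n; i++)
        px[i] = (uint16_t)(lut[px[i]] + 1);
}

/* Returns 1 if both functions agree on the same pseudo-random input. */
int outputs_match(remap_fn ref, remap_fn opt, const uint16_t *lut,
                  unsigned lut_size, uint32_t seed)
{
    enum { N = 256 };
    uint16_t a[N], b[N];
    assert(lut_size > 0);
    for (int i = 0; i < N; i++) {
        seed = seed * 1664525u + 1013904223u; /* small LCG */
        a[i] = b[i] = (uint16_t)((seed >> 16) % lut_size);
    }
    ref(a, N, lut);
    opt(b, N, lut);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling `outputs_match(remap_ref, remap_by_two, lut, lut_size, seed)` with several seeds gives a quick sanity check before running the full conformance suite.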
You are so fast. The width is not always 128; it can be 32 or 64. A pixel is not always 2 bytes; it can be 1 byte for 8 bpc. Better to start like this:
Quick update/Q's
I updated the 2-byte version to take multiple widths. However, I noticed some 8- and 16-pixel widths in predict_inter (printing cu->cb_width). Are these also possible widths?
New profiling numbers (I messed up the profiling originally): before ~= 4.29%, after ~= 3.11%. So maybe 35% faster.
I have a 1-byte version working and am going to start writing checkasm tests. Also, is there an 8-bit video I can test against as well?
Children Self Command Shared O Symbol
+ 4.29% 4.29% ffmpeg_g ffmpeg_g [.] lmcs_filter_luma_10
0.14% 0.14% ffmpeg_g ffmpeg_g [.] lmcs_scale_chroma_10
0.01% 0.01% ffmpeg_g ffmpeg_g [.] run_lmcs
0.00% 0.00% ffmpeg_g ffmpeg_g [.] ff_vvc_lmcs_filter
Children Self Command Shared O Symbol
+ 2.97% 2.97% ffmpeg_g ffmpeg_g [.] ff_lmcs_128_16bpc_avx2
0.14% 0.14% ffmpeg_g ffmpeg_g [.] lmcs_scale_chroma_10
0.14% 0.14% ffmpeg_g ffmpeg_g [.] lmcs_filter_luma_10
0.01% 0.01% ffmpeg_g ffmpeg_g [.] run_lmcs
0.01% 0.01% ffmpeg_g ffmpeg_g [.] ff_vvc_lmcs_filter
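The before/after shares quoted above (4.29% and 3.11%) are percentages of two different totals, so their raw ratio (about 1.38) slightly understates the change. A rough correction, assuming the rest of the decoder takes the same time in both runs, is to compare each share against the remainder of its own profile. The helper name below is hypothetical; with these numbers it gives roughly 1.4x, in the same ballpark as the estimate in the post.

```c
#include <assert.h>

/* share_before / share_after: fraction of all samples spent in the function
 * (0..1). Assumes everything else costs the same in both runs, so each share
 * is converted to "function time relative to the rest" before dividing. */
double speedup_from_shares(double share_before, double share_after)
{
    assert(share_before > 0.0 && share_before < 1.0);
    assert(share_after > 0.0 && share_after < 1.0);
    double before = share_before / (1.0 - share_before);
    double after  = share_after  / (1.0 - share_after);
    return before / after;
}
```

For small shares like these the correction is minor; it matters more when the function dominates the profile.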
In my fork I've created a pr with my current implementation. https://github.com/stone-d-chen/ffvvc/pull/1
// fails
IBC_E_Tencent_1.bit
CodingToolsSets_D_Tencent_2.bit
IBC_D_Tencent_2.bit
IBC_C_Tencent_2.bit
IBC_A_Tencent_2.bit
IBC_B_Tencent_2.bit
sintel_120.266
LOSSLESS_B_HHI_3.bit
10b444_A_Kwai_3.bit
10b444_B_Kwai_3.bit
@stone-d-chen Good progress. Please use the upstream version; it supports more conformance clips, especially for IBC.
Yep, that fixed the issue! All conformance tests pass now: https://github.com/stone-d-chen/ffvvc/pull/3
will check it next week, thank you @stone-d-chen
Sounds good no rush, @nuomi2021
Latest update:
https://github.com/stone-d-chen/ffvvc/pull/4
I'll probably pause on this for now until you take a look. I might start looking at the ARM instructions and/or the x86 deblocking AVX code; I've been spending some time reading how those algorithms are implemented.
@QSXW could you also help review https://github.com/stone-d-chen/ffvvc/pull/4? Thank you.
Based on this, LMCS consumes about 2.81% of the time for Tango2_3840x2160_60_10_420_27_LD.266, so maybe we can use VPGATHERDD to optimize it.
11.96% ffmpeg_g [.] put_vvc_luma_hv_10
5.88% ffmpeg_g [.] alf_get_coeff_and_clip_10
5.25% ffmpeg_g [.] ff_vvc_inv_dct2_64
4.30% [kernel] [k] lock_text_start
4.22% ffmpeg_g [.] ff_vvc_alf_filter_luma_w16_16bpc_avx2
3.46% ffmpeg_g [.] put_vvc_luma_bi_hv_10
3.45% ffmpeg_g [.] alf_filter_luma_vb_10
3.13% ffmpeg_g [.] vvc_loop_filter_luma_10
2.81% ffmpeg_g [.] lmcs_filter_luma_10
2.46% ffmpeg_g [.] put_vvc_luma_uni_hv_10
2.27% ffmpeg_g [.] put_vvc_chroma_hv_10
2.21% libc-2.31.so [.] 0x000000000018b733
2.05% libc-2.31.so [.] 0x000000000018bb41
1.95% ffmpeg_g [.] put_vvc_chroma_uni_hv_10
1.84% ffmpeg_g [.] put_vvc_chroma_bi_hv_10
1.81% ffmpeg_g [.] vvc_deblock_bs
1.41% ffmpeg_g [.] ff_vvc_predict_inter
1.25% libpthread-2.31.so [.] pthread_mutex_lock
1.24% libpthread-2.31.so [.] __pthread_mutex_unlock
1.22% ffmpeg_g [.] ff_vvc_residual_coding
1.08% ffmpeg_g [.] alf_filter_cc_10
1.03% ffmpeg_g [.] apply_prof_uni_10
0.99% ffmpeg_g [.] ff_vvc_alf_filter
0.98% ffmpeg_g [.] ff_vvc_inv_dct2_32
0.94% ffmpeg_g [.] vvc_deblock_bs_luma_vertical
0.92% ffmpeg_g [.] add_residual_10