Open frankplow opened 1 year ago
629db286ee2ad9225b7f204677ffd48524cff924 disables the optimisations when sps_extended_precision_flag
is set. A new set of functions will need to be written in order to support transform coefficients larger than 16 bits.
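The guard described above can be pictured as a check made when the function pointers are set up. This is only a minimal sketch, not the code from the commit: the context struct, the function names, the simplified signature and the `extended_precision` argument are all hypothetical stand-ins, and the kernel bodies are empty stubs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 1D inverse-transform signature; not the real FFVVC prototype. */
typedef void (*inv_dct2_fn)(int *coeffs, ptrdiff_t stride);

/* Empty stubs standing in for the C reference and the 16-bit AVX2 kernel. */
static void inv_dct2_4_c(int *coeffs, ptrdiff_t stride)    { (void)coeffs; (void)stride; }
static void inv_dct2_4_avx2(int *coeffs, ptrdiff_t stride) { (void)coeffs; (void)stride; }

/* Hypothetical DSP context holding a single function pointer. */
typedef struct SketchDSPContext {
    inv_dct2_fn inv_dct2_4;
} SketchDSPContext;

static void sketch_dsp_init(SketchDSPContext *c, int have_avx2, int extended_precision)
{
    assert(c != NULL);

    /* The C version is always a valid choice. */
    c->inv_dct2_4 = inv_dct2_4_c;

    /* Only pick the SIMD kernel when coefficients are known to fit in 16 bits;
     * with extended precision enabled they may not, so stay on the C path. */
    if (have_avx2 && !extended_precision)
        c->inv_dct2_4 = inv_dct2_4_avx2;
}
```

With this shape, a stream that sets the flag simply never sees the AVX2 pointers, and wider kernels could be slotted into the same branch later.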
Did you check the HEVC IDCT checkasm output? Is it in line with your results? Thank you.
vvc_inv_dct2_2_c: 16.0
vvc_inv_dct2_2_avx2: 16.2
vvc_inv_dct2_4_c: 19.7
vvc_inv_dct2_4_avx2: 17.0
vvc_inv_dct2_8_c: 39.7
vvc_inv_dct2_8_avx2: 29.2
vvc_inv_dct2_16_c: 132.7
vvc_inv_dct2_16_avx2: 54.5
vvc_inv_dct2_32_c: 379.0
vvc_inv_dct2_32_avx2: 115.7
vvc_inv_dct2_64_c: 1527.7
vvc_inv_dct2_64_avx2: 385.0
Here are the relevant entries from the HEVC IDCT checkasm benchmark:
hevc_idct_4x4_8_c: 141.5
hevc_idct_4x4_8_avx: 44.7
hevc_idct_4x4_10_c: 133.2
hevc_idct_4x4_10_avx: 43.5
hevc_idct_8x8_8_c: 870.7
hevc_idct_8x8_8_avx: 134.2
hevc_idct_8x8_10_c: 879.2
hevc_idct_8x8_10_avx: 137.2
hevc_idct_16x16_8_c: 5861.0
hevc_idct_16x16_8_avx: 696.2
hevc_idct_16x16_10_c: 5835.5
hevc_idct_16x16_10_avx: 695.5
hevc_idct_32x32_8_c: 47877.5
hevc_idct_32x32_8_avx: 3863.0
hevc_idct_32x32_10_c: 47965.5
hevc_idct_32x32_10_avx: 3856.2
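To compare the two sets of numbers it helps to look at C-to-SIMD ratios rather than raw timings, since the VVC entries time a single 1D transform and the HEVC entries time a whole 2D block. A small sketch of that arithmetic, using only figures quoted in this thread (8-bit HEVC rows only); the struct and helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One benchmark pair: C reference time and SIMD time, as reported by checkasm. */
typedef struct BenchRow {
    const char *name;
    double c_time;
    double simd_time;
} BenchRow;

static const BenchRow vvc_rows[] = {
    { "vvc_inv_dct2_2",    16.0,  16.2 },
    { "vvc_inv_dct2_4",    19.7,  17.0 },
    { "vvc_inv_dct2_8",    39.7,  29.2 },
    { "vvc_inv_dct2_16",  132.7,  54.5 },
    { "vvc_inv_dct2_32",  379.0, 115.7 },
    { "vvc_inv_dct2_64", 1527.7, 385.0 },
};

static const BenchRow hevc_rows[] = {
    { "hevc_idct_4x4_8",     141.5,   44.7 },
    { "hevc_idct_8x8_8",     870.7,  134.2 },
    { "hevc_idct_16x16_8",  5861.0,  696.2 },
    { "hevc_idct_32x32_8", 47877.5, 3863.0 },
};

/* Speedup factor of the SIMD version over the C version. */
static double speedup(const BenchRow *r)
{
    assert(r != NULL && r->simd_time > 0.0);
    return r->c_time / r->simd_time;
}

/* Print one "name: factor" line per row. */
static void print_speedups(const BenchRow *rows, size_t n)
{
    for (size_t i = 0; i < n; i++)
        printf("%-20s %.2fx\n", rows[i].name, speedup(&rows[i]));
}
```

Calling `print_speedups(vvc_rows, 6)` and `print_speedups(hevc_rows, 4)` lists the factors side by side; the largest VVC size comes out at roughly 4x and the largest HEVC size at roughly 12x.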
Note that the HEVC optimisations are performed at the 2D level rather than the 1D level. Many of the instructions in the SIMD optimisations are spent loading data into and extracting data from the SIMD registers. This is all the more true for FFVVC due to the strides in the IDCT function signature. The FFVVC IDCT can be optimised at the 2D level in the future to get performance gains closer to HEVC's, but for now the 1D optimisations work alone and they provide the backbone needed for any future optimisation.
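To make the 1D versus 2D distinction concrete, here is a rough scalar sketch of a 2D inverse transform built from strided 1D passes. It is not the FFVVC code: the function names and signature are hypothetical, the intermediate rounding shifts and clipping are left out, and the 4-point integer DCT-II basis is written from memory of the HEVC/VVC tables.

```c
#include <assert.h>
#include <stddef.h>

/* 4-point integer DCT-II basis; row k is basis function k. */
static const int dct2_4[4][4] = {
    { 64,  64,  64,  64 },
    { 83,  36, -36, -83 },
    { 64, -64, -64,  64 },
    { 36, -83,  83, -36 },
};

/* In-place 1D inverse transform over 4 values spaced `stride` ints apart. */
static void inv_dct2_4_1d(int *coeffs, ptrdiff_t stride)
{
    int in[4];

    assert(coeffs != NULL && stride != 0);

    /* Gather: a SIMD version pays for this strided load on every call. */
    for (int k = 0; k < 4; k++)
        in[k] = coeffs[k * stride];

    for (int n = 0; n < 4; n++) {
        int sum = 0;
        for (int k = 0; k < 4; k++)
            sum += dct2_4[k][n] * in[k];
        /* Scatter: likewise, a strided store on every call. */
        coeffs[n * stride] = sum;
    }
}

/* 2D inverse transform of a row-major 4x4 block as eight 1D calls. */
static void inv_dct2_4x4(int block[16])
{
    for (int i = 0; i < 4; i++)
        inv_dct2_4_1d(block + i, 4);      /* column pass, stride 4 */
    for (int i = 0; i < 4; i++)
        inv_dct2_4_1d(block + 4 * i, 1);  /* row pass, stride 1 */
}
```

Each of the eight 1D calls gathers and scatters its own data. A 2D-level kernel could instead load the block once, keep it in registers across both passes and store it once, which is where the extra gains would be expected to come from.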
How about dav1d? Does it have similar 1D functions, or 2D only?
dav1d uses 2D and then some, incorporating some of the vectorisation as well to save a transpose operation. According to this lecture, this allowed them to double performance compared to only 1D SIMD optimisations. It's worth noting that doing these higher-level optimisations comes at a cost in terms of complexity though. dav1d has over 10,000 lines of inverse transform assembly for AVX2 alone!
It was worth it. dav1d is the fastest decoder we have seen so far, and the current VVC transform functions cost 10% of CPU time for some files. Is it possible to just use their code directly? The DCT functions are similar. We may only need to change some parameters (asm tables).
I will look into this. I don't think it will be quite this simple: some internal data representations in FFVVC will need to be changed, as dav1d relies on packed input data, but it looks like there is only one place where non-packed transform coefficients are actually used in FFVVC.
👍, we can start with a 2x2 or 4x4 block: zero the entire block and set the first coeff to 1.
Re "some internal data representations in FFVVC will need to be changed": no problem. You can make any reasonable change.
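The impulse test suggested above could look something like the following. It is a sketch only, reduced to a 1D 4-point transform for brevity: it compares a plain matrix-multiply reference against an even/odd butterfly version, which stands in for whatever ported kernel is under test. All names are hypothetical, rounding shifts and clipping are omitted, and the basis constants are written from memory of the HEVC/VVC tables.

```c
#include <assert.h>
#include <string.h>

/* Reference: direct matrix multiply with the 4-point integer DCT-II basis. */
static void idct4_ref(const int in[4], int out[4])
{
    static const int m[4][4] = {
        { 64,  64,  64,  64 },
        { 83,  36, -36, -83 },
        { 64, -64, -64,  64 },
        { 36, -83,  83, -36 },
    };
    for (int n = 0; n < 4; n++) {
        out[n] = 0;
        for (int k = 0; k < 4; k++)
            out[n] += m[k][n] * in[k];
    }
}

/* Candidate: even/odd butterfly, standing in for the kernel under test. */
static void idct4_butterfly(const int in[4], int out[4])
{
    const int e0 = 64 * (in[0] + in[2]);
    const int e1 = 64 * (in[0] - in[2]);
    const int o0 = 83 * in[1] + 36 * in[3];
    const int o1 = 36 * in[1] - 83 * in[3];

    out[0] = e0 + o0;
    out[1] = e1 + o1;
    out[2] = e1 - o1;
    out[3] = e0 - o0;
}

/* Returns 1 when both versions agree on an all-zero input with a 1 at `pos`. */
static int impulse_matches(int pos)
{
    int in[4], ref[4], opt[4];

    assert(pos >= 0 && pos < 4);
    memset(in, 0, sizeof(in));
    in[pos] = 1;

    idct4_ref(in, ref);
    idct4_butterfly(in, opt);
    return memcmp(ref, opt, sizeof(ref)) == 0;
}
```

Starting with `impulse_matches(0)` exercises only the DC basis function; stepping `pos` through the remaining positions then checks one basis function at a time, which makes a wrong table entry or sign easy to localise.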
Rebase and re-target onto main.
Reset to 6105322ca9a2e4e5ce7e33505427f68c2b88dbd7. Work done porting FFmpeg HEVC ASM can now be found at frankplow:ffmpeg-hevc-idct /#130. This has been done as there is little in common between the two trees.
This PR adds AVX2 optimisations for the type-II DCT. For now, these optimisations are only implemented at the 1D level.
Performance results: