ffvvc / FFmpeg

VVC Decoder for ffmpeg

FFmpeg HEVC IDCT port #130

Open frankplow opened 1 year ago

frankplow commented 1 year ago

This PR ports FFmpeg's HEVC IDCT optimisations.

To-do:

Checkasm benchmark results:

inv_dct2_dct2_4x4_8_c: 226.2
inv_dct2_dct2_4x4_8_avx2: 26.2
inv_dct2_dct2_4x4_10_c: 188.2
inv_dct2_dct2_4x4_10_avx2: 24.7
inv_dct2_dct2_8x8_8_c: 704.7
inv_dct2_dct2_8x8_8_avx2: 124.7
inv_dct2_dct2_8x8_10_c: 751.2
inv_dct2_dct2_8x8_10_avx2: 124.7
inv_dct2_dct2_16x16_8_c: 4289.7
inv_dct2_dct2_16x16_8_avx2: 621.2
inv_dct2_dct2_16x16_10_c: 4335.2
inv_dct2_dct2_16x16_10_avx2: 625.2
perf.py results:

| Bitstream | Before | After | Delta |
| --- | --- | --- | --- |
| RitualDance_1920x1080_60_10_420_32_LD | 99.7 | 99.3 | -0.4% |
| RitualDance_1920x1080_60_10_420_37_RA | 88.3 | 87.7 | -0.5% |
| Tango2_3840x2160_60_10_420_27_LD | 23.0 | 23.0 | 0.0% |

The perf.py results are currently poor because the DCT's effect on overall decoding performance is dominated by the larger transform sizes, which have not yet been implemented. The slight decrease in performance is explained by the additional overhead of optimising at the 2D level, the benefits of which are not being reaped here. As the larger sizes are implemented, performance should increase substantially, in line with the checkasm benchmark results.
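For readers unfamiliar with what "optimising at the 2D level" refers to, here is a minimal scalar sketch of a separable 4x4 inverse DCT-2: one 1D pass over the columns, then one over the rows. It is not the code in this PR. The basis matrix is the 4-point DCT-2 matrix shared by HEVC and VVC; the function names, the `int32_t` buffers, and the shifts (7 after the first pass, `20 - bit_depth` after the second, as in the HEVC-style transforms without extended precision) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 4-point DCT-2 basis (each row is one basis function). */
static const int8_t dct2_4[4][4] = {
    { 64,  64,  64,  64 },
    { 83,  36, -36, -83 },
    { 64, -64, -64,  64 },
    { 36, -83,  83, -36 },
};

static int32_t clip_int16(int32_t v)
{
    return v < INT16_MIN ? INT16_MIN : v > INT16_MAX ? INT16_MAX : v;
}

/* One 1D inverse pass over 4 samples spaced `stride` apart:
 * dst[n] = (sum_k dct2_4[k][n] * src[k] + round) >> shift.
 * src and dst must not alias. Relies on arithmetic right shift of
 * negative values, which mainstream compilers provide. */
static void inv_dct2_4_1d(const int32_t *src, int32_t *dst,
                          int stride, int shift, int clip)
{
    const int32_t round = 1 << (shift - 1);
    for (int n = 0; n < 4; n++) {
        int32_t sum = 0;
        for (int k = 0; k < 4; k++)
            sum += dct2_4[k][n] * src[k * stride];
        sum = (sum + round) >> shift;
        dst[n * stride] = clip ? clip_int16(sum) : sum;
    }
}

/* Hypothetical 2D wrapper: column pass, then row pass. */
static void inv_dct2_4x4(const int32_t coeffs[16], int32_t out[16],
                         int bit_depth)
{
    int32_t tmp[16];
    assert(bit_depth >= 8 && bit_depth <= 12);

    /* Pass 1: transform each column; intermediate clipped to 16 bits. */
    for (int x = 0; x < 4; x++)
        inv_dct2_4_1d(coeffs + x, tmp + x, 4, 7, 1);

    /* Pass 2: transform each row and scale to the output bit depth. */
    for (int y = 0; y < 4; y++)
        inv_dct2_4_1d(tmp + 4 * y, out + 4 * y, 1, 20 - bit_depth, 0);
}
```

A SIMD kernel that handles both passes in one call can keep the intermediate block in registers, but it also carries the fixed setup cost for the whole block, which is one plausible reading of the overhead described above.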

nuomi2021 commented 10 months ago

hi @frankplow , seems the int32_t is only needed by range extension. If range extension is not enabled, we can keep the transform coeffs as int16_t. I will try to make some changes to this. Hope this will reduce the porting efforts.

frankplow commented 9 months ago

> hi @frankplow , seems the int32_t is only needed by range extension. If range extension is not enabled, we can keep the transform coeffs as int16_t. I will try to make some changes to this. Hope this will reduce the porting efforts.

Yeah, I think if we take this approach, it shouldn't be too hard to get transforms implemented for the square sizes. Unfortunately, I think it will be hard to extend the HEVC optimisations to rectangular sizes and MTS, as the way they are written doesn't facilitate much code reuse or modularity. I have a branch where I've worked on a more modular optimisation, based on some of the custom ABI ideas dav1d uses, but I'm having to write this from the ground up and don't have much time alongside my Master's at the moment. So I think the best way to get optimisations in for the most common square sizes is, as you say, to allow the coeff type to vary based on whether the range extension is active and then port the HEVC transforms.
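The coefficient-type idea discussed above can be sketched as follows. This is an illustration, not code from ffvvc: the function names are hypothetical, and the dynamic-range rule (15 bits without extended precision, otherwise `max(15, min(20, bit_depth + 6))`) is my reading of the VVC range extension and should be checked against the specification before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed rule: coefficients lie in
 * [-(1 << range), (1 << range) - 1], where range is 15 unless
 * extended precision (a range-extension tool) is enabled. */
static int log2_transform_range(int extended_precision, int bit_depth)
{
    assert(bit_depth >= 8 && bit_depth <= 16);
    if (!extended_precision)
        return 15;
    int range = bit_depth + 6;
    if (range > 20)
        range = 20;
    if (range < 15)
        range = 15;
    return range;
}

/* Hypothetical helper: bytes needed per transform coefficient.
 * A 15-bit range fits int16_t, so the HEVC-style 16-bit kernels
 * could be reused; anything wider needs int32_t. */
static size_t coeff_size(int extended_precision, int bit_depth)
{
    return log2_transform_range(extended_precision, bit_depth) <= 15
               ? sizeof(int16_t)
               : sizeof(int32_t);
}
```

With a helper like this, a decoder could choose its coefficient buffers and DSP function pointers once per sequence, so that the common non-extended case never pays for 32-bit coefficients.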

nuomi2021 commented 9 months ago

> but I'm having to write this from the ground up and don't have much time alongside my Master's at the moment

No worries. I will continue your work after I have done the thread optimizations.