ffvvc / FFmpeg

VVC Decoder for ffmpeg
Other
50 stars 12 forks source link

WIP: dav1d ITX AVX2 Assembly Port #117

Closed frankplow closed 1 year ago

frankplow commented 1 year ago

This PR will port dav1d's AVX2 inverse transform optimisations.

Jamaika1 commented 1 year ago
vvcdsp_init.c:264:33: error: assignment to expression with array type
  264 |     c->itx.itx[DCT2][0]         = ff_vvc_inv_dct2_2_##opt;                      \
      |                                 ^
vvcdsp_init.c:279:13: note: in expansion of macro 'IDCT2_INIT'
  279 |             IDCT2_INIT(avx2);
      |             ^~~~~~~~~~

Can add definition

if HAVE_AVX2_EXTERNAL

...

endif

frankplow commented 1 year ago

@Jamaika1

vvcdsp_init.c:264:33: error: assignment to expression with array type
  264 |     c->itx.itx[DCT2][0]         = ff_vvc_inv_dct2_2_##opt;                      \
      |                                 ^
vvcdsp_init.c:279:13: note: in expansion of macro 'IDCT2_INIT'
  279 |             IDCT2_INIT(avx2);
      |             ^~~~~~~~~~

Can add definition #if HAVE_AVX2_EXTERNAL ... #endif

I'm sorry I don't see where his code is from - there are currently no AVX2 optimisations implemented on this branch and it looks to me like the version of vvcdsp_init.c this is referencing is from my frankplow:idct-asm branch.

frankplow commented 1 year ago

Rebase and re-target onto new main, to reflect #116.

frankplow commented 1 year ago

@nuomi2021 Not urgent at the minute, but I thought I should make a comment about this before I forget it. Can you think of a better way to deal with the memory for transform blocks' transform coefficients and pixels?

After 36e417619cc69865cf705ca02ff388fce4bf9a94 the transform coefficients and pixels are allocated separately in the VVCFrameContext: https://github.com/ffvvc/FFmpeg/blob/36e417619cc69865cf705ca02ff388fce4bf9a94/libavcodec/vvc/vvcdec.c#L98-L103 However this is redundant — pixels is not used before the transform and coeffs is not used afterwards.

I wondered if it would be possible to allocate for the coefficients statically for each transform block, for example at the top of itransform, however this is not possible as the CABAC must be able to put data into the coefficients array: https://github.com/ffvvc/FFmpeg/blob/2a7d709ad771c8cba4b14fb630b378740253ab27/libavcodec/vvc/vvc_cabac.c#L2145-L2152

Perhaps there is some way the same memory could be used for both pixels and coeffs, as before, but reinterpreted as int16_ts or ints at different points in the program? The size of a pixel is always <= the size of a transform coefficient in the current version of VVC, the extreme case being if the bit depth is 16 and extended_precision_flag is 0, then both are 16 bits. Would just setting tb->pixels = (int16_t *) tb->coeffs be acceptable given this is the case?

frankplow commented 1 year ago

@nuomi2021 I have spent some more time trying to port the dav1d assembly to FFVVC, and I have come to the conclusion that it will not be possible. The reasoning for this is quite nuanced and mathematical so please bear with me. I am sure you are familiar with a good deal of the background here but I have added some for quick reference and to aid any other readers.

Any N-point 1D DCT (-II) can be expressed as a matrix multiplication, requiring N² multiplications and N² - N additions. This is not the most efficient way to compute the DCT however — the elements of the transform matrix are values of the cosine function, and so are not independent. In particular, the matrix exhibits various types of redundancies. This includes simple symmetries which can be accounted for using butterfly algorithms, as well as more complex relationships which rely on trigonometric identities. Fast DCT algorithms leverage these redundancies to compute the DCT with fewer multiplications and additions. dav1d in particular uses the Chen algorithm [1] to compute the DCT. A more recent description of this algorithm can be found in [2].

Modern video coding standards do not use the actual DCT, whose transform coefficients are real numbers between -1 and 1, but instead an approximation of it using only integers. This is done for a number of pragmatic reasons centring around efficiency and consistency. Simply multiplying the DCT by a constant and rounding to the nearest integer produces an accurate approximation of the DCT and therefore a good compression performance, however destroys many of the redundancies in the DCT. The design of an effective DCT approximation is a bit of an art form involving the balancing of these two competing factors: accuracy to the DCT, which results in good compression, and preservation of redundancies, which results in good performance.

Unfortunately, the approximation used in HEVC and the derived approximation in VVC, do not preserve the trigonometric redundancies required to employ the Chen algorithm used in the dav1d transform optimisation [3, section 6.2.2]. This makes the dav1d ITX assembly unusable for VVC.

I should have caught this earlier really, but my plan now is to revise my PR #114 to use the work done in https://github.com/ffvvc/FFmpeg/pull/117/commits/ea7a0ceec6041c12af370f6056cdf2939e0f82f5 and https://github.com/ffvvc/FFmpeg/pull/117/commits/36e417619cc69865cf705ca02ff388fce4bf9a94, as well as some other ideas I have seen in my work with the dav1d assembly. I hope I can make these optimisations several times faster than they already are, but fundamentally the speed is going to be limited by the slower Hung algorithm that is supported by VVC

[1] Wen-Hsiung Chen, C. Smith, S.Fralick, A Fast Computational Algorithm for the Discrete Cosine Transform, DOI 10.1109/TCOM.1977.1093941, (free pdf) [2] C. J. Tablada, T. L. T. da Silveira, R. J. Cintra, F. M. Bayer, DCT Approximations Based on Chen's Factorization, DOI 10.1016/j.image.2017.06.014, (free preprint) [3] Vivienne Sze, Madhukar Budagavi, Gary J. Sullivan, High Efficiency Video Coding (HEVC) Algorithms and Architectures, DOI 10.1007/978-3-319-06895-4

nuomi2021 commented 1 year ago

@nuomi2021 Not urgent at the minute, but I thought I should make a comment about this before I forget it. Can you think of a better way to deal with the memory for transform blocks' transform coefficients and pixels?

After 36e4176 the transform coefficients and pixels are allocated separately in the VVCFrameContext:

https://github.com/ffvvc/FFmpeg/blob/36e417619cc69865cf705ca02ff388fce4bf9a94/libavcodec/vvc/vvcdec.c#L98-L103

However this is redundant — pixels is not used before the transform and coeffs is not used afterwards.

Not possible. coeffs has a different layout than pixels. Suppose you have 4 64x64 blocks. each block has 1 coeffs. We only use the first 4 coeffs. combine coeffs buffer with pixels will overite the second blocks coeffs

I wondered if it would be possible to allocate for the coefficients statically for each transform block, for example at the top of itransform, however this is not possible as the CABAC must be able to put data into the coefficients array:

https://github.com/ffvvc/FFmpeg/blob/2a7d709ad771c8cba4b14fb630b378740253ab27/libavcodec/vvc/vvc_cabac.c#L2145-L2152

Perhaps there is some way the same memory could be used for both pixels and coeffs, as before, but reinterpreted as int16_ts or ints at different points in the program? The size of a pixel is always <= the size of a transform coefficient in the current version of VVC, the extreme case being if the bit depth is 16 and extended_precision_flag is 0, then both are 16 bits. Would just setting tb->pixels = (int16_t *) tb->coeffs be acceptable given this is the case?

Could you explain why you want to do this? thank you

nuomi2021 commented 1 year ago

@nuomi2021 I have spent some more time trying to port the dav1d assembly to FFVVC, and I have come to the conclusion that it will not be possible. The reasoning for this is quite nuanced and mathematical so please bear with me. I am sure you are familiar with a good deal of the background here but I have added some for quick reference and to aid any other readers.

Any N-point 1D DCT (-II) can be expressed as a matrix multiplication, requiring N² multiplications and N² - N additions. This is not the most efficient way to compute the DCT however — the elements of the transform matrix are values of the cosine function, and so are not independent. In particular, the matrix exhibits various types of redundancies. This includes simple symmetries which can be accounted for using butterfly algorithms, as well as more complex relationships which rely on trigonometric identities. Fast DCT algorithms leverage these redundancies to compute the DCT with fewer multiplications and additions. dav1d in particular uses the Chen algorithm [1] to compute the DCT. A more recent description of this algorithm can be found in [2].

Modern video coding standards do not use the actual DCT, whose transform coefficients are real numbers between -1 and 1, but instead an approximation of it using only integers. This is done for a number of pragmatic reasons centring around efficiency and consistency. Simply multiplying the DCT by a constant and rounding to the nearest integer produces an accurate approximation of the DCT and therefore a good compression performance, however destroys many of the redundancies in the DCT. The design of an effective DCT approximation is a bit of an art form involving the balancing of these two competing factors: accuracy to the DCT, which results in good compression, and preservation of redundancies, which results in good performance.

Unfortunately, the approximation used in HEVC and the derived approximation in VVC, do not preserve the trigonometric redundancies required to employ the Chen algorithm used in the dav1d transform optimisation [3, section 6.2.2]. This makes the dav1d ITX assembly unusable for VVC.

I should have caught this earlier really, but my plan now is to revise my PR #114 to use the work done in ea7a0ce and 36e4176, as well as some other ideas I have seen in my work with the dav1d assembly. I hope I can make these optimisations several times faster than they already are, but fundamentally the speed is going to be limited by the slower Hung algorithm that is supported by VVC Thank you for your hardwork.

No worries. Please continue working on https://github.com/ffvvc/FFmpeg/pull/114 In the meaning time, could you check vvdec c and asm code?

[1] Wen-Hsiung Chen, C. Smith, S.Fralick, A Fast Computational Algorithm for the Discrete Cosine Transform, DOI 10.1109/TCOM.1977.1093941, (free pdf) [2] C. J. Tablada, T. L. T. da Silveira, R. J. Cintra, F. M. Bayer, DCT Approximations Based on Chen's Factorization, DOI 10.1016/j.image.2017.06.014, (free preprint) [3] Vivienne Sze, Madhukar Budagavi, Gary J. Sullivan, High Efficiency Video Coding (HEVC) Algorithms and Architectures, DOI 10.1007/978-3-319-06895-4

frankplow commented 1 year ago

@nuomi2021

No worries. Please continue working on #114 In the meaning time, could you check vvdec c and asm code?

vvdec has a very straightforward implementation in assembly. It does not use any fast algorithm like Hung, instead the transform is a simple matrix multiplication. I have read that in some instances this may be faster as it is more readily parallelisable, but I am not convinced — using perf.py I measure only a 1.1% performance improvement when toggling vvdec's transform optimisations, which is less than the difference I got with my optimisations using the Hung algorithm in #114. Additionally, it would be quite a lot of work to port these, as they are written using intrinsics and C++ templates.

I've also looked into adapting the assembly optimisations already in FFmpeg for HEVC. These are written directly in NASM assembly using x86inc.asm, so would require less work to port on that front. Additionally, they use the Hung algorithm, which theoretically should speed them up. The big obstacle in adapting them is the fact that they use 16-bit integers for both pixels and transform coefficients. This is not suitable for VVC as when sps_extended_precision_flag is set, the transform coefficients are bit_depth + 6 and so can be up to 22-bit. I can see two ways around this:

  1. The easiest fix is to convert the 32-bit transform coefficients to 16-bit within the functions when possible, and disable the optimisations otherwise. This is complicated by the fact that these optimisations do a lot of writing to and from memory, and so there is no one easy place where movas to load from memory can be replaced with packssdws for example. I have already implemented the naive approach of doing O(n²) packssdws at the beginning of the function, however this degraded the performance to below the C implementation. (see correction below)
  2. Alternatively, the function prototype itself could be changed to use 16-bit transform coefficients, either when the bit depth <= 10 (this simplies sps_extended_precision_flag = 0) or based on sps_extended_precision_flag itself. This wouldn't require major changes to the assembly implementation, however there is currently no mechanism to have function prototypes change based on bit depth. I think this is the more worthwhile solution, as the ability to use smaller types when possible enables faster optimisations, but how much work do you think this would be?

In either case, the 2x2, 64x64, 1D and rectangular transforms would also need to be implemented as these are not present in HEVC.


Correction: My checkasm test was flawed and this is not the case. My work done porting these optimisations, as well as the corrected checkasm benchmark, can be found on #130.