Closed frankplow closed 1 year ago
vvcdsp_init.c:264:33: error: assignment to expression with array type
264 | c->itx.itx[DCT2][0] = ff_vvc_inv_dct2_2_##opt; \
| ^
vvcdsp_init.c:279:13: note: in expansion of macro 'IDCT2_INIT'
279 | IDCT2_INIT(avx2);
| ^~~~~~~~~~
Can add definition
...
@Jamaika1
vvcdsp_init.c:264:33: error: assignment to expression with array type 264 | c->itx.itx[DCT2][0] = ff_vvc_inv_dct2_2_##opt; \ | ^ vvcdsp_init.c:279:13: note: in expansion of macro 'IDCT2_INIT' 279 | IDCT2_INIT(avx2); | ^~~~~~~~~~
Can add definition #if HAVE_AVX2_EXTERNAL ... #endif
I'm sorry I don't see where his code is from - there are currently no AVX2 optimisations implemented on this branch and it looks to me like the version of vvcdsp_init.c this is referencing is from my frankplow:idct-asm branch.
Rebase and re-target onto new main, to reflect #116.
@nuomi2021 Not urgent at the minute, but I thought I should make a comment about this before I forget it. Can you think of a better way to deal with the memory for transform blocks' transform coefficients and pixels?
After 36e417619cc69865cf705ca02ff388fce4bf9a94 the transform coefficients and pixels are allocated separately in the VVCFrameContext
:
https://github.com/ffvvc/FFmpeg/blob/36e417619cc69865cf705ca02ff388fce4bf9a94/libavcodec/vvc/vvcdec.c#L98-L103
However this is redundant — pixels
is not used before the transform and coeffs
is not used afterwards.
I wondered if it would be possible to allocate for the coefficients statically for each transform block, for example at the top of itransform
, however this is not possible as the CABAC must be able to put data into the coefficients array:
https://github.com/ffvvc/FFmpeg/blob/2a7d709ad771c8cba4b14fb630b378740253ab27/libavcodec/vvc/vvc_cabac.c#L2145-L2152
Perhaps there is some way the same memory could be used for both pixels
and coeffs
, as before, but reinterpreted as int16_t
s or int
s at different points in the program? The size of a pixel is always <= the size of a transform coefficient in the current version of VVC, the extreme case being if the bit depth is 16 and extended_precision_flag
is 0, then both are 16 bits. Would just setting tb->pixels = (int16_t *) tb->coeffs
be acceptable given this is the case?
@nuomi2021 I have spent some more time trying to port the dav1d assembly to FFVVC, and I have come to the conclusion that it will not be possible. The reasoning for this is quite nuanced and mathematical so please bear with me. I am sure you are familiar with a good deal of the background here but I have added some for quick reference and to aid any other readers.
Any N-point 1D DCT (-II) can be expressed as a matrix multiplication, requiring N² multiplications and N² - N additions. This is not the most efficient way to compute the DCT however — the elements of the transform matrix are values of the cosine function, and so are not independent. In particular, the matrix exhibits various types of redundancies. This includes simple symmetries which can be accounted for using butterfly algorithms, as well as more complex relationships which rely on trigonometric identities. Fast DCT algorithms leverage these redundancies to compute the DCT with fewer multiplications and additions. dav1d in particular uses the Chen algorithm [1] to compute the DCT. A more recent description of this algorithm can be found in [2].
Modern video coding standards do not use the actual DCT, whose transform coefficients are real numbers between -1 and 1, but instead an approximation of it using only integers. This is done for a number of pragmatic reasons centring around efficiency and consistency. Simply multiplying the DCT by a constant and rounding to the nearest integer produces an accurate approximation of the DCT and therefore a good compression performance, however destroys many of the redundancies in the DCT. The design of an effective DCT approximation is a bit of an art form involving the balancing of these two competing factors: accuracy to the DCT, which results in good compression, and preservation of redundancies, which results in good performance.
Unfortunately, the approximation used in HEVC and the derived approximation in VVC, do not preserve the trigonometric redundancies required to employ the Chen algorithm used in the dav1d transform optimisation [3, section 6.2.2]. This makes the dav1d ITX assembly unusable for VVC.
I should have caught this earlier really, but my plan now is to revise my PR #114 to use the work done in https://github.com/ffvvc/FFmpeg/pull/117/commits/ea7a0ceec6041c12af370f6056cdf2939e0f82f5 and https://github.com/ffvvc/FFmpeg/pull/117/commits/36e417619cc69865cf705ca02ff388fce4bf9a94, as well as some other ideas I have seen in my work with the dav1d assembly. I hope I can make these optimisations several times faster than they already are, but fundamentally the speed is going to be limited by the slower Hung algorithm that is supported by VVC
[1] Wen-Hsiung Chen, C. Smith, S.Fralick, A Fast Computational Algorithm for the Discrete Cosine Transform, DOI 10.1109/TCOM.1977.1093941, (free pdf) [2] C. J. Tablada, T. L. T. da Silveira, R. J. Cintra, F. M. Bayer, DCT Approximations Based on Chen's Factorization, DOI 10.1016/j.image.2017.06.014, (free preprint) [3] Vivienne Sze, Madhukar Budagavi, Gary J. Sullivan, High Efficiency Video Coding (HEVC) Algorithms and Architectures, DOI 10.1007/978-3-319-06895-4
@nuomi2021 Not urgent at the minute, but I thought I should make a comment about this before I forget it. Can you think of a better way to deal with the memory for transform blocks' transform coefficients and pixels?
After 36e4176 the transform coefficients and pixels are allocated separately in the
VVCFrameContext
:However this is redundant —
pixels
is not used before the transform andcoeffs
is not used afterwards.
Not possible. coeffs has a different layout than pixels. Suppose you have 4 64x64 blocks. each block has 1 coeffs. We only use the first 4 coeffs. combine coeffs buffer with pixels will overite the second blocks coeffs
I wondered if it would be possible to allocate for the coefficients statically for each transform block, for example at the top of
itransform
, however this is not possible as the CABAC must be able to put data into the coefficients array:Perhaps there is some way the same memory could be used for both
pixels
andcoeffs
, as before, but reinterpreted asint16_t
s orint
s at different points in the program? The size of a pixel is always <= the size of a transform coefficient in the current version of VVC, the extreme case being if the bit depth is 16 andextended_precision_flag
is 0, then both are 16 bits. Would just settingtb->pixels = (int16_t *) tb->coeffs
be acceptable given this is the case?
Could you explain why you want to do this? thank you
@nuomi2021 I have spent some more time trying to port the dav1d assembly to FFVVC, and I have come to the conclusion that it will not be possible. The reasoning for this is quite nuanced and mathematical so please bear with me. I am sure you are familiar with a good deal of the background here but I have added some for quick reference and to aid any other readers.
Any N-point 1D DCT (-II) can be expressed as a matrix multiplication, requiring N² multiplications and N² - N additions. This is not the most efficient way to compute the DCT however — the elements of the transform matrix are values of the cosine function, and so are not independent. In particular, the matrix exhibits various types of redundancies. This includes simple symmetries which can be accounted for using butterfly algorithms, as well as more complex relationships which rely on trigonometric identities. Fast DCT algorithms leverage these redundancies to compute the DCT with fewer multiplications and additions. dav1d in particular uses the Chen algorithm [1] to compute the DCT. A more recent description of this algorithm can be found in [2].
Modern video coding standards do not use the actual DCT, whose transform coefficients are real numbers between -1 and 1, but instead an approximation of it using only integers. This is done for a number of pragmatic reasons centring around efficiency and consistency. Simply multiplying the DCT by a constant and rounding to the nearest integer produces an accurate approximation of the DCT and therefore a good compression performance, however destroys many of the redundancies in the DCT. The design of an effective DCT approximation is a bit of an art form involving the balancing of these two competing factors: accuracy to the DCT, which results in good compression, and preservation of redundancies, which results in good performance.
Unfortunately, the approximation used in HEVC and the derived approximation in VVC, do not preserve the trigonometric redundancies required to employ the Chen algorithm used in the dav1d transform optimisation [3, section 6.2.2]. This makes the dav1d ITX assembly unusable for VVC.
I should have caught this earlier really, but my plan now is to revise my PR #114 to use the work done in ea7a0ce and 36e4176, as well as some other ideas I have seen in my work with the dav1d assembly. I hope I can make these optimisations several times faster than they already are, but fundamentally the speed is going to be limited by the slower Hung algorithm that is supported by VVC Thank you for your hardwork.
No worries. Please continue working on https://github.com/ffvvc/FFmpeg/pull/114 In the meaning time, could you check vvdec c and asm code?
[1] Wen-Hsiung Chen, C. Smith, S.Fralick, A Fast Computational Algorithm for the Discrete Cosine Transform, DOI 10.1109/TCOM.1977.1093941, (free pdf) [2] C. J. Tablada, T. L. T. da Silveira, R. J. Cintra, F. M. Bayer, DCT Approximations Based on Chen's Factorization, DOI 10.1016/j.image.2017.06.014, (free preprint) [3] Vivienne Sze, Madhukar Budagavi, Gary J. Sullivan, High Efficiency Video Coding (HEVC) Algorithms and Architectures, DOI 10.1007/978-3-319-06895-4
@nuomi2021
No worries. Please continue working on #114 In the meaning time, could you check vvdec c and asm code?
vvdec has a very straightforward implementation in assembly. It does not use any fast algorithm like Hung, instead the transform is a simple matrix multiplication. I have read that in some instances this may be faster as it is more readily parallelisable, but I am not convinced — using perf.py
I measure only a 1.1% performance improvement when toggling vvdec's transform optimisations, which is less than the difference I got with my optimisations using the Hung algorithm in #114. Additionally, it would be quite a lot of work to port these, as they are written using intrinsics and C++ templates.
I've also looked into adapting the assembly optimisations already in FFmpeg for HEVC. These are written directly in NASM assembly using x86inc.asm, so would require less work to port on that front. Additionally, they use the Hung algorithm, which theoretically should speed them up. The big obstacle in adapting them is the fact that they use 16-bit integers for both pixels and transform coefficients. This is not suitable for VVC as when sps_extended_precision_flag
is set, the transform coefficients are bit_depth + 6
and so can be up to 22-bit. I can see two ways around this:
mova
s to load from memory can be replaced with packssdw
s for example. I have already implemented the naive approach of doing O(n²) packssdw
s at the beginning of the function, sps_extended_precision_flag = 0
) or based on sps_extended_precision_flag
itself. This wouldn't require major changes to the assembly implementation, however there is currently no mechanism to have function prototypes change based on bit depth. I think this is the more worthwhile solution, as the ability to use smaller types when possible enables faster optimisations, but how much work do you think this would be?In either case, the 2x2, 64x64, 1D and rectangular transforms would also need to be implemented as these are not present in HEVC.
Correction: My checkasm test was flawed and this is not the case. My work done porting these optimisations, as well as the corrected checkasm benchmark, can be found on #130.
This PR will port dav1d's AVX2 inverse transform optimisations.