GCC versus AOCC optimization

1div0 commented 2 years ago


    Total Frames |   Bitrate     Y-PSNR    U-PSNR    V-PSNR    YUV-PSNR   
-         500    a   69383.0744   36.6916   39.6481   40.8076   37.5261
-finished @ Sat Jan 22 21:21:27 2022
+         500    a   69383.2840   36.6916   39.6481   40.8076   37.5261
+finished @ Sun Jan 23 11:44:28 2022

-Total Time: 14645.331 sec. Fps(avg): 0.034 encoded Frames 500
+Total Time: 12097.595 sec. Fps(avg): 0.041 encoded Frames 500

GCC flags: -flto -O3 1_CrowdRun_2160p50_CgrLevels_MASTERSVTdec05.QP22.266.log

AOCC flags: -march=znver3 -flto -Ofast -mllvm -enable-strided-vectorization 1_CrowdRun_2160p50_CgrLevels_MASTERSVTdec05.QP22.266.log

Screenshot from 2022-01-23 17-22-49

In other words, AOCC produced 21% faster code than GCC.

adamjw24 commented 2 years ago

Interesting. Are both bitstreams decodable? You are encoding with the DPH SEI enabled. Is it correctly reconstructed by the decoder?

It would be interesting to know where the difference comes from. Could you check whether the AOCC executable provides the same result with --SIMD=SCALAR? If not, there is an implementation problem somewhere. If yes, the difference is probably caused by some floating-point calculation instability, which would be annoying but acceptable.

1div0 commented 2 years ago

All bitstreams with QP equal to 22, 27, 32, 37, 42, and 47 are perfectly decodable with VVdeC version 1.3.0.

I will restart the encoding with SIMD scalar and check the results later today.

1div0 commented 2 years ago

[peter.kovar@vmi728485 ~]$ VVenC.sh 
+ COMPILER=GCC
+ VERSION=8.5.0
+ CONFIGURATION=GCC/8.5.0
+ ENCODER=/usr/local/GCC/8.5.0/bin/vvencFFapp
+ OUTPUT_PATH=/home/peter.kovar/Video/VVC/GCC/8.5.0
+ mkdir -p /home/peter.kovar/Video/VVC/GCC/8.5.0
+ HORIZONTAL=3840
+ VERTICAL=2160
+ SIZE=3840x2160
+ RATE=50
+ NAME=1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_
+ for QP in 32
+ INPUT=/home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv
+ OUTPUT=/home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266
+ LOG=/home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266.log
+ nice /usr/local/GCC/8.5.0/bin/vvencFFapp --InputFile /home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv --Size 3840x2160 --framerate 50 --InputBitDepth 10 --QP 32 --SIMD=SCALAR --Threads 8 --BitstreamFile /home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266

real    178m35,012s
user    1031m1,738s
sys 3m47,406s
[peter.kovar@vmi728485 ~]$ vim Scripts/VVenC.sh 
[peter.kovar@vmi728485 ~]$ VVenC.sh 
+ COMPILER=AOCC
+ VERSION=3.2.0
+ CONFIGURATION=AOCC/3.2.0
+ ENCODER=/usr/local/AOCC/3.2.0/bin/vvencFFapp
+ OUTPUT_PATH=/home/peter.kovar/Video/VVC/AOCC/3.2.0
+ mkdir -p /home/peter.kovar/Video/VVC/AOCC/3.2.0
+ HORIZONTAL=3840
+ VERTICAL=2160
+ SIZE=3840x2160
+ RATE=50
+ NAME=1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_
+ for QP in 32
+ INPUT=/home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv
+ OUTPUT=/home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266
+ LOG=/home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266.log
+ nice /usr/local/AOCC/3.2.0/bin/vvencFFapp --InputFile /home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv --Size 3840x2160 --framerate 50 --InputBitDepth 10 --QP 32 --SIMD=SCALAR --Threads 8 --BitstreamFile /home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266

real    144m1,126s
user    864m45,966s
sys 2m23,265s

€ diff -u '/run/user/1001/gvfs/sftp:host=düsseldorf.reflexion.tv/home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTERSVTdec05.QP32.266.log' '/run/user/1001/gvfs/sftp:host=düsseldorf.reflexion.tv/home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTERSVTdec05.QP32.266.log' > ~/"GCC 8.5.0 versus AOCC 3.2.0 comparison.diff.txt"

GCC 8.5.0 versus AOCC 3.2.0 comparison.diff.txt Screenshot from 2022-01-24 18-17-01

The AOCC-generated encoder was 24% faster, even with SIMD set to scalar!?

adamjw24 commented 2 years ago

Thanks for checking conformance. Without SIMD the results seem to be the same, but there is actually some floating-point SIMD in the encoder, so the difference with SIMD enabled is probably a floating-point operation influencing an encoding decision. I'll try to track it down sometime, but it seems uncritical.

The speed-up really is impressive. Thanks for sharing! I'm actually surprised because with the amount of manual optimization we did, I didn't think an architecture optimizing compiler would matter so much. It'd be interesting to see the profiling to get an idea where AOCC was able to optimize so much (e.g. with ENABLE_TIME_PROFILING). I might have a look sometime.

We cannot really act on it though.

If you want to simplify your build process to utilize this, you can specify the target arch directly in the make-cmd as:

$ make clean
$ make release ... enable-arch=znver3

1div0 commented 2 years ago

AVX-512 is not utilized yet. I will try PGO this week and share the measured results.

adamjw24 commented 2 years ago

AVX2 brings at most 10% over SSE42, so I wouldn't get my hopes up for AVX512.

If you find a way to automate PGO as a part of our CMake build process, feel free to make a pull request. Looking forward to the results.

1div0 commented 2 years ago

It is not easy.

ccmake ../../../../..

CCACHE_FOUND                     /usr/bin/ccache
CMAKE_ADDR2LINE                  /usr/bin/addr2line
CMAKE_AR                         /usr/bin/ar
CMAKE_BUILD_TYPE                 Debug
CMAKE_COLOR_MAKEFILE             ON
CMAKE_CXX_COMPILER               /opt/AMD/aocc-compiler-3.2.0/bin/clang++
CMAKE_CXX_COMPILER_AR            /opt/AMD/aocc-compiler-3.2.0/bin/llvm-ar
CMAKE_CXX_COMPILER_RANLIB        /opt/AMD/aocc-compiler-3.2.0/bin/llvm-ranlib
CMAKE_CXX_FLAGS                  -march=znver3 -flto -Ofast -mllvm -enable-strided-vectorization
CMAKE_CXX_FLAGS_DEBUG            -g -fprofile-instr-generate
CMAKE_CXX_FLAGS_MINSIZEREL       -Os -DNDEBUG
CMAKE_CXX_FLAGS_PROFILE          -O0 -fprofile-instr-generate
CMAKE_CXX_FLAGS_RELEASE          -O3 -DNDEBUG -fprofile-instr-use
CMAKE_CXX_FLAGS_RELWITHDEBINFO   -O2 -g -DNDEBUG

time make --jobs 8

real    9m1,768s
user    36m5,663s
sys     0m55,690s

/opt/AMD/aocc-compiler-3.2.0/bin/llvm-profdata merge -output=default.profdata default.profraw

CMAKE_CXX_FLAGS_RELEASE          -O3 -fprofile-instr-use=/usr/src/github.com/1div0/vvenc/Linux/x86-64/EPYC/AOCC/3.2.0/default.profdata

[peter.kovar@vmi728485 3.2.0]$ VVenC.sh 
+ COMPILER=AOCC
+ VERSION=3.2.0
+ CONFIGURATION=AOCC/3.2.0
+ ENCODER=/usr/local/AOCC/3.2.0/bin/vvencFFapp
+ OUTPUT_PATH=/home/peter.kovar/Video/VVC/AOCC/3.2.0
+ mkdir -p /home/peter.kovar/Video/VVC/AOCC/3.2.0
+ HORIZONTAL=1920
+ VERTICAL=1080
+ SIZE=1920x1080
+ RATE=24
+ NAME=Kimono1_1920x1080_24
+ for QP in 32
+ INPUT=/home/peter.kovar/Video/YUV/Kimono1_1920x1080_24.yuv
+ OUTPUT=/home/peter.kovar/Video/VVC/AOCC/3.2.0/Kimono1_1920x1080_24.QP32.266
+ LOG=/home/peter.kovar/Video/VVC/AOCC/3.2.0/Kimono1_1920x1080_24.QP32.266.log
+ nice /usr/local/AOCC/3.2.0/bin/vvencFFapp --InputFile /home/peter.kovar/Video/YUV/Kimono1_1920x1080_24.yuv --Size 1920x1080 --framerate 24 --InputBitDepth 8 --QP 32 --Threads 8 --BitstreamFile /home/peter.kovar/Video/VVC/AOCC/3.2.0/Kimono1_1920x1080_24.QP32.266

real    405m3,083s
user    2877m46,438s
sys     1m20,165s

/opt/AMD/aocc-compiler-3.2.0/bin/llvm-profdata merge -output=default.profdata default.profraw

[peter.kovar@vmi728485 3.2.0]$ file default.prof*
default.profdata: LLVM indexed profile data, version 7
default.profraw:  LLVM raw profile data, version 7

[peter.kovar@vmi728485 3.2.0]$ time make --jobs 8
[  1%] Building CXX object source/Lib/apputils/CMakeFiles/apputils.dir/ParseArg.cpp.o
[  1%] Building CXX object source/Lib/apputils/CMakeFiles/apputils.dir/YuvFileIO.cpp.o
[  2%] Building CXX object source/Lib/apputils/CMakeFiles/apputils.dir/VVEncAppCfg.cpp.o
[  3%] Linking CXX static library ../../../../../../../../lib/release-static/libapputils.a
[  3%] Built target apputils
[  3%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/AdaptiveLoopFilter.cpp.o
[  4%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/AffineGradientSearch.cpp.o
[  5%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/BitStream.cpp.o
[  5%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/CodingStructure.cpp.o
[  7%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/ContextModelling.cpp.o
[  7%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Buffer.cpp.o
[  8%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Contexts.cpp.o
[  9%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/DepQuant.cpp.o
[  9%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/InterPrediction.cpp.o
[ 10%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/InterpolationFilter.cpp.o
[ 11%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/IntraPrediction.cpp.o
[ 12%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/LoopFilter.cpp.o
[ 12%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/MCTF.cpp.o
[ 13%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/MatrixIntraPrediction.cpp.o
[ 14%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Mv.cpp.o
[ 15%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/PicYuvMD5.cpp.o
[ 15%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Picture.cpp.o
[ 16%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/ProfileLevelTier.cpp.o
[ 17%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Quant.cpp.o
[ 18%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/QuantRDOQ.cpp.o
[ 18%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/QuantRDOQ2.cpp.o
[ 19%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/RdCost.cpp.o
[ 20%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Reshape.cpp.o
[ 21%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Rom.cpp.o
[ 21%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/RomTr.cpp.o
[ 22%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/SEI.cpp.o
[ 23%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/SampleAdaptiveOffset.cpp.o
[ 24%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/SearchSpaceCounter.cpp.o
[ 24%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Slice.cpp.o
[ 25%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/StatCounter.cpp.o
[ 26%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/TimeProfiler.cpp.o
[ 27%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/TrQuant.cpp.o
[ 27%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/TrQuant_EMT.cpp.o
[ 28%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Unit.cpp.o
[ 29%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/UnitPartitioner.cpp.o
error: no profile data available for file "StatCounter.cpp" [-Werror,-Wprofile-instr-unprofiled]
1 error generated.
make[2]: *** [source/Lib/vvenc/CMakeFiles/vvenc.dir/build.make:482: source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/StatCounter.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:188: source/Lib/vvenc/CMakeFiles/vvenc.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

real    0m50,479s
user    3m9,502s
sys     0m22,095s

1div0 commented 2 years ago

@adamjw24 Am I doing something wrong here?

adamjw24 commented 2 years ago

Hmm... from the log files I understand that to do a profile-based build, you need profiling info for every object? This will not be possible with vvenc, for the following reasons:

The tracing and instrumentation functionalities are disabled by default, so the object would be empty. Also, no one cares about the performance of the tracing and the instrumentation.
The decoding functionality is not used most of the time; it would be impractical to generate profiling data for it only because of the build.
Stuff like weighted prediction is only used with specific configs, so the files might also not be used (i.e. no profiling data for those files).

adamjw24 commented 2 years ago

Oh, wait, I just had a second look, and found the following:

[ 29%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/UnitPartitioner.cpp.o
error: no profile data available for file "StatCounter.cpp" [-Werror,-Wprofile-instr-unprofiled]
1 error generated.
make[2]: *** [source/Lib/vvenc/CMakeFiles/vvenc.dir/build.make:482: source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/StatCounter.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....

It looks like you should add the following flag to the build: -Wnoprofile-instr-unprofiled (or however the syntax is to disable -Wprofile-instr-unprofiled)

1div0 commented 2 years ago

Thank you.

-Wno-profile-instr-unprofiled

And the compiler frontend just exploded.

Going to report that to AMD. Attachments: EncCu-daf955.cpp.txt, EncCu-daf955.sh.txt

adamjw24 commented 2 years ago

You're welcome.

You might try with an older clang version. We have already had some issues with bleeding-edge compilers a few times.

adamjw24 commented 2 years ago

Closed accidentally. I misread the issue number to close.

1div0 commented 2 years ago

AOCC 3.1.0 based on LLVM 12.0.0 just compiled OK.

//Flags used by the CXX compiler during RELEASE builds.
CMAKE_CXX_FLAGS_RELEASE:STRING=-Ofast -flto -mllvm -enable-strided-vectorization -Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date -fprofile-instr-use=/usr/src/github.com/1div0/vvenc/Linux/x86-64/EPYC/AOCC/3.1.0/default.profdata

1div0 commented 2 years ago

Result in https://düsseldorf.reflexion.tv/nextcloud/index.php/s/Y9E884z6wAkNNSZ

1div0 commented 2 years ago

Recently, I have compiled the LLVM v15 Clang compiler and discovered this:

/usr/src/github.com/1div0/vvenc/source/Lib/EncoderLib/IntraSearch.cpp:2509:27: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
currTU.jointCbCr = (TU::getCbf(currTU, COMP_Cb) | TU::getCbf(currTU, COMP_Cr)) ? bestJointCbCr : 0;
/usr/src/github.com/1div0/vvenc/source/Lib/EncoderLib/IntraSearch.cpp:2509:27: note: cast one or both operands to int to silence this warning

adamjw24 commented 2 years ago

I think clang v15 is way too early to take its warnings seriously. We had a lot of problems with early compiler versions, and I'd rather wait for them to mature a bit.

Might pull in the PR tho, since it seems logical (no pun intended).

1div0 commented 2 years ago

So far, no word from AMD about the compiler crash. However, these results confirm my observation: https://www.phoronix.com/scan.php?page=article&item=amd-aocc-milanx&num=4

adamjw24 commented 1 year ago

Closing for now, as not really actionable. This is more informative.

fraunhoferhhi / vvenc

GCC versus AOCC optimization #127