Closed 1div0 closed 1 year ago
Interesting. Are both bitstreams decodable? You are encoding with the DPH SEI enabled. Is it correctly reconstructed by the decoder?
It would be interesting to know where the difference comes from. Could you check whether the AOCC executable produces the same result with --SIMD=SCALAR? If not, there is an implementation problem somewhere. If yes, the difference is probably caused by some floating-point calculation instability, which would be annoying but acceptable.
All bitstreams with QP equal to 22, 27, 32, 37, 42, 47 are perfectly decodable with VVdeC version 1.3.0.
I will restart the encoding with SIMD scalar and check the results later today.
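A minimal sketch of how such a check could be scripted, assuming both encoders have already written their bitstreams to disk. The helper names `same_bitstream` and `report` and the paths in the usage comment are hypothetical, not part of VVenC.

```shell
# same_bitstream FILE_A FILE_B
# Succeeds (exit status 0) only if both files are byte-for-byte identical.
same_bitstream() {
    cmp -s -- "$1" "$2"
}

# report FILE_A FILE_B
# Prints a one-line verdict; on a mismatch it also prints checksums
# so the two files can be told apart later.
report() {
    if same_bitstream "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        cksum -- "$1" "$2"
    fi
}

# Hypothetical usage, with paths modelled on the transcript below:
# report GCC/8.5.0/clip.QP32.266 AOCC/3.2.0/clip.QP32.266
```

Comparing the bitstreams directly is stricter than comparing the logs: two runs can report the same rounded statistics and still differ in a few bytes.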
[peter.kovar@vmi728485 ~]$ VVenC.sh
+ COMPILER=GCC
+ VERSION=8.5.0
+ CONFIGURATION=GCC/8.5.0
+ ENCODER=/usr/local/GCC/8.5.0/bin/vvencFFapp
+ OUTPUT_PATH=/home/peter.kovar/Video/VVC/GCC/8.5.0
+ mkdir -p /home/peter.kovar/Video/VVC/GCC/8.5.0
+ HORIZONTAL=3840
+ VERTICAL=2160
+ SIZE=3840x2160
+ RATE=50
+ NAME=1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_
+ for QP in 32
+ INPUT=/home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv
+ OUTPUT=/home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266
+ LOG=/home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266.log
+ nice /usr/local/GCC/8.5.0/bin/vvencFFapp --InputFile /home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv --Size 3840x2160 --framerate 50 --InputBitDepth 10 --QP 32 --SIMD=SCALAR --Threads 8 --BitstreamFile /home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266
real 178m35,012s
user 1031m1,738s
sys 3m47,406s
[peter.kovar@vmi728485 ~]$ vim Scripts/VVenC.sh
[peter.kovar@vmi728485 ~]$ VVenC.sh
+ COMPILER=AOCC
+ VERSION=3.2.0
+ CONFIGURATION=AOCC/3.2.0
+ ENCODER=/usr/local/AOCC/3.2.0/bin/vvencFFapp
+ OUTPUT_PATH=/home/peter.kovar/Video/VVC/AOCC/3.2.0
+ mkdir -p /home/peter.kovar/Video/VVC/AOCC/3.2.0
+ HORIZONTAL=3840
+ VERTICAL=2160
+ SIZE=3840x2160
+ RATE=50
+ NAME=1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_
+ for QP in 32
+ INPUT=/home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv
+ OUTPUT=/home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266
+ LOG=/home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266.log
+ nice /usr/local/AOCC/3.2.0/bin/vvencFFapp --InputFile /home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv --Size 3840x2160 --framerate 50 --InputBitDepth 10 --QP 32 --SIMD=SCALAR --Threads 8 --BitstreamFile /home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266
real 144m1,126s
user 864m45,966s
sys 2m23,265s
€ diff -u '/run/user/1001/gvfs/sftp:host=düsseldorf.reflexion.tv/home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266.log' '/run/user/1001/gvfs/sftp:host=düsseldorf.reflexion.tv/home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266.log' > ~/"GCC 8.5.0 versus AOCC 3.2.0 comparison.diff.txt"
GCC 8.5.0 versus AOCC 3.2.0 comparison.diff.txt
The AOCC-generated encoder was 24% faster (178m35s versus 144m1s wall-clock time)!?
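For reference, the figure can be reproduced from the two `real` values printed by `time` above. This is only a sketch; the function names `to_seconds` and `speedup` are made up for illustration, and the decimal comma is handled because the logs were produced under a locale that uses it.

```shell
# to_seconds TIME
# Converts a bash `time` value such as "178m35,012s" to seconds.
to_seconds() {
    local t=${1/,/.}      # decimal comma -> decimal point
    local min=${t%%m*}    # part before the "m"
    local sec=${t#*m}     # part after the "m", still ending in "s"
    sec=${sec%s}
    LC_ALL=C awk -v m="$min" -v s="$sec" 'BEGIN { printf "%.3f\n", m * 60 + s }'
}

# speedup OLD NEW
# Prints how much faster NEW is than OLD, expressed two ways.
speedup() {
    local old new
    old=$(to_seconds "$1")
    new=$(to_seconds "$2")
    LC_ALL=C awk -v o="$old" -v n="$new" 'BEGIN {
        printf "throughput gain: %.1f%%\n", (o / n - 1) * 100
        printf "time reduction:  %.1f%%\n", (1 - n / o) * 100
    }'
}

# GCC 8.5.0 versus AOCC 3.2.0, both with --SIMD=SCALAR
speedup "178m35,012s" "144m1,126s"
```

The two percentages answer different questions: the "24% faster" figure is the throughput gain (old time divided by new time), while the wall-clock time itself shrinks by roughly 19%.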
Thanks for checking conformance. Without SIMD the results seem to be the same, but the encoder does contain some floating-point SIMD code, so the difference is potentially uncritical. I'll try to track it down sometime, but it is probably a floating-point operation influencing an encoding decision.
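To illustrate the kind of instability meant here: floating-point addition is not associative, so a vectorized loop that sums values in a different order than the scalar loop can land on a slightly different result. The sketch below is a generic demonstration using awk's double-precision arithmetic, not code from VVenC; the function name `fp_order_demo` and the values are chosen purely for illustration.

```shell
# fp_order_demo
# Adds the same three numbers in two different groupings and prints both sums.
fp_order_demo() {
    LC_ALL=C awk 'BEGIN {
        a = 1e16; b = -1e16; c = 1
        left  = (a + b) + c    # a and b cancel first, then c is added
        right = a + (b + c)    # c is absorbed by rounding before a is added
        printf "left-to-right: %.1f\n", left
        printf "regrouped:     %.1f\n", right
        print (left == right ? "same result" : "results differ")
    }'
}

fp_order_demo
```

If such a sum feeds a cost comparison, a difference in the last bits can be enough to pick another candidate, after which the bitstreams diverge while both remain valid. Note also that the AOCC build in this thread uses `-Ofast`, which in Clang enables `-ffast-math` and thereby allows the compiler to regroup floating-point operations on its own; whether that plays a role here is not established.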
The speed-up really is impressive. Thanks for sharing! I'm actually surprised because with the amount of manual optimization we did, I didn't think an architecture optimizing compiler would matter so much. It'd be interesting to see the profiling to get an idea where AOCC was able to optimize so much (e.g. with ENABLE_TIME_PROFILING). I might have a look sometime.
We cannot really act on it though.
If you want to simplify your build process to utilize this, you can specify the target arch directly in the make-cmd as:
$ make clean
$ make release ... enable-arch=znver3
AVX-512 is not utilized yet. I will try PGO this week and share the measured results.
AVX2 brings at most 10% over SSE4.2, so I wouldn't get my hopes up for AVX-512.
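Before experimenting with wider vectors it may be worth confirming which extensions the host CPU reports at all. A small sketch for Linux follows; the helper name `has_cpu_flag` is hypothetical, and `sse4_2`, `avx2` and `avx512f` are the flag names as they appear in `/proc/cpuinfo`.

```shell
# has_cpu_flag FLAG [CPUINFO_FILE]
# Succeeds if FLAG is listed on the first "flags" line of CPUINFO_FILE
# (default: /proc/cpuinfo). Matching is exact, so "avx" does not match "avx2".
has_cpu_flag() {
    local flag=$1 cpuinfo=${2:-/proc/cpuinfo}
    grep -m1 '^flags' "$cpuinfo" 2>/dev/null \
        | tr -s '[:space:]' '\n' \
        | grep -qx -- "$flag"
}

for f in sse4_2 avx2 avx512f; do
    if has_cpu_flag "$f"; then echo "$f: yes"; else echo "$f: no"; fi
done
```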
If you find a way to automate PGO as a part of our CMake build process, feel free to make a pull request. Looking forward to the results.
It is not easy.
ccmake ../../../../..
CCACHE_FOUND /usr/bin/ccache
CMAKE_ADDR2LINE /usr/bin/addr2line
CMAKE_AR /usr/bin/ar
CMAKE_BUILD_TYPE Debug
CMAKE_COLOR_MAKEFILE ON
CMAKE_CXX_COMPILER /opt/AMD/aocc-compiler-3.2.0/bin/clang++
CMAKE_CXX_COMPILER_AR /opt/AMD/aocc-compiler-3.2.0/bin/llvm-ar
CMAKE_CXX_COMPILER_RANLIB /opt/AMD/aocc-compiler-3.2.0/bin/llvm-ranlib
CMAKE_CXX_FLAGS -march=znver3 -flto -Ofast -mllvm -enable-strided-vectorization
CMAKE_CXX_FLAGS_DEBUG -g -fprofile-instr-generate
CMAKE_CXX_FLAGS_MINSIZEREL -Os -DNDEBUG
CMAKE_CXX_FLAGS_PROFILE -O0 -fprofile-instr-generate
CMAKE_CXX_FLAGS_RELEASE -O3 -DNDEBUG -fprofile-instr-use
CMAKE_CXX_FLAGS_RELWITHDEBINFO -O2 -g -DNDEBUG
time make --jobs 8
real 9m1,768s
user 36m5,663s
sys 0m55,690s
/opt/AMD/aocc-compiler-3.2.0/bin/llvm-profdata merge -output=default.profdata default.profraw
CMAKE_CXX_FLAGS_RELEASE -O3 -fprofile-instr-use=/usr/src/github.com/1div0/vvenc/Linux/x86-64/EPYC/AOCC/3.2.0/default.profdata
[peter.kovar@vmi728485 3.2.0]$ VVenC.sh
+ COMPILER=AOCC
+ VERSION=3.2.0
+ CONFIGURATION=AOCC/3.2.0
+ ENCODER=/usr/local/AOCC/3.2.0/bin/vvencFFapp
+ OUTPUT_PATH=/home/peter.kovar/Video/VVC/AOCC/3.2.0
+ mkdir -p /home/peter.kovar/Video/VVC/AOCC/3.2.0
+ HORIZONTAL=1920
+ VERTICAL=1080
+ SIZE=1920x1080
+ RATE=24
+ NAME=Kimono1_1920x1080_24
+ for QP in 32
+ INPUT=/home/peter.kovar/Video/YUV/Kimono1_1920x1080_24.yuv
+ OUTPUT=/home/peter.kovar/Video/VVC/AOCC/3.2.0/Kimono1_1920x1080_24.QP32.266
+ LOG=/home/peter.kovar/Video/VVC/AOCC/3.2.0/Kimono1_1920x1080_24.QP32.266.log
+ nice /usr/local/AOCC/3.2.0/bin/vvencFFapp --InputFile /home/peter.kovar/Video/YUV/Kimono1_1920x1080_24.yuv --Size 1920x1080 --framerate 24 --InputBitDepth 8 --QP 32 --Threads 8 --BitstreamFile /home/peter.kovar/Video/VVC/AOCC/3.2.0/Kimono1_1920x1080_24.QP32.266
real 405m3,083s
user 2877m46,438s
sys 1m20,165s
/opt/AMD/aocc-compiler-3.2.0/bin/llvm-profdata merge -output=default.profdata default.profraw
[peter.kovar@vmi728485 3.2.0]$ file default.prof*
default.profdata: LLVM indexed profile data, version 7
default.profraw: LLVM raw profile data, version 7
[peter.kovar@vmi728485 3.2.0]$ time make --jobs 8
[ 1%] Building CXX object source/Lib/apputils/CMakeFiles/apputils.dir/ParseArg.cpp.o
[ 1%] Building CXX object source/Lib/apputils/CMakeFiles/apputils.dir/YuvFileIO.cpp.o
[ 2%] Building CXX object source/Lib/apputils/CMakeFiles/apputils.dir/VVEncAppCfg.cpp.o
[ 3%] Linking CXX static library ../../../../../../../../lib/release-static/libapputils.a
[ 3%] Built target apputils
[ 3%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/AdaptiveLoopFilter.cpp.o
[ 4%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/AffineGradientSearch.cpp.o
[ 5%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/BitStream.cpp.o
[ 5%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/CodingStructure.cpp.o
[ 7%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/ContextModelling.cpp.o
[ 7%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Buffer.cpp.o
[ 8%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Contexts.cpp.o
[ 9%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/DepQuant.cpp.o
[ 9%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/InterPrediction.cpp.o
[ 10%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/InterpolationFilter.cpp.o
[ 11%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/IntraPrediction.cpp.o
[ 12%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/LoopFilter.cpp.o
[ 12%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/MCTF.cpp.o
[ 13%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/MatrixIntraPrediction.cpp.o
[ 14%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Mv.cpp.o
[ 15%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/PicYuvMD5.cpp.o
[ 15%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Picture.cpp.o
[ 16%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/ProfileLevelTier.cpp.o
[ 17%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Quant.cpp.o
[ 18%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/QuantRDOQ.cpp.o
[ 18%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/QuantRDOQ2.cpp.o
[ 19%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/RdCost.cpp.o
[ 20%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Reshape.cpp.o
[ 21%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Rom.cpp.o
[ 21%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/RomTr.cpp.o
[ 22%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/SEI.cpp.o
[ 23%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/SampleAdaptiveOffset.cpp.o
[ 24%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/SearchSpaceCounter.cpp.o
[ 24%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Slice.cpp.o
[ 25%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/StatCounter.cpp.o
[ 26%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/TimeProfiler.cpp.o
[ 27%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/TrQuant.cpp.o
[ 27%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/TrQuant_EMT.cpp.o
[ 28%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Unit.cpp.o
[ 29%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/UnitPartitioner.cpp.o
error: no profile data available for file "StatCounter.cpp" [-Werror,-Wprofile-instr-unprofiled]
1 error generated.
make[2]: *** [source/Lib/vvenc/CMakeFiles/vvenc.dir/build.make:482: source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/StatCounter.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:188: source/Lib/vvenc/CMakeFiles/vvenc.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
real 0m50,479s
user 3m9,502s
sys 0m22,095s
@adamjw24 Am I doing something wrong here?
Hmm... from the log files I understand that to do a profile-based build, you need profiling info for every object? This will not be possible with vvenc for the following reason:
Oh, wait, I just had a second look, and found the following:
[ 29%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/UnitPartitioner.cpp.o
error: no profile data available for file "StatCounter.cpp" [-Werror,-Wprofile-instr-unprofiled]
1 error generated.
make[2]: *** [source/Lib/vvenc/CMakeFiles/vvenc.dir/build.make:482: source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/StatCounter.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
It looks like you should add the following flag to the build: -Wnoprofile-instr-unprofiled (or whatever the syntax is to disable -Wprofile-instr-unprofiled)
Thank you.
-Wno-profile-instr-unprofiled
And compiler frontend just exploded.
Going to report that to the AMD. EncCu-daf955.cpp.txt EncCu-daf955.sh.txt
You're welcome.
You might try with an older clang version. We have had some issues with bleeding-edge compilers a few times already.
Closed accidentally. I misread the issue number to close.
AOCC 3.1.0 based on LLVM 12.0.0 just compiled OK.
//Flags used by the CXX compiler during RELEASE builds. CMAKE_CXX_FLAGS_RELEASE:STRING=-Ofast -flto -mllvm -enable-strided-vectorization -Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date -fprofile-instr-use=/usr/src/github.com/1div0/vvenc/Linux/x86-64/EPYC/AOCC/3.1.0/default.profdata
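Putting the steps from this thread together, the whole PGO cycle could be scripted roughly as below. This is a sketch under assumptions, not an official VVenC build recipe: the function names `run` and `pgo_build`, the directory layout and the example paths are hypothetical, while the compiler flags and the `llvm-profdata merge` call are the ones used above. By default the script only prints the commands (`DRY_RUN=1`).

```shell
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

# pgo_build SRC_DIR WORK_DIR TRAINING_CMD...
# TRAINING_CMD is the instrumented encoder invocation used to collect the profile.
pgo_build() {
    local src=$1 work=$2
    shift 2
    local cxx=${CXX:-clang++}
    local profraw=$work/default.profraw
    local profdata=$work/default.profdata

    # 1. Instrumented build.
    run cmake -S "$src" -B "$work/gen" -DCMAKE_BUILD_TYPE=Release \
        "-DCMAKE_CXX_COMPILER=$cxx" \
        "-DCMAKE_CXX_FLAGS=-fprofile-instr-generate"
    run cmake --build "$work/gen" --parallel 8

    # 2. Training run; LLVM_PROFILE_FILE selects where the raw profile is written.
    run env "LLVM_PROFILE_FILE=$profraw" "$@"

    # 3. Convert the raw profile into the indexed format the compiler reads.
    run llvm-profdata merge "-output=$profdata" "$profraw"

    # 4. Optimized rebuild; the two -Wno- flags keep unprofiled or stale
    #    files from failing a -Werror build.
    run cmake -S "$src" -B "$work/use" -DCMAKE_BUILD_TYPE=Release \
        "-DCMAKE_CXX_COMPILER=$cxx" \
        "-DCMAKE_CXX_FLAGS=-fprofile-instr-use=$profdata -Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date"
    run cmake --build "$work/use" --parallel 8
}

# Dry run with made-up paths:
pgo_build /path/to/vvenc /tmp/vvenc-pgo ./vvencFFapp --InputFile clip.yuv --Size 1920x1080 --QP 32
```

One difference from the run above is that the sketch instruments a Release build, whereas the training run in this thread used the Debug configuration. A short training clip may also be enough, since the profile mainly needs to exercise the hot paths.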
Recently, I compiled Clang from LLVM 15 and discovered this:
`/usr/src/github.com/1div0/vvenc/source/Lib/EncoderLib/IntraSearch.cpp:2509:27: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]`
`currTU.jointCbCr = (TU::getCbf(currTU, COMP_Cb) | TU::getCbf(currTU, COMP_Cr)) ? bestJointCbCr : 0;`
`/usr/src/github.com/1div0/vvenc/source/Lib/EncoderLib/IntraSearch.cpp:2509:27: note: cast one or both operands to int to silence this warning`
I think clang v15 is way too new to take its warnings seriously. We have had a lot of problems with early compiler versions, and I'd rather wait for them to mature a bit.
Might pull in the PR tho, since it seems logical (no pun intended).
So far no word from AMD about the compiler crash. However, these results confirm my observation. https://www.phoronix.com/scan.php?page=article&item=amd-aocc-milanx&num=4
Closing for now, as not really actionable. This is more informative.
GCC flags: -flto -O3 (log: 1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP22.266.log)
AOCC flags: -march=znver3 -flto -Ofast -mllvm -enable-strided-vectorization (log: 1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP22.266.log)
In other words, AOCC produced 21% faster code than GCC.