Clybius / aom-av1-lavish

A fork of aom-av1-psy, which is a fork of aomenc. Designed to open up the encoder for hyper-tuning and fidelity.
BSD 2-Clause "Simplified" License

Hang (or extremely slow start) compared to aomenc 3.8.0 #6

Open Cygon opened 5 months ago

Cygon commented 5 months ago

I have built aom-av1-lavish from opmox/mainline-merge (commit 3b4594d81bed823c41ad95a195cd4b321aebdd07) and I'm now trying to compare it against vanilla aomenc 3.8.0.

However, while vanilla aomenc 3.8.0 starts processing and is well on its way a few minutes after entering pass 2, aom-av1-lavish just sits there. It reads the first lag-in-frames frames, then (for at least 6 hours now) makes no further progress.

This is how I launch both versions (for aomenc 3.8.0, set lag-in-frames to 48 and remove the tune options):

setsid \
        "$aomEncExecutable" \
                "$inputFile" \
                --width=$videoWidth \
                --height=$videoHeight \
                --input-bit-depth=10 \
                --fps=$videoFrameRateNum/$videoFrameRateDenom \
                --i420 \
                --use-16bit-internal \
                --bit-depth=10 \
                --usage=0 \
                --profile=0 \
                --cpu-used=0 \
                --passes=2 \
                --good \
                --end-usage=vbr \
                --target-bitrate=$targetBitRate \
                --kf-max-dist=$keyFrameInterval \
                --threads=32 \
                --row-mt=1 \
                --lag-in-frames=80 \
                --aq-mode=1 \
                --enable-qm=1 \
                --color-primaries=bt709 \
                --transfer-characteristics=bt709 \
                --matrix-coefficients=bt709 \
                --sharpness=2 \
                --arnr-strength=2 \
                --arnr-maxframes=15 \
                --disable-trellis-quant=0 \
                --enable-dnl-denoising=0 \
                --denoise-noise-level=3 \
                --tune=butteraugli \
                --tune-content=psy \
                --quant-b-adapt=1 \
                --webm \
                --output="$outputFile" \
        &> stdout.log \
        &
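The difference between the two launches described above can be sketched as a small bash helper. This is only an illustration: the function name variant_options and the labels "lavish" and "vanilla" are made up here, and it assumes that "the tune options" means --tune and --tune-content.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of either encoder): prints the options that
# differ between the two launches, one per line.
variant_options() {
    local variant="$1"
    case "$variant" in
        lavish)
            # aom-av1-lavish run: longer lookahead plus the tune options
            printf '%s\n' --lag-in-frames=80 --tune=butteraugli --tune-content=psy
            ;;
        vanilla)
            # aomenc 3.8.0 run: lag-in-frames 48, tune options removed
            printf '%s\n' --lag-in-frames=48
            ;;
        *)
            echo "unknown variant: $variant" >&2
            return 1
            ;;
    esac
}

# Collect the options into an array so they can be spliced into the command
# line as "${extraOptions[@]}" in place of the fixed flags.
mapfile -t extraOptions < <(variant_options lavish)
printf '%s\n' "${extraOptions[@]}"
```

With the array in place, the same launch script could serve both encoders by swapping only "$aomEncExecutable" and the variant label.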

I know running with cpu-used 0 is a bit bonkers, but I wanted to see what is possible at maximum settings.

But vanilla aomenc 3.8.0 estimates about 80 hours (< 4 days) of encoding time, about 2 frames per minute, whereas aom-av1-lavish has, after ~6 hours of waiting, not managed to process even one frame and thus shows no estimate.
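To put those figures in perspective, here is a rough back-of-the-envelope calculation in bash. It uses only the approximate numbers quoted above (80 hours, 2 frames per minute, 6 hours of waiting); the variable names are made up for the illustration and the results are estimates, not measurements.

```shell
#!/usr/bin/env bash
# Approximate figures taken from the post; all results are rough estimates.
hours_estimated=80     # vanilla aomenc's estimated total encoding time
frames_per_minute=2    # vanilla aomenc's approximate throughput
hours_waited=6         # time aom-av1-lavish ran without finishing a frame

# Clip length implied by the vanilla estimate
total_frames=$(( hours_estimated * 60 * frames_per_minute ))

# Frames vanilla aomenc would have finished during the same wait
frames_in_wait=$(( hours_waited * 60 * frames_per_minute ))

echo "implied clip length: ~${total_frames} frames"
echo "vanilla output in ${hours_waited} h: ~${frames_in_wait} frames"
```

So at the quoted rate, vanilla aomenc would have encoded roughly 720 frames in the time aom-av1-lavish produced none.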

So I suspect there is a problem when the above combination of parameters is used, unless the effect of the "butteraugli" tune and/or lag-in-frames at 80 is so drastic that it takes >6 hours to process a single frame.

Cygon commented 5 months ago

I did some further testing:

So it appears that the "butteraugli" tune is prohibitively slow.

I'm using my distro's libjxl 0.8.1.

My build command for aom-av1-lavish was

cmake -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_CCACHE=OFF -DENABLE_DOCS=OFF -DENABLE_EXAMPLES=ON -DENABLE_NASM=OFF -DENABLE_TESTS=no -DENABLE_TOOLS=ON -DENABLE_WERROR=OFF -DCONFIG_BIG_ENDIAN=0 -DCONFIG_TUNE_BUTTERAUGLI=1 -DENABLE_NEON=OFF -DENABLE_ARM_CRC32=OFF -DENABLE_NEON_DOTPROD=OFF -DENABLE_NEON_I8MM=OFF -DENABLE_SVE=OFF -DENABLE_MMX=ON -DENABLE_SSE=ON -DENABLE_SSE2=ON -DENABLE_SSE3=ON -DENABLE_SSSE3=ON -DENABLE_SSE4_1=ON -DENABLE_SSE4_2=ON -DENABLE_AVX=ON -DENABLE_AVX2=ON -DENABLE_VSX=OFF -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCONFIG_TUNE_BUTTERAUGLI=1 -DCMAKE_C_FLAGS="-march=native -Ofast -pipe -fomit-frame-pointer -g0 -fgraphite-identity -fno-common -flto=12 -fmerge-all-constants -falign-functions=32 -fno-stack-protector -floop-strip-mine -floop-block -ftree-vectorize -floop-interchange -floop-nest-optimize -floop-parallelize-all -fstack-check=no -fno-stack-check -fno-stack-clash-protection" -DCMAKE_C_FLAGS_INIT="-flto=12 -static" /opt/aom-av1-lavish-3b4594d81bed823c41ad95a195cd4b321aebdd07

A bit of GCC ricing, but it's a release build.

Is the enormous performance impact normal for the "butteraugli" tune? Can I do something to reduce it, such as a newer version of libjxl or different CMake compile flags?