Closed nuomi2021 closed 1 year ago
@xufuji456 could you help with this? thank you
OK, I will try to do it.
Thank you @xufuji456
I ran the command line and got an "invalid data" error. I have just upgraded FFmpeg. I'm not sure whether the problem is with the video or with the command line. One question: by "fps", do you mean the decoding frame rate or something else? Another question: should I run on a Mac or on a mobile phone with an ARM chip?
> I ran the command line and got an "invalid data" error. I have just upgraded FFmpeg.
You did not enable the VVC decoder and demuxer, and you did not build the right FFmpeg. You can refer to https://github.com/ffvvc/FFmpeg/blob/main/.github/workflows/makefile.yml#L28 for the repo address and the configure command line.
> I'm not sure whether the problem is with the video or with the command line. One question: by "fps", do you mean the decoding frame rate or something else?
ffmpeg reports fps in its status line, like this:
frame= 600 fps= 66 q=-0.0 Lsize= 3645000kB time=00:00:23.96 bitrate=1246237.1kbits/s speed=2.63x
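For filling in a results table, the fps value can be pulled out of that status line with a small script. This is only a sketch: the helper name `parse_status` and the regular expression are assumptions made for illustration, written against the sample line quoted above, and other ffmpeg versions may format the line differently (for example, speed can be reported as N/A, in which case this returns None).

```python
import re

# Matches the frame, fps and speed fields of an ffmpeg status line.
# Assumption: the fields appear in the order shown in the sample line.
STATUS_RE = re.compile(r"frame=\s*(\d+)\s+fps=\s*([\d.]+).*?speed=\s*([\d.]+)x")

SAMPLE = ("frame= 600 fps= 66 q=-0.0 Lsize= 3645000kB "
          "time=00:00:23.96 bitrate=1246237.1kbits/s speed=2.63x")


def parse_status(line):
    """Return frame count, fps and speed from a status line, or None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return {
        "frame": int(match.group(1)),
        "fps": float(match.group(2)),
        "speed": float(match.group(3)),
    }


if __name__ == "__main__":
    print(parse_status(SAMPLE))
```

The last status line printed before ffmpeg exits is the one that reflects the whole run.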
> Another question: should I run on a Mac or on a mobile phone with an ARM chip?
I prefer a mobile phone, because it will enable mobile usage of ffmpeg VVC. But anything you have at hand is OK. We will continue to benchmark it.
thank you
Hello, Nuo Mi, here are some conclusions:
This is still in progress; I will find out the top 10 functions.
The OnePlus 7 is a 3-year-old product. We can get 32 fps for some 1080p clips. That's not bad. To save your time, please check 4K on the A76 only. Thank you.
Here is the time cost of the top 10 functions (unit: microseconds). It seems that the deblock/SAO filters and MV are time-consuming. Surprisingly, DCT and DST are not very time-consuming.
Maybe something is wrong.
DCT is not the major time-consuming part. Here is the current x86 performance data (only ALF is optimized with AVX2) for the 4K video. DCT is less than 6.5%.

11.96% ffmpeg_g [.] put_vvc_luma_hv_10
5.88% ffmpeg_g [.] alf_get_coeff_and_clip_10
5.25% ffmpeg_g [.] ff_vvc_inv_dct2_64
4.30% [kernel] [k] lock_text_start
4.22% ffmpeg_g [.] ff_vvc_alf_filter_luma_w16_16bpc_avx2
3.46% ffmpeg_g [.] put_vvc_luma_bi_hv_10
3.45% ffmpeg_g [.] alf_filter_luma_vb_10
3.13% ffmpeg_g [.] vvc_loop_filter_luma_10
2.81% ffmpeg_g [.] lmcs_filter_luma_10
2.46% ffmpeg_g [.] put_vvc_luma_uni_hv_10
2.27% ffmpeg_g [.] put_vvc_chroma_hv_10
2.21% libc-2.31.so [.] 0x000000000018b733
2.05% libc-2.31.so [.] 0x000000000018bb41
1.95% ffmpeg_g [.] put_vvc_chroma_uni_hv_10
1.84% ffmpeg_g [.] put_vvc_chroma_bi_hv_10
1.81% ffmpeg_g [.] vvc_deblock_bs
1.41% ffmpeg_g [.] ff_vvc_predict_inter
1.25% libpthread-2.31.so [.] pthread_mutex_lock
1.24% libpthread-2.31.so [.] __pthread_mutex_unlock
1.22% ffmpeg_g [.] ff_vvc_residual_coding
1.08% ffmpeg_g [.] alf_filter_cc_10
1.03% ffmpeg_g [.] apply_prof_uni_10
0.99% ffmpeg_g [.] ff_vvc_alf_filter
0.98% ffmpeg_g [.] ff_vvc_inv_dct2_32
0.94% ffmpeg_g [.] vvc_deblock_bs_luma_vertical
0.92% ffmpeg_g [.] add_residual_10
I divided the time of a single call to each function by the total frame time. How do you calculate the percentage? In fact, each function runs many times per frame. Maybe you divide the accumulated time of all calls to a function by the frame time.
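The difference between the two ways of computing the percentage can be sketched as follows. All numbers here are made up for illustration (a 5 µs call, 2000 calls per frame, a 31250 µs frame at 32 fps) and are not measurements from this thread; the function names are likewise hypothetical.

```python
def single_call_share(call_time_us, frame_time_us):
    """Share of the frame time taken by ONE call to a function."""
    return call_time_us / frame_time_us


def accumulated_share(call_time_us, calls_per_frame, frame_time_us):
    """Share of the frame time taken by ALL calls to a function in a frame."""
    return call_time_us * calls_per_frame / frame_time_us


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the gap between the two metrics.
    call_time_us = 5.0
    calls_per_frame = 2000
    frame_time_us = 31250.0  # one frame at 32 fps

    print(f"single call: {single_call_share(call_time_us, frame_time_us):.4%}")
    print(f"accumulated: "
          f"{accumulated_share(call_time_us, calls_per_frame, frame_time_us):.4%}")
```

A sampling profiler such as perf reports something close to the accumulated share, which is why a small function that is called very often can still rank near the top.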
Please use "perf top". See https://www.google.com/search?q=arm+perf+top&oq=arm+perf+top and https://www.brendangregg.com/perf.html
Here are the top 20 functions. Some functions may be missing. I used SimplePerf, which is provided by the Android NDK. Besides the VVC decoder functions, memcpy() and memset() are time-consuming too.
Thank you. But it seems we are still missing some functions. The top 8 functions account for only 16% of the usage.
Maybe some functions are not symbolized and just show up as libffmpeg.so.
--duration 10 may be too short. Could you run the following command line on your desktop?
I ran the command above, but some symbols are still missing.
Thank you for trying. Are you building with the NDK? Could you share your build steps with me? Let me check what happened.
make clean
set -e
archbit=64

echo "build for 64bit"
ARCH=aarch64
CPU=armv8-a
API=21
PLATFORM=aarch64
ANDROID=android
OPTIMIZE_CFLAGS="-march=$CPU -mfpu=neon -mfloat-abi=softfp"
ABI='arm64-v8a'

export NDK=/Users/xufulong/Library/Android/sdk/ndk-bundle
export PREBUILT=$NDK/toolchains/llvm/prebuilt/darwin-x86_64
export TOOLCHAIN=$PREBUILT/bin
export SYSROOT=$PREBUILT/sysroot
export CROSS_PREFIX=$TOOLCHAIN/$ARCH-linux-$ANDROID-
export CC=$TOOLCHAIN/$PLATFORM-linux-$ANDROID$API-clang
export CXX=$TOOLCHAIN/$PLATFORM-linux-$ANDROID$API-clang++
export AR=$TOOLCHAIN/$ARCH-linux-$ANDROID-ar
export LD=$TOOLCHAIN/$ARCH-linux-$ANDROID-ld
export NM=$TOOLCHAIN/$ARCH-linux-$ANDROID-nm
export RANLIB=$TOOLCHAIN/$ARCH-linux-$ANDROID-ranlib
export STRIP=$TOOLCHAIN/$ARCH-linux-$ANDROID-strip
export PREFIX=ffmpeg-android/$ABI
export ADDITIONAL_CONFIGURE_FLAG="--cpu=$CPU"

LIB_DIR=$PREFIX
export CFLAGS="-O0 -fPIC $OPTIMIZE_CFLAGS -I$LIB_DIR/include"
export LDFLAGS="-Wl,--build-id -lc -lm -ldl -llog -lz -L$LIB_DIR/lib"

function build_android {
./configure \
--prefix=$PREFIX \
--cross-prefix=$CROSS_PREFIX \
--target-os=android \
--arch=$ARCH \
--cpu=$CPU \
--cc=$CC \
--cxx=$CXX \
--ar=$AR \
--ranlib=$RANLIB \
--nm=$TOOLCHAIN/$ARCH-linux-$ANDROID-nm \
--strip=$TOOLCHAIN/$ARCH-linux-$ANDROID-strip \
--enable-cross-compile \
--sysroot=$SYSROOT \
--extra-cflags="$CFLAGS" \
--extra-ldflags="$LDFLAGS" \
--extra-ldexeflags=-pie \
--enable-static \
--disable-shared \
--disable-avdevice \
--disable-asm \
--enable-ffmpeg \
--disable-everything \
--enable-decoder=vvc \
--enable-parser=vvc \
--enable-demuxer=vvc \
--enable-protocol=file,pipe \
--enable-encoder=rawvideo \
--enable-muxer=rawvideo,md5 \
--disable-small \
$ADDITIONAL_CONFIGURE_FLAG
make -j 8
make install

$TOOLCHAIN/$ARCH-linux-$ANDROID-ld -rpath-link=$SYSROOT/usr/lib/$ARCH-linux-$ANDROID/$API -L$SYSROOT/usr/lib/$ARCH-linux-$ANDROID/$API \
-L$PREFIX/lib -soname libffmpeg.so \
-shared -nostdlib --whole-archive --no-undefined -o $PREFIX/libffmpeg.so \
$PREFIX/lib/libavcodec.a \
$PREFIX/lib/libavfilter.a \
$PREFIX/lib/libavformat.a \
$PREFIX/lib/libswresample.a \
$PREFIX/lib/libswscale.a \
$PREFIX/lib/libavutil.a \
-lc -lm -lz -ldl -llog --dynamic-linker=/system/bin/linker $PREBUILT/lib/gcc/$ARCH-linux-$ANDROID/4.9.x/libgcc_real.a
}
build_android
Yes, I'm building with the NDK. Here is the build script. In the end, all the modules are linked into libffmpeg.so.
Overhead Symbol
20.44% alf_filter_luma_10
6.66% put_vvc_luma_hv_10
5.89% alf_filter_chroma_10
4.99% ff_vvc_inv_dct2_64
4.53% alf_get_coeff_and_clip_10
4.07% lmcs_filter_luma_10
3.50% __memcpy
3.23% vvc_loop_filter_luma_10
2.51% put_vvc_luma_bi_hv_10
2.25% vvc_deblock_bs
2.09% ff_vvc_alf_filter
1.90% alf_filter_luma_vb_10
1.75% put_vvc_chroma_hv_10
1.55% add_residual_10
1.49% put_vvc_chroma_uni_hv_10
1.47% memset
1.34% put_vvc_luma_uni_hv_10
1.34% put_vvc_chroma_bi_hv_10
It's my mistake. There is no problem with the compiled FFmpeg library. The real reason is that the symbol table is stripped when Android Studio runs the build. I have now stopped Android Studio from stripping the symbol table. The top functions are listed above.
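To check how much of the run a symbol listing actually covers (the earlier listing covered only 16%), the overhead column can be summed with a short script. This is a sketch over the figures posted above; the helper names `parse_report`, `total_overhead` and `share_matching` are made up for illustration, and grouping functions by the substring "alf" is an assumption about naming, not something the profiler reports.

```python
# Overhead figures copied from the listing above.
PERF_REPORT = """\
20.44% alf_filter_luma_10
6.66% put_vvc_luma_hv_10
5.89% alf_filter_chroma_10
4.99% ff_vvc_inv_dct2_64
4.53% alf_get_coeff_and_clip_10
4.07% lmcs_filter_luma_10
3.50% __memcpy
3.23% vvc_loop_filter_luma_10
2.51% put_vvc_luma_bi_hv_10
2.25% vvc_deblock_bs
2.09% ff_vvc_alf_filter
1.90% alf_filter_luma_vb_10
1.75% put_vvc_chroma_hv_10
1.55% add_residual_10
1.49% put_vvc_chroma_uni_hv_10
1.47% memset
1.34% put_vvc_luma_uni_hv_10
1.34% put_vvc_chroma_bi_hv_10
"""


def parse_report(text):
    """Turn 'NN.NN% symbol' lines into a list of (symbol, percent) pairs."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].endswith("%"):
            rows.append((parts[1], float(parts[0].rstrip("%"))))
    return rows


def total_overhead(rows):
    """Sum of the overhead column: how much of the run the listing covers."""
    return sum(percent for _, percent in rows)


def share_matching(rows, needle):
    """Combined overhead of all symbols whose name contains `needle`."""
    return sum(percent for name, percent in rows if needle in name)


if __name__ == "__main__":
    rows = parse_report(PERF_REPORT)
    print(f"listed symbols cover {total_overhead(rows):.2f}% of samples")
    print(f"ALF-related symbols: {share_matching(rows, 'alf'):.2f}%")
```

Summing by group makes it easier to see which stage of the decoder is worth optimizing first.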
No worries. Thank you for the result. Now our results match. Could you help implement NEON/SVE code for alf_filter_luma_10? You can refer to
I will try my best to optimize alf_filter_luma_10(). By the way, is it possible to use Vulkan to optimize H.266/VVC? My question is whether Vulkan is limited by the hardware (CPU/GPU). Something like the work Lynne does in FFmpeg:
We are in a different domain. The Vulkan decoder is a wrapper around a hardware decoder, just like vaapi or nvdec. It uses a fixed-function hardware block to do the decoding. It is quite possibly just a wrapper around vaapi or nvdec. ffvvc is a software decoder; we only use a generic CPU to do the decoding. Once we port to Vulkan, we must do all data processing on the GPU, and the CPU can only do some control work. I believe it's possible, but it needs some time. Let us focus on the CPU side first. Then we can look at how to use Vulkan for a full GPU solution. Thank you.
ok, copy that
Closing this since we got the ARM performance. @xufuji456, thank you.
Please help run performance tests for the following clips on an ARM chip. Command line:
./ffmpeg -i xxx.266 -vsync 0 -f rawvideo /dev/null -y
and fill in the following table:
Thanks