Check ARM performance - GitHub issues

nuomi2021 commented 1 year ago

Please help run performance tests for the following clips on an ARM chip, using this command line:

./ffmpeg -i xxx.266 -vsync 0 -f rawvideo /dev/null -y

Then fill in the following tables:

clips	FPS(3 times average)
RitualDance_1920x1080_60_10_420_32_LD.26
RitualDance_1920x1080_60_10_420_37_RA.266
Tango2_3840x2160_60_10_420_27_LD.266

top 10 functions for Tango2_3840x2160_60_10_420_27_LD.266	functions	percent
xxx	xx%

Thanks

nuomi2021 commented 1 year ago

@xufuji456 could you help with this? thank you

xufuji456 commented 1 year ago

OK, I will try to do it.

nuomi2021 commented 1 year ago

Thank you @xufuji456

xufuji456 commented 1 year ago

I ran the command line, and it failed with an "invalid data" error. I upgraded FFmpeg just now. I'm not sure whether the problem is with the video or the command line. One question: by "fps", do you mean the decoding frame rate or something else? Another question: should I run on a Mac or on a mobile phone with an ARM chip?

nuomi2021 commented 1 year ago

> I ran the command line, and it failed with an "invalid data" error. I upgraded FFmpeg just now.

You did not enable the VVC decoder and demuxer, and you did not build the right ffmpeg. You can refer to https://github.com/ffvvc/FFmpeg/blob/main/.github/workflows/makefile.yml#L28 for the repo address and the configure command line.

> I'm not sure whether the problem is with the video or the command line. One question: by "fps", do you mean the decoding frame rate or something else?

ffmpeg reports fps in its progress line, like this:

frame= 600 fps= 66 q=-0.0 Lsize= 3645000kB time=00:00:23.96 bitrate=1246237.1kbits/s speed=2.63x
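For filling in the table, the fps value can be pulled out of that progress line automatically. This is a minimal sketch, assuming the ffmpeg log of each run has been captured as text; the helper names `last_reported_fps` and `average_fps` are hypothetical and not part of ffmpeg:

```python
import re
from statistics import mean

# Matches the "fps=" field of an ffmpeg progress line, e.g. "fps= 66" or "fps=66.5".
_FPS_FIELD = re.compile(r"fps=\s*([0-9]+(?:\.[0-9]+)?)")


def last_reported_fps(log_text):
    """Return the last fps value found in the captured log, or None if absent."""
    matches = _FPS_FIELD.findall(log_text)
    return float(matches[-1]) if matches else None


def average_fps(logs):
    """Average the final fps over several runs (hypothetical helper)."""
    values = [last_reported_fps(log) for log in logs]
    if any(value is None for value in values):
        raise ValueError("at least one log contains no fps field")
    return mean(values)


# The progress line quoted in this thread.
sample = ("frame= 600 fps= 66 q=-0.0 Lsize= 3645000kB "
          "time=00:00:23.96 bitrate=1246237.1kbits/s speed=2.63x")
print(last_reported_fps(sample))  # 66.0
```

Passing the three captured logs of one clip to `average_fps` gives the "3 times average" value asked for in the table.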

> Another question: should I run on a Mac or on a mobile phone with an ARM chip?

I prefer a mobile phone, because it will enable mobile usage for ffmpeg VVC. But anything you have on hand is OK. We will continue to benchmark it.

thank you

xufuji456 commented 1 year ago

Hello, Nuo Mi, here are some conclusions:

performance by CPU core: A76 > A55 > A53
decoding 4K video is very slow, less than 10 frames per second
decoding speed: 8-bit is faster than 10-bit

This is still in progress. I will find the top 10 functions next.

nuomi2021 commented 1 year ago

The OnePlus 7 is a 3-year-old product. We can get 32 fps for some 1080p clips, which is not bad. To save your time, please check 4K on the A76 only. Thank you.

xufuji456 commented 1 year ago

Here is the time cost of the top 10 functions (unit: microseconds). It seems that the deblocking/SAO filters and MV are time-consuming. Surprisingly, DCT and DST are not very time-consuming.

nuomi2021 commented 1 year ago

Maybe something is wrong.

The percentages are too low.
Based on https://github.com/ffvvc/FFmpeg/issues/13, ALF should be the top 1 function.

nuomi2021 commented 1 year ago

DCT is not the major time-consuming part. Here is the current x86 performance data (only ALF is optimized with AVX2) for the 4K video. DCT takes less than 6.5%.

11.96% ffmpeg_g [.] put_vvc_luma_hv_10
5.88% ffmpeg_g [.] alf_get_coeff_and_clip_10
5.25% ffmpeg_g [.] ff_vvc_inv_dct2_64
4.30% [kernel] [k] lock_text_start
4.22% ffmpeg_g [.] ff_vvc_alf_filter_luma_w16_16bpc_avx2
3.46% ffmpeg_g [.] put_vvc_luma_bi_hv_10
3.45% ffmpeg_g [.] alf_filter_luma_vb_10
3.13% ffmpeg_g [.] vvc_loop_filter_luma_10
2.81% ffmpeg_g [.] lmcs_filter_luma_10
2.46% ffmpeg_g [.] put_vvc_luma_uni_hv_10
2.27% ffmpeg_g [.] put_vvc_chroma_hv_10
2.21% libc-2.31.so [.] 0x000000000018b733
2.05% libc-2.31.so [.] 0x000000000018bb41
1.95% ffmpeg_g [.] put_vvc_chroma_uni_hv_10
1.84% ffmpeg_g [.] put_vvc_chroma_bi_hv_10
1.81% ffmpeg_g [.] vvc_deblock_bs
1.41% ffmpeg_g [.] ff_vvc_predict_inter
1.25% libpthread-2.31.so [.] pthread_mutex_lock
1.24% libpthread-2.31.so [.] __pthread_mutex_unlock
1.22% ffmpeg_g [.] ff_vvc_residual_coding
1.08% ffmpeg_g [.] alf_filter_cc_10
1.03% ffmpeg_g [.] apply_prof_uni_10
0.99% ffmpeg_g [.] ff_vvc_alf_filter
0.98% ffmpeg_g [.] ff_vvc_inv_dct2_32
0.94% ffmpeg_g [.] vvc_deblock_bs_luma_vertical
0.92% ffmpeg_g [.] add_residual_10
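The DCT figure can be checked by summing the rows whose symbol name contains "dct". This is a rough sketch, assuming each row has the form "overhead% object [type] symbol" as in the listing above; the function names `parse_perf_rows` and `share_matching` are hypothetical, and only four of the rows are repeated here:

```python
import re

# One perf report row: overhead, object (binary or library), [type marker], symbol.
_PERF_ROW = re.compile(r"^\s*([0-9.]+)%\s+(\S+)\s+\[(.)\]\s+(\S+)\s*$")


def parse_perf_rows(text):
    """Parse rows into (overhead_percent, object_name, symbol) tuples."""
    rows = []
    for line in text.splitlines():
        match = _PERF_ROW.match(line)
        if match:
            rows.append((float(match.group(1)), match.group(2), match.group(4)))
    return rows


def share_matching(rows, keyword):
    """Sum the overhead of every symbol whose name contains the keyword."""
    return sum(overhead for overhead, _, symbol in rows if keyword in symbol)


# A few rows copied from the x86 listing in this thread.
X86_SAMPLE = """\
11.96% ffmpeg_g [.] put_vvc_luma_hv_10
5.88% ffmpeg_g [.] alf_get_coeff_and_clip_10
5.25% ffmpeg_g [.] ff_vvc_inv_dct2_64
0.98% ffmpeg_g [.] ff_vvc_inv_dct2_32
"""

rows = parse_perf_rows(X86_SAMPLE)
dct_share = share_matching(rows, "dct")
print(f"{dct_share:.2f}%")  # 6.23%
```

The two DCT rows add up to 6.23%, which is consistent with the "less than 6.5%" statement.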

xufuji456 commented 1 year ago

I divided the time of a single call of each function by the total time of one frame. How do you calculate the percentage? In fact, each function runs multiple times per frame. Maybe you divide the accumulated time of all calls of a function by the frame time.
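The two ways of computing the percentage can be compared with a small calculation. This is only a sketch with made-up numbers: 50 microseconds per call, 400 calls per frame, and 100 ms per frame are assumptions, not measurements from this thread, and `share_of_frame` is a hypothetical helper:

```python
def share_of_frame(call_time_us, calls_per_frame, frame_time_us):
    """Percentage of one frame's decode time spent in a function."""
    return 100.0 * call_time_us * calls_per_frame / frame_time_us


# Hypothetical numbers, for illustration only.
CALL_TIME_US = 50.0
CALLS_PER_FRAME = 400
FRAME_TIME_US = 100_000.0

# Counting a single call makes the function look negligible.
single_call = share_of_frame(CALL_TIME_US, 1, FRAME_TIME_US)

# Counting every call in the frame gives the share a profiler would report.
accumulated = share_of_frame(CALL_TIME_US, CALLS_PER_FRAME, FRAME_TIME_US)

print(f"single call: {single_call:.2f}%")   # single call: 0.05%
print(f"accumulated: {accumulated:.2f}%")   # accumulated: 20.00%
```

A sampling profiler reports the accumulated share, which is why the single-call percentages look far too low.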

nuomi2021 commented 1 year ago

Please use "perf top". See https://www.google.com/search?q=arm+perf+top&oq=arm+perf+top and https://www.brendangregg.com/perf.html

xufuji456 commented 1 year ago

func_time

Here are the top 20 functions. Some functions may be missing. I used Simpleperf, which is provided by the Android NDK. Besides the VVC decoder functions, memcpy() and memset() are time-consuming too.

nuomi2021 commented 1 year ago

Thank you. But it seems some functions are still missing. The top 8 functions account for only 16% of the usage.

xufuji456 commented 1 year ago

Maybe some functions are not symbolized and only show up as libffmpeg.so.

nuomi2021 commented 1 year ago

--duration 10 may be too short. Could you run the following command lines from your desktop?

adb shell perf record -F 99 ./ffmpeg_g -i Tango2_3840x2160_60_10_420_27_LD.266 -vsync 0 -f rawvideo /dev/null -y
adb shell perf report

xufuji456 commented 1 year ago

I ran the commands above, but some symbols are still missing.

nuomi2021 commented 1 year ago

Thank you for trying. Are you building with the NDK? Could you share your build steps with me? Let me check what happened.

xufuji456 commented 1 year ago

make clean
set -e
archbit=64

echo "build for 64bit"
ARCH=aarch64
CPU=armv8-a
API=21
PLATFORM=aarch64
ANDROID=android
OPTIMIZE_CFLAGS="-march=$CPU -mfpu=neon -mfloat-abi=softfp"
ABI='arm64-v8a'

export NDK=/Users/xufulong/Library/Android/sdk/ndk-bundle
export PREBUILT=$NDK/toolchains/llvm/prebuilt/darwin-x86_64
export TOOLCHAIN=$PREBUILT/bin
export SYSROOT=$PREBUILT/sysroot
export CROSS_PREFIX=$TOOLCHAIN/$ARCH-linux-$ANDROID-
export CC=$TOOLCHAIN/$PLATFORM-linux-$ANDROID$API-clang
export CXX=$TOOLCHAIN/$PLATFORM-linux-$ANDROID$API-clang++
export AR=$TOOLCHAIN/$ARCH-linux-$ANDROID-ar
export LD=$TOOLCHAIN/$ARCH-linux-$ANDROID-ld
export NM=$TOOLCHAIN/$ARCH-linux-$ANDROID-nm
export RANLIB=$TOOLCHAIN/$ARCH-linux-$ANDROID-ranlib
export STRIP=$TOOLCHAIN/$ARCH-linux-$ANDROID-strip
export PREFIX=ffmpeg-android/$ABI
export ADDITIONAL_CONFIGURE_FLAG="--cpu=$CPU"

LIB_DIR=$PREFIX
export CFLAGS="-O0 -fPIC $OPTIMIZE_CFLAGS -I$LIB_DIR/include"
export LDFLAGS="-Wl,--build-id -lc -lm -ldl -llog -lz -L$LIB_DIR/lib"

function build_android {
    ./configure \
    --prefix=$PREFIX \
    --cross-prefix=$CROSS_PREFIX \
    --target-os=android \
    --arch=$ARCH \
    --cpu=$CPU \
    --cc=$CC \
    --cxx=$CXX \
    --ar=$AR \
    --ranlib=$RANLIB \
    --nm=$TOOLCHAIN/$ARCH-linux-$ANDROID-nm \
    --strip=$TOOLCHAIN/$ARCH-linux-$ANDROID-strip \
    --enable-cross-compile \
    --sysroot=$SYSROOT \
    --extra-cflags="$CFLAGS" \
    --extra-ldflags="$LDFLAGS" \
    --extra-ldexeflags=-pie \
    --enable-static \
    --disable-shared \
    --disable-avdevice \
    --disable-asm \
    --enable-ffmpeg \
    --disable-everything \
    --enable-decoder=vvc \
    --enable-parser=vvc \
    --enable-demuxer=vvc \
    --enable-protocol=file,pipe \
    --enable-encoder=rawvideo \
    --enable-muxer=rawvideo,md5 \
    --disable-small \
    $ADDITIONAL_CONFIGURE_FLAG
    make -j 8
    make install

    $TOOLCHAIN/$ARCH-linux-$ANDROID-ld -rpath-link=$SYSROOT/usr/lib/$ARCH-linux-$ANDROID/$API -L$SYSROOT/usr/lib/$ARCH-linux-$ANDROID/$API \
    -L$PREFIX/lib -soname libffmpeg.so \
    -shared -nostdlib --whole-archive --no-undefined -o $PREFIX/libffmpeg.so \
    $PREFIX/lib/libavcodec.a \
    $PREFIX/lib/libavfilter.a \
    $PREFIX/lib/libavformat.a \
    $PREFIX/lib/libswresample.a \
    $PREFIX/lib/libswscale.a \
    $PREFIX/lib/libavutil.a \
    -lc -lm -lz -ldl -llog --dynamic-linker=/system/bin/linker $PREBUILT/lib/gcc/$ARCH-linux-$ANDROID/4.9.x/libgcc_real.a
}
build_android

Yes, I'm building with the NDK. Here is the build script. At the end, all the modules are linked into libffmpeg.so.

xufuji456 commented 1 year ago

Overhead	Symbol
20.44%	alf_filter_luma_10
6.66%	put_vvc_luma_hv_10
5.89%	alf_filter_chroma_10
4.99%	ff_vvc_inv_dct2_64
4.53%	alf_get_coeff_and_clip_10
4.07%	lmcs_filter_luma_10
3.50%	__memcpy
3.23%	vvc_loop_filter_luma_10
2.51%	put_vvc_luma_bi_hv_10
2.25%	vvc_deblock_bs
2.09%	ff_vvc_alf_filter
1.90%	alf_filter_luma_vb_10
1.75%	put_vvc_chroma_hv_10
1.55%	add_residual_10
1.49%	put_vvc_chroma_uni_hv_10
1.47%	memset
1.34%	put_vvc_luma_uni_hv_10
1.34%	put_vvc_chroma_bi_hv_10

It was my mistake. There is no problem with the compiled FFmpeg library. The real cause is that the symbol table was stripped when Android Studio ran. I have now disabled symbol stripping in Android Studio. The top functions are listed above.
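As a sanity check on the new profile, the coverage of the top functions and the combined ALF share can be computed from the rows above. This is a small sketch; `top_n_coverage` and `family_share` are hypothetical helper names, and grouping by the substring "alf" is an assumption about how the ALF functions are named:

```python
# (overhead percent, symbol) rows copied from the ARM profile above.
ARM_PROFILE = [
    (20.44, "alf_filter_luma_10"), (6.66, "put_vvc_luma_hv_10"),
    (5.89, "alf_filter_chroma_10"), (4.99, "ff_vvc_inv_dct2_64"),
    (4.53, "alf_get_coeff_and_clip_10"), (4.07, "lmcs_filter_luma_10"),
    (3.50, "__memcpy"), (3.23, "vvc_loop_filter_luma_10"),
    (2.51, "put_vvc_luma_bi_hv_10"), (2.25, "vvc_deblock_bs"),
    (2.09, "ff_vvc_alf_filter"), (1.90, "alf_filter_luma_vb_10"),
    (1.75, "put_vvc_chroma_hv_10"), (1.55, "add_residual_10"),
    (1.49, "put_vvc_chroma_uni_hv_10"), (1.47, "memset"),
    (1.34, "put_vvc_luma_uni_hv_10"), (1.34, "put_vvc_chroma_bi_hv_10"),
]


def top_n_coverage(profile, n):
    """Total overhead covered by the n most expensive symbols."""
    ordered = sorted(profile, key=lambda row: row[0], reverse=True)
    return sum(overhead for overhead, _ in ordered[:n])


def family_share(profile, keyword):
    """Total overhead of symbols whose name contains the keyword."""
    return sum(overhead for overhead, symbol in profile if keyword in symbol)


print(round(top_n_coverage(ARM_PROFILE, 8), 2))   # 53.31
print(round(family_share(ARM_PROFILE, "alf"), 2))  # 34.85
```

The top 8 functions now cover about 53% instead of 16%, and the ALF functions together account for roughly a third of the decode time.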

nuomi2021 commented 1 year ago

No worries. Thank you for the result. Now our results match. Could you help implement NEON/SVE code for alf_filter_luma_10? You can refer to

xufuji456 commented 1 year ago

I will try my best to optimize alf_filter_luma_10(). By the way, is it possible to use Vulkan to optimize H.266/VVC? My question is whether Vulkan is limited by the hardware (CPU/GPU). I mean something like the work Lynne does in FFmpeg:

https://github.com/cyanreg/FFmpeg/tree/vulkan
https://lynne.ee/vulkan-video-decoding.html
vulkan decode sample https://github.com/nvpro-samples/vk_video_samples
vulkan decode doc https://github.com/nvpro-samples/vk_video_samples

nuomi2021 commented 1 year ago

We are in a different domain. A Vulkan decoder is a wrapper around a hardware decoder, just like vaapi or nvdec. It uses fixed-function hardware to do the decoding. Quite possibly it is just a wrapper around vaapi or nvdec. ffvvc is a software decoder: we only use a general-purpose CPU to do the decoding. If we ported it to Vulkan, all data processing would have to run on the GPU, and the CPU could only do some control work. I believe it's possible, but it needs some time. Let us focus on the CPU side first. Then we can look at how to use Vulkan for a full GPU solution. Thank you.

xufuji456 commented 1 year ago

ok, copy that

nuomi2021 commented 1 year ago

Closing this since we got the ARM performance. @xufuji456 thank you

ffvvc / FFmpeg

Check ARM performance #43