aws / aws-graviton-getting-started

Helping developers to use AWS Graviton2, Graviton3, and Graviton4 processors which power the 6th, 7th, and 8th generation of Amazon EC2 instances (C6g[d], M6g[d], R6g[d], T4g, X2gd, C6gn, I4g, Im4gn, Is4gen, G5g, C7g[d][n], M7g[d], R7g[d], R8g).
https://aws.amazon.com/ec2/graviton/

ffmpeg performance on graviton3 is not as good as x86 #311

Closed xshen053 closed 1 year ago

xshen053 commented 1 year ago

Hi, I used phoronix-test-suite to benchmark ffmpeg on an x86 instance and a Graviton3 instance. The test suite uses vbench to measure ffmpeg performance. I used c6a.8xlarge and c7g.8xlarge instances. However, the results are not what I expected. Maybe I did something wrong?

Scenario

Encoder: libx264 - Scenario: Video on demand

c6a.8xlarge

FPS: 44.6 seconds: 170

c7g.8xlarge

FPS: 30.64 seconds: 247

According to this blog post, Graviton seems to have some ffmpeg optimizations:

https://aws.amazon.com/blogs/opensource/optimized-video-encoding-with-ffmpeg-on-aws-graviton-processors/
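For reference, the gap between the two results above can be quantified with a few lines of arithmetic. This is a minimal sketch that uses only the FPS and elapsed-time figures reported in this comment; the helper name `ratio` is made up for illustration.

```python
# Results copied from the report above (libx264, "Video on demand" scenario).
results = {
    "c6a.8xlarge": {"fps": 44.6, "seconds": 170},
    "c7g.8xlarge": {"fps": 30.64, "seconds": 247},
}


def ratio(numerator, denominator):
    """Return numerator / denominator, guarding against a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


# How much higher the c6a throughput is, and how much longer the c7g run took.
fps_ratio = ratio(results["c6a.8xlarge"]["fps"], results["c7g.8xlarge"]["fps"])
time_ratio = ratio(results["c7g.8xlarge"]["seconds"], results["c6a.8xlarge"]["seconds"])

print(f"c6a.8xlarge FPS is {fps_ratio:.2f}x the c7g.8xlarge FPS")
print(f"c7g.8xlarge took {time_ratio:.2f}x as long as c6a.8xlarge")
```

Both ratios come out close to 1.45, so the FPS and wall-clock figures are consistent with each other.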

snadampal commented 1 year ago

Hi @xshen053 , please share the below details to better understand the scenario

  1. ffmpeg version you are using
  2. whether you are building them from sources or using the release binaries
  3. which OS distribution
  4. the ffmpeg command line used in the benchmark, or the exact commands you used to run vbench.

xshen053 commented 1 year ago

ffmpeg

http://ffmpeg.org/releases/ffmpeg-6.0.tar.xz

x264

http://www.phoronix-test-suite.com/benchmark-files/x264-20221005.tar.xz

OS distribution

Ubuntu 20.04.6 LTS

exact commands I used for running the vbench

install phoronix-test-suite

git clone https://github.com/phoronix-test-suite/phoronix-test-suite.git
sudo ./install-sh

You might need to install other dependencies, such as:

sudo apt install php7.4-cli
sudo apt-get install php-xml

execute benchmark

phoronix-test-suite benchmark ffmpeg

choose 1

choose 4

Then it automatically runs the vbench benchmark and reports the results when it finishes.

xshen053 commented 1 year ago

Hey, were you able to run the test I did? Do you need any other information?

snadampal commented 1 year ago

Hi @xshen053 , we are trying to reproduce your scenario. Can you also please let us know the CPU utilization you observed during the runs? Is it possible to use your real application for benchmarking the performance instead?

xshen053 commented 1 year ago

AWSjswinney commented 1 year ago

I tried to run the test but I ran into a problem with the phoronix code, so I wasn't able to run it without investing some time to debug. My guess would be that the test is single-threaded (or at least leaving many cores idle), which would explain the performance discrepancy. Graviton CPUs in general are optimized to sustain large workloads over many (or all) cores without reductions in performance. Other CPUs that have SMT can encounter resource constraints when fully loaded across the whole instance. There is some data about that here: https://github.com/aws/aws-graviton-getting-started/blob/main/perfrunbook/system-load-and-compute-headroom.md
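To illustrate the difference between a single encode that leaves cores idle and a load that occupies every core, here is a rough sketch that starts one single-threaded libx264 encode per vCPU. It is not the phoronix or vbench harness. The function names and the input file are hypothetical, and the encoder settings are illustrative rather than the ones used in the benchmark.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_encode_command(input_path, threads_per_job=1):
    """Build an ffmpeg command that encodes with libx264 and discards the output."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", input_path,
        "-an",                       # ignore audio so only video encoding is measured
        "-c:v", "libx264",
        "-threads", str(threads_per_job),
        "-f", "null", "-",           # null muxer: encode but write no file
    ]


def run_parallel_encodes(input_path, jobs=None):
    """Run `jobs` encodes at once (default: one per vCPU) and return their exit codes."""
    jobs = jobs or os.cpu_count() or 1
    command = build_encode_command(input_path)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(subprocess.run, command, check=False) for _ in range(jobs)]
        return [future.result().returncode for future in futures]
```

Calling `run_parallel_encodes("input.mp4")` (with a hypothetical input file) while watching htop should show all cores busy, which is a different operating point from one encode running alone.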

As mentioned in the blog post that you linked to, most video workloads utilize entire instances in order to transcode many video streams or files in parallel. This leads to the lowest cost per unit time of video. When I ran the benchmarks for that post, I designed them to fully load the instances. In that scenario, Graviton3-powered C7g instances achieved the lowest cost to encode of the instances I tested, which were C6i, C6a, and C7g.
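The cost comparison described above comes down to dividing the instance price by the throughput of the fully loaded instance. The sketch below shows that calculation. The prices and frame rates are placeholders chosen only to make the arithmetic visible; they are not real AWS prices or measured results, and the function name is made up for illustration.

```python
def cost_per_million_frames(hourly_price_usd, aggregate_fps):
    """Cost in USD to encode one million frames on a fully loaded instance.

    aggregate_fps is the sum of the frame rates of all encodes running in parallel.
    """
    if aggregate_fps <= 0:
        raise ValueError("aggregate_fps must be positive")
    frames_per_hour = aggregate_fps * 3600
    return hourly_price_usd / frames_per_hour * 1_000_000


# Placeholder inputs (assumptions, NOT real prices or benchmark data).
candidates = {
    "instance_a": {"hourly_price_usd": 1.20, "aggregate_fps": 400.0},
    "instance_b": {"hourly_price_usd": 1.00, "aggregate_fps": 380.0},
}

for name, spec in candidates.items():
    cost = cost_per_million_frames(spec["hourly_price_usd"], spec["aggregate_fps"])
    print(f"{name}: ${cost:.3f} per million frames")
```

With these made-up inputs the instance with the lower aggregate FPS still comes out cheaper per frame, which is why a cost comparison can rank instances differently from a single-stream FPS comparison.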

I suspect that there is some step that you ran that is missing which is preventing me from running the phoronix test. (That would ultimately mean there's a bug in the test.) Perhaps libx264-devel was already installed from the Ubuntu package manager that led to ffmpeg building with an older (and less optimized) version? (Just a guess...)

AWSjswinney commented 1 year ago

A rough approximation of the CPU usage with htop would be fine. If you want to explore a more rigorous method, use sysstat. Just make sure you get an idea for how the usage changes during the test. E.g. does it start out using one core and then use all in the middle? Is it steady state or periodic?
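If neither htop nor sysstat is convenient, per-core usage over time can also be approximated by sampling `/proc/stat` on Linux. The following is a minimal sketch under that assumption. The function names and the 50% "busy" threshold are arbitrary choices for illustration.

```python
import time


def parse_cpu_times(text):
    """Parse /proc/stat content into {cpu_name: (busy_ticks, total_ticks)} per core."""
    times = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep per-core lines ("cpu0", "cpu1", ...); skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Columns: user nice system idle iowait irq softirq steal
        # (guest time is already included in user, so it is not added again).
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values)
        times[fields[0]] = (total - idle, total)
    return times


def utilization(before, after):
    """Return {cpu_name: percent busy} between two parse_cpu_times() snapshots."""
    result = {}
    for cpu, (busy_after, total_after) in after.items():
        busy_before, total_before = before.get(cpu, (0, 0))
        delta_total = total_after - total_before
        delta_busy = busy_after - busy_before
        result[cpu] = 100.0 * delta_busy / delta_total if delta_total > 0 else 0.0
    return result


def watch(samples=5, interval=1.0, path="/proc/stat"):
    """Print how many cores are more than 50% busy in each sampling interval."""
    with open(path) as f:
        previous = parse_cpu_times(f.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as f:
            current = parse_cpu_times(f.read())
        usage = utilization(previous, current)
        busy_cores = sum(1 for pct in usage.values() if pct > 50.0)
        print(f"{busy_cores}/{len(usage)} cores above 50% busy")
        previous = current
```

Running `watch(samples=300)` in a second terminal during the benchmark gives a coarse timeline that shows whether the test starts on one core and ramps up, stays steady, or is periodic.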

xshen053 commented 1 year ago

Thanks for the reply!

AWSjswinney commented 1 year ago

Everything depends on the workload and you would need to benchmark for what you are interested in, but it can be the case that a single thread will run faster on an M7i, M6i, or C6a than C7g, as you have seen with this ffmpeg benchmark.

geoffreyblake commented 1 year ago

Closing this issue as the question appears to have been answered.