Summary of Changes
Motivation
After #197, we want to benchmark the inference server and track changes over time.
Implementation
I added scripts to create graphs from the gathered benchmark data and show them on the website with sphinx-charts under a new page called Performance.
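As a rough illustration of the approach (not the exact script added in this PR), the graph-generation step could look something like the following: read the gathered benchmark results and write a Plotly figure as JSON, which sphinx-charts can then embed on the Performance page. The file locations and the CSV column names ("commit", "qps") are assumptions for the sketch.

```python
# Hypothetical sketch: turn gathered benchmark results (assumed CSV with
# "commit" and "qps" columns) into a Plotly figure JSON for sphinx-charts.
import csv
from pathlib import Path

import plotly.graph_objects as go

RESULTS_CSV = Path("benchmarks/results.csv")      # assumed location of gathered data
CHART_JSON = Path("docs/charts/throughput.json")  # referenced from the docs page

commits, qps = [], []
with RESULTS_CSV.open() as f:
    for row in csv.DictReader(f):
        commits.append(row["commit"])
        qps.append(float(row["qps"]))

fig = go.Figure(go.Scatter(x=commits, y=qps, mode="lines+markers", name="throughput"))
fig.update_layout(
    title="Inference server throughput over time",
    xaxis_title="commit",
    yaxis_title="queries per second",
)

CHART_JSON.parent.mkdir(parents=True, exist_ok=True)
fig.write_json(str(CHART_JSON))
```

The generated JSON file would then be referenced from the Performance page in the documentation source so that sphinx-charts renders it at build time.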
Notes
In the future, running similar kinds of benchmarks could be part of the CI to ensure that changes don't degrade the server performance. However, some open questions need to be answered first:
How long does the test need to run to produce meaningful results? In initial tests there was variation in the results even at the MLPerf limits.
Temporary dips in performance may be tolerable, e.g. if the cause is known, so a drop shouldn't automatically cause the PR to be rejected.
How much of a drop is significant enough to merit remedial action? (A minimal threshold check along these lines is sketched after this list.)
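To make the threshold question concrete, here is a minimal sketch of what such a CI gate might look like; it is not part of this PR, and the file paths, metric name, and 10% tolerance are placeholders.

```python
# Hypothetical CI gate: compare the latest benchmark result against a stored
# baseline and fail only when the regression exceeds a configurable tolerance.
# File locations, metric name, and the 10% default threshold are assumptions.
import json
import sys
from pathlib import Path

BASELINE = Path("benchmarks/baseline.json")  # e.g. {"qps": 1250.0}
LATEST = Path("benchmarks/latest.json")
TOLERANCE = 0.10  # allow up to a 10% drop before failing the check

baseline_qps = json.loads(BASELINE.read_text())["qps"]
latest_qps = json.loads(LATEST.read_text())["qps"]

drop = (baseline_qps - latest_qps) / baseline_qps
if drop > TOLERANCE:
    print(f"Throughput regressed by {drop:.1%} (limit {TOLERANCE:.0%}); please investigate.")
    sys.exit(1)  # CI could surface this as a warning instead of a hard failure
print(f"Throughput within tolerance (change: {-drop:.1%}).")
```

Whether such a check should block a merge or only warn ties back to the open questions above, since known, temporary dips should not automatically reject a PR.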