Karm / mandrel-integration-tests

Integration tests for GraalVM and its Mandrel distribution. Runs Quarkus, Helidon and Micronaut applications and small targeted reproducers. The focus is solely on native-image utility and compilation of Java applications into native executables.
Apache License 2.0

Would making `diff_` thresholds a percentage instead of absolute values make more sense? #206

Open zakkak opened 1 year ago

zakkak commented 1 year ago

Right now, `diff_*` thresholds for performance regression testing are defined as absolute numbers, e.g.:

https://github.com/Karm/mandrel-integration-tests/blob/e67f2fd749af73b058670eac927138289413cbe2/apps/jfr-native-image-performance/threshold.properties#L1
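
For illustration, an absolute entry in such a file might look like the following; the key name here is hypothetical, see the linked file for the real properties:

```properties
# Hypothetical absolute threshold: fail if the native run's time-to-complete
# exceeds the recorded JVM baseline by more than 500 ms.
app.time-to-complete.diff_native.ms=500
```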

This sometimes results in test failures when running on machines other than the one used to tune the thresholds.

However, I am thinking that checking whether the increase is within an acceptable relative range, e.g. 5%, would probably make more sense. After all, a 50 ms increase on a 10 ms run is huge, while on a 5 s run it's negligible.

I wonder if switching to percentages would also allow us to perform the regression testing (only for diffs between runs) on various machines (including GitHub runners) without losing accuracy.

cc @Karm @jerboaa

roberttoyonaga commented 1 year ago

Hi @zakkak, just chiming in here: the JFR perf test thresholds are specified as a relative change (`|new - old| / old`). Maybe something similar could make sense elsewhere too.
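
A minimal sketch of that check, assuming measurements in milliseconds; the class, method, and parameter names are illustrative, not the testsuite's actual API:

```java
// Sketch of a relative-change threshold check: |new - old| / old.
final class RelativeThreshold {
    static boolean withinThreshold(long baseline, long measured, double maxRelativeChange) {
        // Relative change of the new measurement against the baseline run
        double relativeChange = Math.abs(measured - baseline) / (double) baseline;
        return relativeChange <= maxRelativeChange;
    }
}
```

With a 5% tolerance, `withinThreshold(5000, 5050, 0.05)` passes (a 50 ms increase on a 5 s run is 1%), while `withinThreshold(10, 60, 0.05)` fails, matching the intuition above.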

Karm commented 1 year ago

Definitely makes sense. It requires recording a JVM run as a baseline, but that already happens; see the `diff_jvm` and `diff_native` suffixes in threshold.properties.

zakkak commented 1 year ago

@Karm what percentage would you consider acceptable?

Karm commented 1 year ago

> @Karm what percentage would you consider acceptable?

There are 2 things:

1) The % difference between JVM (time-to-first-ok-request, time-to-complete, RSS) and Native, i.e. is it acceptable that Native's time-to-complete is 10% worse, etc.

2) And then there is a deviation from some hardcoded value.

I'd focus on 2): I'd hardcode values from a Q 2.13.8.Final, M 22.3.3.1-Final run on a reference system, then run again with Q 2.16.9.Final, M 22.3.3.1-Final on the same reference system and record the percentage difference. That is what I'd use as the acceptable percentage to judge the success or failure of Quarkus 3.x and M 23.x. By reference system I mean one of the stock 8-core, 16 GB RAM RHEL 8 VMs (backed by contemporary Xeons) that I use, as they have a pretty stable performance profile.
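
Sketched with made-up numbers (the real values would come from the two reference-system runs described above):

```java
// Hypothetical calibration of the acceptable percentage; all numbers are made up.
double old213 = 4200.0; // e.g. time-to-complete (ms), Q 2.13.8.Final + M 22.3.3.1-Final
double new216 = 4500.0; // same metric, Q 2.16.9.Final + M 22.3.3.1-Final, same machine
// The observed relative difference becomes the tolerance used to judge
// later Quarkus 3.x / Mandrel 23.x runs:
double tolerance = Math.abs(new216 - old213) / old213; // ~0.071, i.e. ~7%
```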