Karm / mandrel-integration-tests

Integration tests for GraalVM and its Mandrel distribution. Runs Quarkus, Helidon and Micronaut applications and small targeted reproducers. The focus is solely on native-image utility and compilation of Java applications into native executables.
Apache License 2.0

Add JFR performance tests #179

Closed · roberttoyonaga closed 10 months ago

roberttoyonaga commented 11 months ago

Summary

This PR attempts to add a JFR performance test that checks RSS at startup, image size, startup time, mean response time, and latency percentiles (50th, 90th, 99th). The test uses Hyperfoil to benchmark response times and a slightly altered Quarkus getting-started quickstart that has been rigged and tuned to produce JFR events.

I've been running it with mvn clean verify -Ptestsuite -DexcludeTags=all -DincludeTags=jfr-perf


Details

Test runs

There should be 4 test runs in total: no JFR + "normal" case benchmark, JFR + "normal" case benchmark, no JFR + "worst" case benchmark, and JFR + "worst" case benchmark. The goal of the test is not to obtain absolute results for analysis, but to compare runs. The report.md and measurements.csv files should also contain "diff" info for convenience.

Hyperfoil

Hyperfoil is downloaded from quay.io and run in "standalone" mode. The test communicates with the Hyperfoil controller via its REST API. Two Hyperfoil benchmarks have been added in this PR: a "normal" case and a "worst" case. The "worst" case benchmark is meant to emphasize the performance impact of the JFR infrastructure, so it emits an unrealistically large number of events. The "normal" case benchmark is meant to be more realistic and is what generates the metrics that are compared against the thresholds defined in threshold.properties. Please see the test's javadoc for more info.
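
For illustration, a minimal sketch of pulling and starting the standalone controller from Java; the podman invocation, image coordinates, and port mapping are assumptions based on Hyperfoil's public quay.io image, not necessarily what this PR does:

import java.io.IOException;

// Hypothetical launcher, not the test's actual code: standalone mode runs
// the controller and an in-VM agent together; the REST API is then served
// on port 8090.
public class HyperfoilLauncher {
    public static Process startStandalone() throws IOException {
        return new ProcessBuilder(
                "podman", "run", "--rm", "-p", "8090:8090",
                "quay.io/hyperfoil/hyperfoil", "standalone")
                .inheritIO()
                .start();
    }
}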

Quarkus app

The app has two main endpoints: "hello/work/{text}" is what the "worst" case benchmark hits, and "hello/regular/{text}" is what the "normal" case benchmark hits. This means the test only needs to build the app twice (with and without JFR included in the image).
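
For context, a sketch of what such a resource could look like, assuming the Jakarta REST annotations of Quarkus 3.x; the class name, loop count, and method bodies are illustrative, not the PR's actual code:

import java.util.concurrent.locks.LockSupport;

import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;

// Hypothetical sketch of the two endpoints, not the PR's actual code.
@Path("/hello")
public class GreetingResource {

    // "worst" case target: parks many times per request to flood JFR
    // with jdk.ThreadPark events.
    @GET
    @Path("/work/{text}")
    public String work(@PathParam("text") String text) {
        for (int i = 0; i < 1_000; i++) {
            LockSupport.parkNanos(1L);
        }
        return "hello " + text;
    }

    // "normal" case target: a single park keeps the event rate closer
    // to a realistic workload.
    @GET
    @Path("/regular/{text}")
    public String regular(@PathParam("text") String text) {
        LockSupport.parkNanos(1L);
        return "hello " + text;
    }
}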


Worth mentioning

Cache and turbo boost

The test clears the cache before measuring startup time. This is averaged over several trials. Turbo boost is disabled before running the Hyperfoil benchmark and enabled again afterwards. Both of these operations require executing a command with sudo (see the sketch below). This is not a problem on my local machine, because the test pauses midway and prompts me to enter my password. However, I'm not sure whether this will be an issue if this test suite is run automatically as part of CI. If it is a problem, I'll instead add some "warm up" runs to populate the cache before observing. I'm not sure how else to deal with turbo boost, though.
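
For concreteness, a minimal sketch of the two privileged operations, assuming a Linux host with the intel_pstate driver; the helper names are illustrative and the sysfs path differs on non-Intel systems:

import java.io.IOException;

// Hypothetical helpers, not the test's actual code.
public class HostTuning {

    // Run a shell snippet under sudo so the redirection itself is privileged.
    static void sudoSh(String script) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("sudo", "sh", "-c", script).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IllegalStateException("failed: " + script);
        }
    }

    // Flush dirty pages, then drop the page cache, dentries and inodes.
    static void dropCaches() throws IOException, InterruptedException {
        sudoSh("sync && echo 3 > /proc/sys/vm/drop_caches");
    }

    // On intel_pstate, writing 1 to no_turbo disables turbo boost; restore with 0.
    static void setTurbo(boolean on) throws IOException, InterruptedException {
        sudoSh("echo " + (on ? 0 : 1) + " > /sys/devices/system/cpu/intel_pstate/no_turbo");
    }
}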

Client server sharing resources

The Hyperfoil benchmark part of the test uses a client and a server. Currently, they run on the same machine without any isolation of resources. This may not be an issue if this perf test is going to be used as a "sanity check" to give us a rough idea of how JFR is performing. But this problem will need to be addressed if we want to be confident we have reliable results that can be published.

Mandrel version

Currently the test is only set up to run with 23.0. This is because that's the first version in which jdk.ThreadPark is supported. This event is useful for testing the JFR infrastructure because we can manually trigger an event emission by calling LockSupport.parkNanos(). A similar thing can be achieved with jdk.ThreadSleep by calling Thread.sleep(), or with jdk.JavaMonitorWait by calling wait(). However, even when the specified sleep/wait times are set to 1 ns, the non-JFR overhead of calling those methods is relatively large. By contrast, LockSupport.parkNanos() has very low non-JFR overhead, which allows us to spend a higher percentage of time exercising the JFR code.
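
Roughly, the three triggers compare like this; an illustrative sketch with tiny nominal timeouts, not the test's actual code:

import java.util.concurrent.locks.LockSupport;

// Illustrative comparison of the three ways to force a JFR event emission.
public class EventTriggers {
    private static final Object LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        // jdk.ThreadPark (23.0+): almost all of the call's cost is JFR code.
        LockSupport.parkNanos(1L);

        // jdk.ThreadSleep: available on older releases, but the sleep
        // machinery itself dominates, diluting the JFR overhead we measure.
        Thread.sleep(0, 1);

        // jdk.JavaMonitorWait: same story as sleep.
        synchronized (LOCK) {
            LOCK.wait(0, 1);
        }
    }
}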

In the future I might spend some more time tuning this test using different events so that it can be used with 22.3 and 21.2.


Profiling with Perf

Here I basically just want to show that the Quarkus app is in fact spending most of its time where we'd expect it to (dealing with the thread park events).

With JFR: Built with mvn package -Pnative -Dquarkus.version=3.1.2.Final -Dquarkus.native.debug.enabled -Dquarkus.native.monitoring=jfr -Dquarkus.native.additional-build-args=-H:+SignalHandlerBasedExecutionSampler,-H:-DeleteLocalSymbols

Without JFR: Built with mvn package -Pnative -Dquarkus.version=3.1.2.Final -Dquarkus.native.debug.enabled -Dquarkus.native.additional-build-args=-H:-DeleteLocalSymbols

Karm commented 10 months ago

Thank you so much @roberttoyonaga :partying_face: :fireworks:! This is an awesome addition. I will read the source, try it, and share feedback. Right off the bat, let me respond to some remarks in your PR text:

sudo. This is not a problem on my local machine...

Sudo is not a problem for CI either; e.g. the podman tests also require sudo to be able to attach debuggers to containers on RHEL 8 etc. Some care needs to be taken to return the system to its original state though, as not all test systems are throwaway VMs; we also execute on long-running bare-metal systems. Now that we know about it, we can easily make sure it's O.K.

Client server sharing resources: But this problem will need to be addressed if we want to be confident we have reliable results that can be published.

Mandrel-integration-tests is by no means a vehicle for reliable, meaningful performance testing, as its inherent design flaw, if you will, is that it disturbs the tested system with all its JVM+Maven runtime shenanigans. We cannot use it to confidently publish results about the throughput of JFR-monitored apps in absolute numbers for a given system. We can and will use it heavily to compare Mandrel versions/Quarkus versions and to guard against regressions between versions, though.

In the future I might spend some more time tuning this test using different events so that it can be used with 22.3 and 21.2.

23.0 support is splendid. 22.3 would be nice to have. 21.x is superfluous as we are retiring support for that version from the TS soon (#178).

Karm commented 10 months ago

(CI has some big pending tweaks in #177 )

Karm commented 10 months ago

@roberttoyonaga Hyperfoil: There are some problems with synchronization, e.g. you assume Hyperfoil is already downloaded on the system and you give the endpoint 15 s to come up. By the time that expires, the image is not even downloaded yet. I will propose a fix in a while, as I am browsing it step by step...

roberttoyonaga commented 10 months ago

@roberttoyonaga We need to deal with Jakarta in Quarkus 2.x vs. Quarkus 3.x (we do that by applying a patch before the test and reverting it in finally).

Question: Does it make sense to run with Quarkus 2.x? I can commit that patch/revert in your branch if you don't mind.

There isn't anything particularly special about Quarkus 3 that it needs, so using Quarkus 2.x should be fine as long as the Mandrel version is 23+.

roberttoyonaga commented 10 months ago

@roberttoyonaga Hyperfoil: There are some problems with synchronization, e.g. you assume Hyperfoil is already downloaded on the system and you give the endpoint 15 s to come up. By the time that expires, the image is not even downloaded yet. I will propose a fix in a while, as I am browsing it step by step...

Oh yes, that's true. I think the step you added here https://github.com/Karm/mandrel-integration-tests/pull/179/commits/7caab529e92dd25bc77de25acf575f8afe616f46#diff-cf0d7960d2c75384d9f858a533ceddead58a7d8456ee54f6dfbf77036c6004e7R251-R253 should fix the problem. In the original version of this test I just kept hitting the Hyperfoil server's /openapi endpoint until I got a successful response before continuing to the next step.
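
That readiness poll looked roughly like this sketch, using java.net.http; the controller URL (default port 8090, confirmed by the Hyperfoil log further down) and the timeout values are assumptions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical readiness poll, not the PR's exact code: retry /openapi
// until the controller answers, instead of a fixed 15 s wait.
public class HyperfoilReadiness {
    public static void await(HttpClient client, long timeoutMs) throws Exception {
        HttpRequest probe = HttpRequest.newBuilder(
                URI.create("http://localhost:8090/openapi")).GET().build();
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            try {
                if (client.send(probe, HttpResponse.BodyHandlers.discarding()).statusCode() == 200) {
                    return; // controller is up
                }
            } catch (java.io.IOException e) {
                // connection refused while the container is still pulling/starting; retry
            }
            Thread.sleep(1_000);
        }
        throw new IllegalStateException("Hyperfoil controller did not come up in time");
    }
}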

Karm commented 10 months ago

@roberttoyonaga uses our threshold.properties to store an acceptable difference between two runs (JFR and no JFR in this case). That inspired me to refactor the rest of the old JFR smoke tests too and to create a notion of a diff. E.g.

apps/debug-symbols-smoke/threshold.properties

linux.jvm.time.to.finish.threshold.ms=6924
linux.native.time.to.finish.threshold.ms=14525
windows.jvm.time.to.finish.threshold.ms=8000
windows.native.time.to.finish.threshold.ms=17948
linux.diff_jvm.time.to.finish.threshold.ms=20
linux.diff_native.time.to.finish.threshold.ms=30
windows.diff_jvm.time.to.finish.threshold.ms=20
windows.diff_native.time.to.finish.threshold.ms=30

linux.jvm.time.to.finish.threshold.ms is an absolute number that, when crossed, fails the test. linux.diff_jvm.time.to.finish.threshold.ms is an acceptable difference of 20 ms in JVM mode between the JFR and noJFR runs (or any other runs when used that way).
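
In other words, something along these lines; an illustrative sketch assuming measurements in ms and the Linux JVM-mode keys above, not the TS's actual implementation:

import java.util.Properties;

import static org.junit.jupiter.api.Assertions.assertTrue;

// Hypothetical check combining the two kinds of thresholds.
public class ThresholdCheck {
    static void check(long jfrMs, long noJfrMs, Properties p) {
        long absolute = Long.parseLong(p.getProperty("linux.jvm.time.to.finish.threshold.ms"));
        long diff = Long.parseLong(p.getProperty("linux.diff_jvm.time.to.finish.threshold.ms"));
        // The absolute threshold guards a single run...
        assertTrue(jfrMs <= absolute, "run took longer than " + absolute + " ms");
        // ...while the diff threshold guards the JFR-vs-noJFR overhead.
        assertTrue(Math.abs(jfrMs - noJfrMs) <= diff,
                "JFR overhead exceeded " + diff + " ms");
    }
}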

I also changed the wording of the error messages, so that it's more apparent that we are talking about overhead, or about something having a greater difference:

org.opentest4j.AssertionFailedError: Application JFR_PERFORMANCE in mode diff_native took 65 ms more to get the first OK
 request, which is over 60 ms threshold by 8%. ==> expected: <true> but was: <false>
        at org.graalvm.tests.integration.JFRTest.startComparisonForBenchmark(JFRTest.java:208)
        at org.graalvm.tests.integration.JFRTest.jfrPerfTest(JFRTest.java:167)

...

[ERROR] Failures: 
[ERROR]   JFRTest.jfrSmokeTest:125->jfrSmoke:473 Application JFR_SMOKE in mode diff_native took 21 ms more to finish,
 which is over 15 ms threshold by 40%. ==> expected: <true> but was: <false>
[INFO] 
[ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0

@zakkak ^

Karm commented 10 months ago

@roberttoyonaga Is this something you are looking at or is it lost in the sea of re-runs?

 Error:  Failures: 
Error:    JFRTest.jfrPerfTest:176 build-and-run.log log should not contain error or warning lines that are not whitelisted. See /home/runner/work/mandrel-integration-tests/mandrel-integration-tests/ts/testsuite/target/archived-logs/org.graalvm.tests.integration.JFRTest/jfrPerfTest/build-and-run.log and check these offending lines: 
17:30:44,627 ERROR (vert.x-eventloop-thread-2) [i.h.c.ControllerVerticle] Uncaught error: java.lang.NullPointerException ==> expected: <true> but was: <false>
[INFO] 
Error:  Tests run: 22, Failures: 1, Errors: 0, Skipped: 5
[INFO] 
Karm commented 10 months ago

@roberttoyonaga We need to deal with Jakarta in Quarkus 2.x vs. Quarkus 3.x (we do that by applying a patch before the test and reverting it in finally). Question: Does it make sense to run with Quarkus 2.x? I can commit that patch/revert in your branch if you don't mind.

There isn't anything particularly special about Quarkus 3 that it needs, so using Quarkus 2.x should be fine as long as the Mandrel version is 23+.

Ack. We don't test Q 2.x with M 23.x. So this would be interesting only for M 22.x runs. I will just add a Q version check then.

Karm commented 10 months ago

Looking at failures....

Error:  Failures: 
Error:    JFRTest.jfrPerfTest:176 build-and-run.log log should not contain error or warning lines that are not whitelisted. See /home/runner/work/mandrel-integration-tests/mandrel-integration-tests/ts/testsuite/target/archived-logs/org.graalvm.tests.integration.JFRTest/jfrPerfTest/build-and-run.log and check these offending lines: 
23:04:35,835 ERROR (vert.x-eventloop-thread-2) [i.h.c.ControllerVerticle] Uncaught error: java.lang.NullPointerException ==> expected: <true>

IMHO not a TS issue...

Karm commented 10 months ago

@roberttoyonaga I would like to merge this. Our TODO list as I see it:

3. The Numbers

I can see this in the report.md on my laptop (sheets reformatted for brevity):

work Measurements PLAINTEXT_PERFORMANCE:

App                    timeToFirstOKRequestMs  RSSKb  executableSizeKb  meanResponseTime
PLAINTEXT_PERFORMANCE  115                     58386  43772             3933096

maxResponseTime  responseTime50Percentile  responseTime90Percentile  responseTime99Percentile
21626879         4390911                   5046271                   11468799

Soooo, it took 115 ms to get the first OK HTTP response from the Quarkus endpoint. Simple enough. RSSKb is also clear enough. What is meanResponseTime 3933096 though? If it's in ms, it would be more than an hour. If it's in ns, it would be 4 ms, and that sounds reasonable. Could you add units, please? maxResponseTime: same question as above. responseTime50Percentile 4390911 - hmm, does it mean that more than 50% of all requests made during the test successfully made it under 4.3 ms? Or that 4390911 requests (as a number of requests) made it through in 50% of some set time threshold? Same question for responseTime90Percentile etc.

I do apologize in advance if it should be self-evident. I would rather ask a stupid question online than misinterpret something by several orders of magnitude :-D

work Measurements Diff %:

App              timeToFirstOKRequestMs  RSSKb  executableSizeKb  meanResponseTime
JFR_PERFORMANCE  85                      21     5                 183

maxResponseTime  responseTime50Percentile  responseTime90Percentile  responseTime99Percentile
86               144                       228                       222

I understand we are dealing with a diff here, on the assumption that e.g. timeToFirstOKRequestMs will never be better with JFR than without it. In the wild case where this does happen though (e.g. a VM on CI being migrated gets ludicrously slow while running the noJFR baseline and then runs normally when JFR is tested), are we good WRT division by 0 etc.? Could it handle such a case and spit out a warning/error message? I am even inclined to fail() such a test that would break sanity.

timeToFirstOKRequestMs 85. Soo... the timeToFirstOKRequestMs is 85% worse, i.e. +85% with JFR compared to without it? Or is it bigger by 85 ms? Similar question for the other columns.

Thank you so much for taking the time. It is fine if you just jot down the notes here and I can write it down in a README if you like.

roberttoyonaga commented 10 months ago

Hi @Karm!

would be nice if we could nail down the ERROR messages

Yes, I will nail down that error message. I hadn't seen it when running the JFR test in isolation. I'll see if I can reproduce it by running all tests together.

The Numbers:

What is meanResponseTime 3933096 though? If it's in ms, ...

Yes, the absolute values for response times (mean, max, percentiles) are in ns, not ms. I'll add the units in.

hmm, does it mean, that more than 50% of all requests made during the test successfully made it under 4.3ms?

With respect to the response time percentiles, responseTimeXPercentile=N means that X% of requests completed in under N ns. In the example you show, it means 50% of all requests made during the test successfully completed in under 4.3 ms.
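
For the record, this is the usual nearest-rank reading of a percentile; a minimal illustrative sketch, not the TS's code:

import java.util.Arrays;

// Illustrative nearest-rank percentile: returns N such that pct% of the
// recorded response times are <= N ns.
public class Percentiles {
    static long percentile(long[] responseTimesNs, int pct) {
        long[] sorted = responseTimesNs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }
}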

are we good WRT division by 0 etc.?

Right now, I haven't added any handling of division by 0, because I only expect this to happen if the absolute value of some metric is 0, which I suppose is possible. I'll add in some special handling for this case.
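
A guard along these lines would do; a sketch only, using fail() as Karm suggested, applied to the diff formula described in the next reply:

import static org.junit.jupiter.api.Assertions.fail;

// Hypothetical guard for the diff computation, not the PR's actual code.
public class DiffUtil {
    static double diffPercent(long jfr, long noJfr) {
        if (noJfr == 0) {
            // A zero baseline means the measurement itself is suspect;
            // fail loudly instead of reporting a meaningless diff.
            fail("Baseline (no JFR) measurement was 0; cannot compute a diff");
        }
        return Math.abs(jfr - noJfr) * 100.0 / noJfr;
    }
}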

timeToFirstOKRequestMs 85. Soo...the timeToFirstOKRequestMs is 85% worse,

Currently, the diff calculation is as follows: Absolute_Value(JFR_measurement - NO_JFR_measurement) / NO_JFR_measurement. So it shows the % difference between the with/without JFR runs, relative to running normally without JFR. None of the diff values reported are absolute values.

roberttoyonaga commented 10 months ago

Hi @Karm,

I pushed a commit adding the changes you suggested and fixing a few minor bugs. I ran the whole test suite multiple times but was unable to reproduce the error showing in the GitHub Actions logs. Is there any way to get more info on that error (does GH Actions store more verbose logs somewhere)?

Karm commented 10 months ago

Hi @Karm,

I pushed a commit adding the changes you suggested and fixing a few minor bugs. I ran the whole test suite multiple times but was unable to reproduce the error showing in the GitHub Actions logs. Is there any way to get more info on that error (does GH Actions store more verbose logs somewhere)?

@roberttoyonaga This is the one that had an error in the log: https://github.com/Karm/mandrel-integration-tests/suites/15327409135/artifacts/875328229

And this one merely timed out this time: https://github.com/Karm/mandrel-integration-tests/actions/runs/5931993941/job/16084983949#step:8:17153

Karm commented 10 months ago

@roberttoyonaga The ERROR message comes from Hyperfoil:

23:17:17,696 INFO  (main) [i.h.Hyperfoil] Java: Red Hat, Inc. OpenJDK 64-Bit Server VM 11.0.20 11.0.20+8-LTS (/usr/lib/jvm/java-11-openjdk-11.0.20.0.8-2.el8.x86_64), CWD /home/jboss
23:17:17,703 INFO  (main) [i.h.Hyperfoil] Hyperfoil: 0.25.2 (0382c3340baf8205c3ad1d93ec475a106f6f2820)
23:17:17,704 INFO  (main) [i.h.Hyperfoil]            DISTRIBUTION:  /deployment
23:17:17,708 INFO  (main) [i.h.Hyperfoil]            ROOT_DIR:      /tmp/hyperfoil
23:17:17,709 INFO  (main) [i.h.Hyperfoil]            BENCHMARK_DIR: /tmp/hyperfoil/benchmark
23:17:17,709 INFO  (main) [i.h.Hyperfoil]            RUN_DIR:       /tmp/hyperfoil/run
23:17:17,710 INFO  (main) [i.h.Hyperfoil]            HOOKS_DIR:     /tmp/hyperfoil/hooks
23:17:17,711 INFO  (main) [i.h.Hyperfoil] Starting non-clustered Vert.x...
23:17:18,092 INFO  (main) [i.h.Hyperfoil] Deploying ControllerVerticle...
23:17:18,113 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerVerticle] Starting in directory /tmp/hyperfoil...
23:17:18,454 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerServer] Hyperfoil controller listening on http://fv-az377-431:8090
23:17:18,458 INFO  (vert.x-eventloop-thread-1) [i.h.Hyperfoil] ControllerVerticle deployed.
23:17:21,107 INFO  (vert.x-worker-thread-1) [i.h.c.u.PersistenceUtil] Stored benchmark 'jfr-hyperfoil' in /tmp/hyperfoil/benchmark/jfr-hyperfoil.yaml
23:17:21,460 INFO  (vert.x-eventloop-thread-0) [i.h.c.u.PersistenceUtil] Stored benchmark 'jfr-hyperfoil' in /tmp/hyperfoil/run/0000/jfr-hyperfoil.yaml
23:17:22,483 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerVerticle] Starting benchmark jfr-hyperfoil - run 0000
23:17:22,483 INFO  (vert.x-eventloop-thread-2) [i.h.c.AgentVerticle] 639bc981-76ac-434c-bc39-7dd1c2dcb2a6 Got reply from controller.
23:17:22,485 INFO  (vert.x-eventloop-thread-2) [i.h.c.AgentVerticle] Initializing agent
23:17:22,528 INFO  (vert.x-eventloop-thread-2) [i.h.h.c.HttpClientPoolImpl] Allocating 25 connections (max 25, buffer 0) in 1 executors to http://localhost:8080
23:17:22,687 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerPhase] 0000 main changing status NOT_STARTED to STARTING
23:17:22,694 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerPhase] 0000 main changing status STARTING to RUNNING
23:17:27,685 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerPhase] 0000 main changing status RUNNING to FINISHING
23:17:27,690 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerPhase] 0000 main changing status FINISHING to FINISHED
23:17:27,696 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerPhase] 0000 main changing status FINISHED to TERMINATED
23:17:27,697 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerVerticle] 0000 All phases are terminated.
23:17:27,698 INFO  (vert.x-eventloop-thread-2) [i.h.c.AgentVerticle] Received agent reset
23:17:27,704 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerVerticle] Run 0000: completing stats for phase main
23:17:27,717 ERROR (vert.x-eventloop-thread-2) [i.h.c.ControllerVerticle] Uncaught error: java.lang.NullPointerException
  at io.hyperfoil.clustering.AgentVerticle.lambda$initBenchmark$9(AgentVerticle.java:247)
  at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:264)
  at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:246)
  at io.vertx.core.impl.EventLoopContext.lambda$runOnContext$0(EventLoopContext.java:43)
  at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
  at java.base/java.lang.Thread.run(Thread.java:829)

23:17:27,754 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerVerticle] Run 0000 completed
23:17:27,960 INFO  (vert.x-eventloop-thread-0) [i.h.c.ControllerVerticle] Successfully persisted run 0000

The report is generated as expected, though. I will whitelist the error as a Hyperfoil thing.

Karm commented 10 months ago

@roberttoyonaga Thanks for explaining the numbers.

timeToFirstOKRequestMs 85. Soo...the timeToFirstOKRequestMs is 85% worse,

Currently, the diff calculation is as follows: Absolute_Value(JFR_measurement - NO_JFR_measurement) / NO_JFR_measurement. So it shows the % difference between the with/without JFR runs, relative to running normally without JFR. None of the diff values reported are absolute values.

I think I get it. Let's say it took 100ms with JFR and 80ms without JFR:

jshell> 
   ...> long measurementsJfr=100;
   ...> long measurementsNoJfr=80;
   ...> System.out.println((Math.abs(measurementsJfr - measurementsNoJfr) * 100.0 / measurementsNoJfr));
   ...> 
measurementsJfr ==> 100
measurementsNoJfr ==> 80
25.0

So having JFR on in this example brings a 25% penalty?

With some numbers from my laptop:

work Measurements Diff %:

timeToFirstOKRequestMs  RSSKb  executableSizeKb  meanResponseTimeNs
47                      42     5                 109

regular Measurements Diff %:

timeToFirstOKRequestMs  RSSKb  executableSizeKb  meanResponseTimeNs
32                      17     5                 6

Can I say that having JFR on for the work test brought a 47% slower timeToFirstOKRequestMs, and in the regular case it was 32% slower? With meanResponseTimeNs being 109% slower with JFR on for the work test, while being just 6% slower in the regular test case?

I.e. the bigger, the worse. Correct?

Karm commented 10 months ago

@roberttoyonaga Are you O.K. with these docs notes?

https://github.com/Karm/mandrel-integration-tests/pull/179/commits/996cf65467656081d12e273d664d7a0bc351d866

Karm commented 10 months ago

The ERROR is gone, but the sad news is that for Mandrel 23.0, the 80-minute timeout on GHA is not enough anymore. That is concerning, because the TS does not do much new stuff; the main battery of tests, apart from this PR, is fairly old. Our subsequent releases should perform faster, not slower.

I will see if I can raise the timeout conditionally in the matrix, so that we don't have to raise it for all combinations.

My mistake. I did not realize the test actually does not run for older Mandrel. So it's reasonable to prolong the timeout.

Karm commented 10 months ago

@roberttoyonaga @galderz CI is green. THX! Let's merge and leave future enhancements for another PR.

:partying_face: It's great :)

roberttoyonaga commented 10 months ago

Can I say that having JFR on for the work test brought a 47% slower timeToFirstOKRequestMs, and in the regular case it was 32% slower? With meanResponseTimeNs being 109% slower with JFR on for the work test, while being just 6% slower in the regular test case? I.e. the bigger, the worse. Correct?

Yes that understanding is correct.

@roberttoyonaga Are you O.K. with these docs notes?

I'm happy with those notes.

Thanks Karm!