chrisgleissner / loom-webflux-benchmarks

Benchmarks of Spring Boot REST service comparing Java 21 Virtual Threads (Project Loom) with WebFlux (Project Reactor).
Apache License 2.0

Benchmark of Java Virtual Threads vs WebFlux


This Java 21 project benchmarks a simple Spring Boot 3.3 microservice using configurable scenarios, comparing Java Virtual Threads (introduced by Project Loom, JEP 444) using Tomcat and Netty with Spring WebFlux (relying on Project Reactor) using Netty.

All benchmark results below come from a dedicated bare metal test environment. The benchmark is also scheduled to run monthly on GitHub-hosted runners, using all combinations of (Ubuntu 22.04, Ubuntu 24.04) and (Java 21, Java 23).

Background

Spring WebFlux and Virtual Threads are alternative technologies for building Java microservices that support a high number of concurrent users. Both map all incoming requests to very few shared operating system threads, which reduces the resource overhead incurred by dedicating a single operating system thread to each user.

Spring WebFlux was first introduced in September 2017. Virtual Threads were first introduced as a preview feature in Java 19 and were finalized in Java 21 in September 2023.

TL;DR

[!NOTE] In a nutshell, the benchmark results are:

Virtual Threads on Netty (using blocking code) showed very similar and often superior performance characteristics (latency percentiles, requests per second, system load) compared with WebFlux on Netty (using non-blocking code and relying on Mono and Flux from Project Reactor):

  • Virtual Threads on Netty was the benchmark winner for ca. 40% more combinations of metrics and benchmark scenarios than Project Reactor on Netty.
  • For all high user count scenarios, it had the lowest latency as well as the largest number of requests for the entirety of each benchmark run.
  • In many cases (e.g. 60k-vus-smooth-spike-get-post-movies), the 90th and 99th percentile latencies (P90 and P99) were considerably lower for Virtual Threads on Netty when compared with WebFlux on Netty.
  • For both approaches, we could scale up to the same number of virtual users (and thus TCP connections) before exhausting the CPU and running into time-outs due to rejected TCP connection requests.

Virtual Threads on Tomcat are not recommended for high load:

  • We saw considerably higher resource use compared with the two Netty-based approaches.
  • There were many time-out errors, visualized by red dots in the charts, even when the CPU use was far below 100%. In contrast, none of the Netty-based scenarios experienced any errors, even with a CPU use of 100%.

Benchmark Winners

Below are the top-performing approaches across all scenarios and metrics, visualizing the contents of results/scenarios-default/results.csv:

All Approaches

This chart compares Project Loom (on both Tomcat and Netty) with Project Reactor (on Netty).

All Results

Netty-based Approaches

This chart is based on the same benchmark as before, but only considers Netty-based approaches.

Netty Results

Benchmark Features

Benchmark Design

The benchmark is driven by k6, which repeatedly issues HTTP requests to a service listening at http://localhost:8080/.

The service exposes multiple REST endpoints. The implementation of each has the same 3 stages:

  1. HTTP Call: If $delayCallDepth > 0, call GET /$approach/epoch-millis recursively $delayCallDepth times to mimic calls to upstream service(s).
  2. Wait: If $delayCallDepth = 0, wait $delayInMillis (default: 100) to mimic the delay incurred by a network call, filesystem access, or similar.
    • Whilst the request waits, its operating system thread can be reused by another request.
    • The imperative approaches (platform-tomcat, loom-tomcat, and loom-netty) use blocking wait whilst the reactive approach (webflux-netty) uses non-blocking wait.
  3. Calculate and Return Response specific to REST endpoint.
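
The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service's actual Java implementation: the function name `handle_request` is hypothetical, the upstream HTTP call is replaced by a plain in-process function call, and the response is reduced to the epoch milliseconds returned by the epoch-millis endpoint.

```python
import time


def handle_request(delay_call_depth: int, delay_in_millis: int = 100, sleep=time.sleep) -> int:
    """Hypothetical stand-in for one request to /$approach/epoch-millis."""
    if delay_call_depth > 0:
        # Stage 1: mimic an upstream HTTP call by recursing with a reduced depth.
        # (The real service issues an HTTP GET here; this sketch calls itself directly.)
        handle_request(delay_call_depth - 1, delay_in_millis, sleep)
    else:
        # Stage 2: only at depth 0, wait to mimic network or filesystem latency.
        sleep(delay_in_millis / 1000)
    # Stage 3: calculate the endpoint-specific response (here: milliseconds since the epoch).
    return int(time.time() * 1000)


if __name__ == "__main__":
    print(handle_request(delay_call_depth=1, delay_in_millis=100))
```

Note that only the innermost call waits, so a call depth of N produces N nested calls but a single delay of $delayInMillis.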

Sample Flow

Get all movies using the loom-netty approach, with an HTTP call depth of 1 and a delay of 100 milliseconds:

sequenceDiagram
    participant k6s
    participant service
    k6s->>+service: GET /loom-netty/movies?delayCallDepth=1&delayMillis=100
    service->>+service: GET /loom-netty/epoch-millis?delayCallDepth=0&delayMillis=100
    service->>service: Wait 100 milliseconds
    service-->>-service: Return current epoch millis
    service->>service: Find movies
    service-->>-k6s: Return movies

REST APIs

The microservice under test exposes several RESTful APIs. In the following descriptions, $approach is the approach under test and can be one of loom-tomcat, loom-netty, and webflux-netty.

All REST APIs support the following query parameters:

epoch-millis

The TimeController returns the milliseconds since the epoch, i.e. 1 Jan 1970:

movies

The MovieController gets and saves movies which are stored in an H2 in-memory DB via Spring Data JPA, fronted by a Caffeine-backed Spring Boot cache:

DB Considerations:

Supported requests:

Requirements

Software

Hardware

The hardware requirements depend purely on the scenarios configured in src/main/resources/scenarios/scenarios-default.csv. The following is recommended to run the default scenarios committed to this repo:

Setup

The following instructions assume you are using a Debian-based Linux such as Ubuntu 22.04 or 24.04.

Java 21

You'll need Java 21 or above:

sudo apt install openjdk-21-jdk

k6

k6 is used to load the service:

sudo gpg -k
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt-get update
sudo apt-get install k6

Python 3, matplotlib, sar and sadf

Python 3 and matplotlib are used to convert the CSV output of k6 and sar/sadf to a single PNG chart. The sar and sadf tools come as part of sysstat and are used to measure resource use. To install them run:

sudo apt update && sudo apt install -y python3 python3-matplotlib sysstat

Linux Optimizations

The following adjustments optimize Linux for HTTP load tests.

Increase Open File Limit

Ensure your system can handle a large number of concurrent connections:

printf '* soft nofile 1048576\n* hard nofile 1048576\n' | sudo tee -a /etc/security/limits.conf 

Increase Port Range and Allow Fast Connection Reuse

Increase the port range for outgoing TCP connections and allow quick connection reuse:

printf 'net.ipv4.ip_local_port_range=1024 65535\nnet.ipv4.tcp_tw_reuse = 1\n' | sudo tee -a /etc/sysctl.conf && sudo sysctl -p

Activate Changes

Log out and back in.
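
After logging back in, it can be useful to confirm that the new limits are visible to your processes. The following Python sketch is one possible check and is not part of this repo: the helper names are hypothetical, it assumes the standard Linux procfs path for the port range, and it assumes the script runs in a login session where /etc/security/limits.conf is applied.

```python
import os
import resource

TARGET_NOFILE = 1048576  # value written to /etc/security/limits.conf above
PORT_RANGE_FILE = "/proc/sys/net/ipv4/ip_local_port_range"  # standard Linux procfs path


def nofile_ok(soft_limit: int, target: int = TARGET_NOFILE) -> bool:
    """Return True if the soft open-file limit is unlimited or at least the target."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= target


def parse_port_range(text: str) -> tuple[int, int]:
    """Parse the 'low high' pair as exposed by the kernel."""
    low, high = text.split()
    return int(low), int(high)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files: soft={soft} hard={hard} ok={nofile_ok(soft)}")
    if os.path.exists(PORT_RANGE_FILE):
        with open(PORT_RANGE_FILE) as f:
            print("local port range:", parse_port_range(f.read()))
```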

Execution

benchmark.sh

Run a benchmark for each combination of approaches and scenarios defined in a scenario CSV file. Results are stored in build/results/:

./benchmark.sh 

Usage as per benchmark.sh -h:

Usage: benchmark.sh [OPTION]... [SCENARIO_FILE]
Runs benchmarks configured by a scenario file.

SCENARIO_FILE:     Scenario configuration CSV file in src/main/resources/scenarios/. Default: scenarios-default.csv

OPTION:
  -a <approaches>  Comma-separated list of approaches to test. Default: loom-tomcat, loom-netty, webflux-netty
                   Supported approaches: platform-tomcat, loom-tomcat, loom-netty, webflux-netty
  -C               Keep CSV files used to create chart. Default: false
  -h               Print this help

benchmarks.sh

This is a wrapper over benchmark.sh and supports multiple scenario files:

./benchmarks.sh 

Usage as per benchmarks.sh -h:

Usage: benchmarks.sh [OPTION]... [SCENARIO_FILE]...
Wrapper over benchmark.sh that supports multiple scenario files and optionally suspends the system on completion.

SCENARIO_FILE:           Zero or more space-separated scenario configuration CSV files in src/main/resources/scenarios/.
                         Default: scenarios-default.csv scenarios-deep-call-stack.csv scenarios-postgres.csv scenarios-sharp-spikes.csv scenarios-soaktest.csv

OPTION:
  -d, --dry-run          Print what would be done without actually performing it.
  -k, --kill-java        Kill all Java processes after each benchmark. Default: false
  -o, --options "<opts>" Pass additional options to the benchmark.sh script. Run "./benchmark.sh -h" for supported options.
  -s, --suspend          Suspend the system upon completion of the script. Default: false
  -h, --help             Show this help message and exit.

Please note that the default configured scenarios may take several hours to complete.

Approaches

All approaches use the same Spring Boot 3.3 version.

Scenarios

Default Scenarios

See src/main/resources/scenarios/scenarios-default.csv.

| Scenario | Domain | Description | Virtual Users (VU) | Requests per Second (RPS) | Client delay (ms) | Server delay (ms) | Delay Call Depth |
|---|---|---|---|---|---|---|---|
| smoketest | Time | Smoke test | 5 | 5 | 0 | 100 | 0 |
| 5k-vus-and-rps-get-time | Time | Constant users, constant request rate | 5,000 | 5,000 | 0 | 100 | 0 |
| 5k-vus-and-rps-get-movies | Movies | Constant users, constant request rate | 5,000 | 5,000 | 0 | 100 | 0 |
| 10k-vus-and-rps-get-movies | Movies | Constant users, constant request rate | 10,000 | 10,000 | 0 | 100 | 0 |
| 10k-vus-and-rps-get-movies-call-depth-1 | Movies | Constant users, constant request rate | 10,000 | 10,000 | 0 | 100 | 1 |
| 20k-vus-stepped-spike-get-movies | Movies | Stepped user spike | 0 - 20,000 | Depends on users and delays | 1000 - 3000 (random) | 100 | 0 |
| 20k-vus-smooth-spike-get-movies | Movies | Smooth user spike | 0 - 20,000 | Depends on users and delays | 1000 - 3000 (random) | 100 | 0 |
| 20k-vus-smooth-spike-get-post-movies | Movies | Smooth user spike | 0 - 20,000 | Depends on users and delays | 1000 - 3000 (random) | 100 | 0 |
| 20k-vus-smooth-spike-get-post-movies-call-depth-1 | Movies | Smooth user spike | 0 - 20,000 | Depends on users and delays | 1000 - 3000 (random) | 100 | 1 |
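
For the spike scenarios, the table lists the RPS as depending on users and delays. A rough back-of-the-envelope estimate can be sketched as follows. It rests on assumptions that are not confirmed by this README: each virtual user is taken to issue one request per iteration and then sleep for the client delay, the random client delay is treated as uniformly distributed, and server processing time beyond the configured server delay is ignored. The result is therefore an optimistic upper bound rather than a measured value.

```python
def estimated_peak_rps(virtual_users: int,
                       client_delay_ms_min: float,
                       client_delay_ms_max: float,
                       server_delay_ms: float) -> float:
    """Estimate requests per second as users divided by the mean iteration time."""
    # Mean of the (assumed uniform) random client-side delay, in seconds.
    mean_client_delay_s = (client_delay_ms_min + client_delay_ms_max) / 2 / 1000
    # One iteration = one request (at least the server delay) plus the client delay.
    iteration_s = mean_client_delay_s + server_delay_ms / 1000
    return virtual_users / iteration_s


if __name__ == "__main__":
    # 20,000 users, 1000 - 3000 ms client delay, 100 ms server delay: roughly 9,524 RPS.
    print(round(estimated_peak_rps(20_000, 1000, 3000, 100)))
```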

High-Load Scenarios

These scenarios examine particularly high load.

Multi-Client Scenarios

These scenarios compare Spring Boot's RestClient and WebClient implementations.

For all approaches except webflux-netty, the scenario name specifies which RestClient or WebClient implementation is used. The webflux-netty approach is fully reactive and therefore always uses the non-blocking WebClient.

The following clients are compared:

Other Scenarios

Steps

The benchmark run for each $scenario consists of the following phases and steps:

Before Benchmark

Benchmark

After Benchmark

Config

Common

Scenario-specific

Each line in src/main/resources/scenarios/scenarios-default.csv configures a test scenario which is performed first for Java Virtual Threads, then for WebFlux.

Example

| scenario | k6Config | serverProfiles | delayCallDepth | delayInMillis | connections | requestsPerSecond | warmupDurationInSeconds | testDurationInSeconds |
|---|---|---|---|---|---|---|---|---|
| 5k-vus-and-rps-get-time | get-time.js | | 0 | 100 | 5000 | 5000 | 10 | 300 |
| 20k-vus-smooth-spike-get-movies | k6-20k-vus-smooth-spike-get-movies.js | postgres | 0 | 100 | 20000 | | 0 | 300 |

Columns

  1. scenario: Name of the scenario. It is printed at the top of each diagram.
  2. k6Config: Name of the k6 config file, which is assumed to be in the config folder.
  3. serverProfiles: Pipe-delimited Spring profiles which are also used to start and stop Docker containers. For example, specifying the value postgres|no-cache has these effects:
    • The Spring Boot profiles postgres,no-cache are added to the default Spring Boot profile of $approach.
    • The files src/main/docker/docker-compose-postgres.yaml and src/main/docker/docker-compose-no-cache.yaml (if existent) are used to start/stop Docker containers before/after each scenario run.
  4. delayCallDepth: Depth of recursive HTTP call stack to $approach/epoch-millis endpoint prior to server-side delay.
    • Mimics calls to upstream services which allow for reuse of the current platform thread.
    • For example, a value of 0 means that the service waits for $delayInMillis milliseconds immediately upon receiving a request.
    • Otherwise, it calls the $approach/epoch-millis endpoint with a call depth of ${delayCallDepth - 1}.
    • This results in a recursive HTTP-request-based descent into the service, creating a call stack of depth $delayCallDepth.
  5. delayInMillis: Server-side delay of each request, in milliseconds. Mimics a delay such as a DB call, which allows the current platform thread to be reused.
  6. connections: Number of TCP connections, i.e. virtual users.
  7. requestsPerSecond: Number of requests per second across all connections. Left empty for scenarios where the number of requests per second is organically derived based on the number of connections, the request latency, and any explicit client-side delays.
  8. warmupDurationInSeconds: Duration of the warm-up iteration before the actual test. Warm-up is skipped if 0.
  9. testDurationInSeconds: Duration of the test iteration.
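
To make the column semantics concrete, here is a minimal Python sketch that parses scenario rows into typed records. It is not the repo's own parsing logic: the `Scenario` class and `parse_scenarios` function are hypothetical, and the sketch assumes a comma-separated file with a header row using the column names listed above. The sample rows mirror the example; the blank fields stand for an unset serverProfiles and an unset requestsPerSecond.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scenario:
    scenario: str
    k6_config: str
    server_profiles: list[str]
    delay_call_depth: int
    delay_in_millis: int
    connections: int
    requests_per_second: Optional[int]  # None when derived organically
    warmup_duration_in_seconds: int
    test_duration_in_seconds: int

    def compose_files(self) -> list[str]:
        # Candidate Docker Compose files per profile; the existence check is omitted here.
        return [f"src/main/docker/docker-compose-{p}.yaml" for p in self.server_profiles]


def split_profiles(value: str) -> list[str]:
    """Split a pipe-delimited profile string such as 'postgres|no-cache'."""
    return [p for p in value.split("|") if p]


def parse_scenarios(text: str) -> list[Scenario]:
    rows = csv.DictReader(io.StringIO(text))
    return [
        Scenario(
            scenario=r["scenario"],
            k6_config=r["k6Config"],
            server_profiles=split_profiles(r["serverProfiles"]),
            delay_call_depth=int(r["delayCallDepth"]),
            delay_in_millis=int(r["delayInMillis"]),
            connections=int(r["connections"]),
            requests_per_second=int(r["requestsPerSecond"]) if r["requestsPerSecond"] else None,
            warmup_duration_in_seconds=int(r["warmupDurationInSeconds"]),
            test_duration_in_seconds=int(r["testDurationInSeconds"]),
        )
        for r in rows
    ]


# Illustrative input; the comma delimiter is an assumption.
SAMPLE = """scenario,k6Config,serverProfiles,delayCallDepth,delayInMillis,connections,requestsPerSecond,warmupDurationInSeconds,testDurationInSeconds
5k-vus-and-rps-get-time,get-time.js,,0,100,5000,5000,10,300
20k-vus-smooth-spike-get-movies,k6-20k-vus-smooth-spike-get-movies.js,postgres,0,100,20000,,0,300
"""

if __name__ == "__main__":
    for s in parse_scenarios(SAMPLE):
        print(s.scenario, s.requests_per_second, s.compose_files())
```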

Results

Test Environment

Hardware

Software

Charts

The following charts show the results of each scenario, sorted by ascending scenario load.

Errors

Any lines in the client-side or server-side log files that contain the term error (case-insensitive) are preserved. You can find them in error log files, located in the results folder alongside the generated PNG files.
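
The filtering rule described above amounts to a case-insensitive substring match. As a minimal sketch of that rule only (the benchmark scripts may well implement it differently, for example with a shell tool, and the function name and sample lines here are hypothetical):

```python
def error_lines(lines):
    """Keep only lines containing 'error', ignoring case."""
    return [line for line in lines if "error" in line.lower()]


if __name__ == "__main__":
    sample = [
        "INFO  request completed",
        "WARN  Request ERROR: timeout",
        "java.io.IOException: connection reset",
        "level=error msg=\"request failed\"",
    ]
    for line in error_lines(sample):
        print(line)
```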

Failed requests appear in the latency chart as red dots and in the RPS chart as part of a continuous orange line. They also leave a trace in the $approach-latency.csv file, if it is preserved by running the benchmark with the -C option:

5k-vus-and-rps-get-time

This scenario aims to maintain a steady number of 5k virtual users (VUs, i.e. TCP connections) as well as 5k requests per second (RPS) across all users for 3 minutes:

Virtual Threads (Tomcat)

Loom

Virtual Threads (Netty)

Loom

WebFlux (Netty)

WebFlux

5k-vus-and-rps-get-movies

Like the previous scenario, but the response body contains movies as JSON.

For further details, please see the movies section.

Virtual Threads (Tomcat)

Loom

Virtual Threads (Netty)

Loom

WebFlux (Netty)

WebFlux

10k-vus-and-rps-get-movies

Like the previous scenario, but with 10k virtual users and 10k requests per second.

Virtual Threads (Tomcat)

Loom

Virtual Threads (Netty)

Loom

WebFlux (Netty)

WebFlux

10k-vus-and-rps-get-movies-call-depth-1

Like the previous scenario, but mimics a request to an upstream service.

Virtual Threads (Tomcat)

Loom

Virtual Threads (Netty)

Loom

WebFlux (Netty)

WebFlux

20k-vus-stepped-spike-get-movies

This scenario ramps up virtual users (and thus TCP connections) from 0 to 20k in multiple steps, then back down:

Virtual Threads (Tomcat)

Loom

Virtual Threads (Netty)

Loom

WebFlux (Netty)

WebFlux

20k-vus-smooth-spike-get-movies

Like the previous scenario, but with a linear ramp-up and ramp-down.

Virtual Threads (Tomcat)

Loom

Virtual Threads (Netty)

Loom

WebFlux (Netty)

WebFlux

20k-vus-smooth-spike-get-post-movies

Like the previous scenario, but instead of just getting movies, we are now additionally saving them:

For further details, please see the movies section.

Virtual Threads (Tomcat)

Loom

Virtual Threads (Netty)

Loom

WebFlux (Netty)

WebFlux

20k-vus-smooth-spike-get-post-movies-call-depth-1

Like the previous scenario, but mimics a call to an upstream service as explained in 10k-vus-and-rps-get-movies-call-depth-1.

[!NOTE] For loom-netty and webflux-netty, this scenario was CPU-contended on the test environment upon reaching ca. 5,000 RPS. Whilst causing no errors, it drastically increased latencies.

Virtual Threads (Tomcat)

Loom

Virtual Threads (Netty)

Loom

WebFlux (Netty)

WebFlux

High Load Results

The following results are based on scenarios-high-load.csv which scales up to 60k users. They were executed in a VirtualBox VM on more powerful hardware and using a different Linux Kernel version.

Hardware

Software

Summary


60k-vus-smooth-spike-get-post-movies

Like 20k-vus-smooth-spike-get-post-movies, but scaling up to 60k users.

Virtual Threads (Netty)

Loom

WebFlux (Netty)

WebFlux