Poor performance testing results

qfcgc commented 2 years ago

Describe the bug

Under heavy load during performance testing, Zeebe brokers crash from time to time. We checked Grafana graphs but we couldn't understand the root cause of all the deviations. There is a performance testing report attached, there are screenshots and comments from our 12-hour performance testing

To Reproduce

We encounter such incidents from time to time while testing the performance of our system. Under heavy load, the probability of incidents is higher, but it happens quite unexpectedly all the time

Expected behavior

Under heavy load, the system is stable

Environment:

OS: Linux
Zeebe Version: 8.0.5
Configuration: 12 brokers, 20 partitions

Attached files:

Zeebe Performance Testing report.pdf

saig0 commented 2 years ago

The topic was first raised in the Slack channel: https://camunda-platform.slack.com/archives/C6WGNHV2A/p1661953261878119

Zelldon commented 2 years ago

@qfcgc we need some more details regarding your setup.

In what environment you're running? K8? Which cloud provider? Are you using the helm charts? How does your configuration look like? Please provide these details otherwise it is hard to help you here. Performance results are highly dependent on configuration and resources.

qfcgc commented 2 years ago

@Zelldon, thank you for your response!

We use the following configuration:

brokers + gateways: ltamzebro01...ltamzebro12 (12 servers): Red Hat Enterprise Linux 7 (64 bit), virtual server, 50 Gb HDD, 12 CPU, 16 Gb RAM workers: api,api1...api4 (5 servers): Red Hat Enterprise Linux 7 (64 bit), virtual server, 20 Gb HDD, 4 CPU, 15 Gb RAM

Our workers (within Java applications) are managed by Mesos Marathon (with plans to move to K8s). Our brokers and gateways are launched via Ansible scripts, and we manage them manually. Applications are containerized with Docker, and brokers and gateways are also launched within containers. The following scripts are used:

broker:

docker run --rm \
    -m 4500m \
    -v /etc/zeebe_broker/mobile-blue:/usr/local/zeebe/config \
    -v /data/zeebe_broker/mobile-blue:/usr/local/zeebe/data \
    -v /etc/zeebe_broker/mobile-blue/zeebe-kafka-exporter-3.0.0-jar-with-dependencies.jar:/usr/local/zeebe/lib/zeebe-kafka-exporter-3.0.0-jar-with-dependencies.jar \
    -v /etc/zeebe_broker/mobile-blue/kafka-clients-2.8.0.jar:/usr/local/zeebe/lib/kafka-clients-2.8.0.jar \
    -v /etc/zeebe_broker/mobile-blue/zeebe-hazelcast-exporter-1.0.1-jar-with-dependencies.jar:/usr/local/zeebe/lib/zeebe-hazelcast-exporter-1.0.1-jar-with-dependencies.jar \
    -p 35500:35500 \
    -p 35501:35501 \
    -p 35502:35502 \
    -e "JAVA_OPTS=-Xms128m -Xmx4000m" \
    -e SERVER_PORT=35500 \
    --name zeebe-broker-mobile-blue domain.name/camunda/zeebe:8.0.5

gateway:

docker run --rm \
    -e ZEEBE_STANDALONE_GATEWAY=true \
    -e SERVER_PORT=36501 \
    -m 1024m \
    -v /etc/zeebe_gateway/mobile-blue:/usr/local/zeebe/config \
    -v /etc/zeebe_gateway/mobile-blue/keycloak-interceptor/:/tmp/ \
    -e "ZEEBE_GATEWAY_SECURITY_KEYCLOAK_CONFIG_PATH=/usr/local/zeebe/config/zeebe-keycloak-config-blue.json" \
    -e "JAVA_OPTS=-Djavax.net.ssl.trustStore=/usr/local/zeebe/config/certs" \
    -e "CONFIG_FORCE_zeebeKeycloak_clientSecret=our_secret_is_here" \
    --network host \
    --name zeebe-gateway-mobile-blue domain.name/camunda/zeebe:8.0.5

Config files:

Zeebe broker: zeebe_broker_config.txt

Zeebe gateway: zeebe_gateway_config.txt

qfcgc commented 2 years ago

We've encountered a new incident where our broker failed, and it is interesting - it happened 30 minutes after performance testing ended. We can see on different graphs how the latency of various events has increased. Please have a look at it. The reason could be similar because it is the same environment. The file with Grafana graphs of this incident is attached. Zeebe Performance Testing note, incident after testing.pdf

Zelldon commented 2 years ago

@romansmirnov Do you have time to take a look at that? I'm still ooo.

Zelldon commented 2 years ago

Hey @qfcgc

sorry again for the delay! I took the time today, to look into your issue.

Summary:

Just to summarize (also for myself):

You're not in production, but you're running benchmarks (performance tests) and they seem to fail. The benchmarks take around 12 hours. You have seen "incidents" happen from time to time under high load.

Config

Your configuration looks like the following:

bare metal: Our brokers and gateways are launched via Ansible scripts, and managed manually.
brokers + gateways: ltamzebro01...ltamzebro12 (12 servers): Red Hat Enterprise Linux 7 (64 bit), virtual server, 50 Gb HDD, 12 CPU, 16 Gb RAM
workers: api,api1...api4 (5 servers): Red Hat Enterprise Linux 7 (64 bit), virtual server, 20 Gb HDD, 4 CPU, 15 Gb RAM
Partition count 20
cluster size 12
repl 3

Broker:

docker run --rm \
        -m 4500m \
        -v /etc/zeebe_broker/mobile-blue:/usr/local/zeebe/config \
        -v /data/zeebe_broker/mobile-blue:/usr/local/zeebe/data \
        -v /etc/zeebe_broker/mobile-blue/zeebe-kafka-exporter-3.0.0-jar-with-dependencies.jar:/usr/local/zeebe/lib/zeebe-kafka-exporter-3.0.0-jar-with-dependencies.jar \
        -v /etc/zeebe_broker/mobile-blue/kafka-clients-2.8.0.jar:/usr/local/zeebe/lib/kafka-clients-2.8.0.jar \
        -v /etc/zeebe_broker/mobile-blue/zeebe-hazelcast-exporter-1.0.1-jar-with-dependencies.jar:/usr/local/zeebe/lib/zeebe-hazelcast-exporter-1.0.1-jar-with-dependencies.jar \
        -p 35500:35500 \
        -p 35501:35501 \
        -p 35502:35502 \
        -e "JAVA_OPTS=-Xms128m -Xmx4000m" \
        -e SERVER_PORT=35500 \
        --name zeebe-broker-mobile-blue domain.name/camunda/zeebe:8.0.5

Gateway:

docker run --rm \
        -e ZEEBE_STANDALONE_GATEWAY=true \
        -e SERVER_PORT=36501 \
        -m 1024m \
        -v /etc/zeebe_gateway/mobile-blue:/usr/local/zeebe/config \
        -v /etc/zeebe_gateway/mobile-blue/keycloak-interceptor/:/tmp/ \
        -e "ZEEBE_GATEWAY_SECURITY_KEYCLOAK_CONFIG_PATH=/usr/local/zeebe/config/zeebe-keycloak-config-blue.json" \
        -e "JAVA_OPTS=-Djavax.net.ssl.trustStore=/usr/local/zeebe/config/certs" \
        -e "CONFIG_FORCE_zeebeKeycloak_clientSecret=our_secret_is_here" \
        --network host \
        --name zeebe-gateway-mobile-blue domain.name/camunda/zeebe:8.0.5

Analysis

Based on my summary and your reports I tried to understand what the issue is you're facing, but I'm not 100% sure yet what the actual issue is. Please be aware that leader changes can and will happen in a distributed system like Zeebe. This should be expected this is not a failure.

Due to this I have some open questions which would be great if you could answer them.

Open Questions:

What is the actual problem you are facing?
What is the impact? Please be specific.
Do you have numbers of disk throttling? IOPs throttling count or something similar? Please be aware that Zeebe is really IO intense.
How does the commit latency look like? Write latency? How does the processing queue look like?
What is the reason for the chosen configuration, speaking of the partition count and cluster size.
Are you running gateway and broker on same machines?
What is the expected load?
How does your processes look like?

Suggestions

Even if I don't understand full yet the real issue/incident I think I already have some hints and pointers and suggestions you could take a look at.

Partitions

Your current configuration looks like this:

$ ./partitionDistribution.sh 12 20 3
Distribution:
P\N|    N 0|    N 1|    N 2|    N 3|    N 4|    N 5|    N 6|    N 7|    N 8|    N 9|    N 10|   N 11
P 0|    L  |    F  |    F  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  
P 1|    -  |    L  |    F  |    F  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  
P 2|    -  |    -  |    L  |    F  |    F  |    -  |    -  |    -  |    -  |    -  |    -  |    -  
P 3|    -  |    -  |    -  |    L  |    F  |    F  |    -  |    -  |    -  |    -  |    -  |    -  
P 4|    -  |    -  |    -  |    -  |    L  |    F  |    F  |    -  |    -  |    -  |    -  |    -  
P 5|    -  |    -  |    -  |    -  |    -  |    L  |    F  |    F  |    -  |    -  |    -  |    -  
P 6|    -  |    -  |    -  |    -  |    -  |    -  |    L  |    F  |    F  |    -  |    -  |    -  
P 7|    -  |    -  |    -  |    -  |    -  |    -  |    -  |    L  |    F  |    F  |    -  |    -  
P 8|    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    L  |    F  |    F  |    -  
P 9|    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    L  |    F  |    F  
P 10|   F  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    L  |    F  
P 11|   F  |    F  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    L  
P 12|   L  |    F  |    F  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  
P 13|   -  |    L  |    F  |    F  |    -  |    -  |    -  |    -  |    -  |    -  |    -  |    -  
P 14|   -  |    -  |    L  |    F  |    F  |    -  |    -  |    -  |    -  |    -  |    -  |    -  
P 15|   -  |    -  |    -  |    L  |    F  |    F  |    -  |    -  |    -  |    -  |    -  |    -  
P 16|   -  |    -  |    -  |    -  |    L  |    F  |    F  |    -  |    -  |    -  |    -  |    -  
P 17|   -  |    -  |    -  |    -  |    -  |    L  |    F  |    F  |    -  |    -  |    -  |    -  
P 18|   -  |    -  |    -  |    -  |    -  |    -  |    L  |    F  |    F  |    -  |    -  |    -  
P 19|   -  |    -  |    -  |    -  |    -  |    -  |    -  |    L  |    F  |    F  |    -  |    -  

Partitions per Node:
N 0: 4
N 1: 5
N 2: 6
N 3: 6
N 4: 6
N 5: 6
N 6: 6
N 7: 6
N 8: 5
N 9: 4
N 10: 3
N 11: 3

This means that most of the work is done by Node 2-7, which might be not ideally especially if all get the same resources. Might be worth to adjust the partition distribution.

Regarding resources I would recommend per partition on a node to assign ~500 mb and 1-2 CPU cores. 2 CPU especially if you run exporters, which you seem to do.

Disk

As written earlier Zeebe is really IO intense, you have one disk assigned which is an HDD. If you have 6 partitions on a node it is likely that you running into IO throttling issues. I would always advise to use SSD with Zeebe.

Gateway

I hope you're not running the gateway on the same nodes as the brokers, otherwise it is likely that you also running into CPU contention.

Your gateway has only one thread available to process all requests, you should increase that. (managementThreads: 1) Especially since you seem to use interceptors to connect with keycloak (?), which will also add delay on the requests.

I hope my suggestions already help a bit and maybe you can already rerun your benchmark. Would be great if you can answer my above questions as well.

Greets Chris

Zelldon commented 2 years ago

Please reopen if you have new insights.

camunda / camunda