gothub opened this issue 6 years ago
Tests were run on the cluster on 2018-04-02.
The test workload consisted of 1000 quality report generation requests for a metadata document stored on disk, local to the quality engine. One pod contained the metadig-controller, which received the report requests; from 1 to 15 metadig-worker pods created the quality reports. Four test runs were conducted, each run with a different number of worker pods.
Each of the two nodes (Ubuntu VMs) in the k8s cluster has 12 CPU cores and 16GB of memory.
run # | # worker pods | Elapsed time (minutes) | Ave. worker elapsed (seconds) | memory usage (GB) docker1 | memory usage (GB) docker2 | load ave docker1 | load ave docker2 |
---|---|---|---|---|---|---|---|
1 | 1 | 21 | 1 | 1.2 | 4.7 | 0.49 | 1.1 |
2 | 5 | 4 | 1 | 7.3 | 6.6 | 3.6 | 2.7 |
3 | 10 | 2 | 1 | 11.2 | 12.8 | 4.2 | 6.0 |
4 | 15 | 5 | 4 | 15.3 | 15.4 | 79.1 | 59.2 |
Cluster performance appears to begin to degrade with only 15 workers, when the total available cluster memory approaches exhaustion.
Total cluster CPU usage never came close to the maximum.
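The diminishing returns in the table above can be summarized as parallel speedup and efficiency. This is a back-of-the-envelope check using only the elapsed times reported above, not part of the original test harness:

```python
# Speedup/efficiency for the first test round:
# elapsed minutes for 1000 report requests at each worker count.
elapsed = {1: 21, 5: 4, 10: 2, 15: 5}

baseline = elapsed[1]
for workers, minutes in elapsed.items():
    speedup = baseline / minutes
    efficiency = speedup / workers
    print(f"{workers:2d} workers: speedup {speedup:5.2f}x, efficiency {efficiency:.0%}")
```

Efficiency stays near (or above, given the coarse one-minute timing resolution) 100% up to 10 workers, then collapses to under 30% at 15 workers, matching the memory-exhaustion observation.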
Starting with OpenJDK 8, the docker images available from https://hub.docker.com/_/openjdk/ have some level of optimization for running inside a container. Here is a description from Docker Hub of the java options recommended for use inside containers:

> Inside Linux containers, recent versions of OpenJDK 8 can correctly detect container-limited number of CPU cores by default. To enable the detection of container-limited amount of RAM the following options can be used:
>
> `java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap ...`

These arguments might also be used in conjunction with `-Xms`, which specifies the initial Java heap size.

Note that the performance statistics reported above did not use any of these java command-line arguments, but another round of tests will be run using them.
The current docker image for `metadig-worker` is 173MB in size and uses the `openjdk:8-jre-alpine` tagged docker image. This image seems to load quickly, as `metadig-worker` pods are created in a few seconds. The default docker image, `openjdk:8`, is about 800MB and takes almost a minute to load.
The docker image might be made even smaller using other image variants; however, this might introduce other problems, as non-standard libraries are used in some of the smaller image variants.
Nice results. It may be worth running the tests several times to get an average and variance for the worker memory before trying to optimize. Regardless, at some point we will be memory limited, so we need to be sure to put in a memory quota for the namespace that prevents more workers from being spawned than we can handle with available memory. These constraints can be specified as described in the docs: https://kubernetes.io/docs/tasks/administer-cluster/quota-memory-cpu-namespace/
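A namespace memory quota of the kind suggested can be expressed as a `ResourceQuota` object. The sketch below is illustrative only: the namespace name and the limits are placeholder assumptions sized against the 16GB-per-node cluster described above, not settings from this project:

```yaml
# Illustrative ResourceQuota; namespace name and limits are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: metadig-mem-quota
  namespace: metadig        # assumed namespace name
spec:
  hard:
    requests.memory: 24Gi   # total memory all pods in the namespace may request
    limits.memory: 28Gi     # total memory limit across all pods
```

With such a quota in place, a worker pod that declares a memory request is rejected at creation time once the namespace total would be exceeded, rather than exhausting node memory at runtime.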
The next round of tests used a `metadig-worker` container based on the `openjdk:8-jre-alpine` image, with java memory limits specified. Here is an excerpt from the `Dockerfile`:

```
CMD java -Xms128m -Xmx256m -cp ./metadig-engine.jar edu.ucsb.nceas.mdqengine.Worker
```
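With `-Xmx256m` per worker, an upper bound on total worker heap demand can be estimated directly. This is a rough sketch; it ignores non-heap JVM memory (metaspace, thread stacks, native buffers), so real usage per worker is somewhat higher:

```python
# Rough heap budget: maximum total worker heap vs. node memory.
xmx_mb = 256          # -Xmx256m per metadig-worker container
node_mem_gb = 16      # each test node has 16GB of memory

for workers in (15, 35, 65):
    total_heap_gb = workers * xmx_mb / 1024
    print(f"{workers:2d} workers -> up to {total_heap_gb:5.2f}GB heap "
          f"({total_heap_gb / node_mem_gb:.0%} of one node)")
```

At 65 workers the heap ceiling alone is about 16GB, roughly one node's total memory, which is consistent with the memory growth in the table below and with containers failing to start at 75 workers.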
run # | # worker pods | Elapsed time (minutes) | Ave. worker elapsed (seconds) | memory usage (GB) docker1 | memory usage (GB) docker2 | load ave docker1 | load ave docker2 |
---|---|---|---|---|---|---|---|
1 | 1 | 21 | 1 | 1.2 | 4.7 | 0.49 | 1.1 |
2 | 5 | 4 | 1 | 2.0 | 2.3 | 1.68 | 0.50 |
3 | 10 | 2 | 1 | 3.6 | 4.4 | 4.2 | 4.4 |
4 | 15 | 2 | 1 | 4.4 | 5.5 | 5.8 | 7.2 |
5 | 25 | 2 | 2 | 6.1 | 7.7 | 8.8 | 12.6 |
6 | 35 | 2 | 3 | 8.2 | 9.2 | 16.8 | 17.1 |
7 | 45 | 2 | 3 | 10.4 | 10.8 | 22.9 | 20.23 |
8 | 55 | 2 | 5 | 11.7 | 13.3 | 25.3 | 26.2 |
9 | 65 | 2 | 7 | 13.2 | 15.7 | 29.9 | 36.6 |
An attempt was made to run 75 workers, but not all containers would start, i.e. they would stay in the "container creating" state or repeatedly crash and restart.
This next round of tests was run on the reconfigured k8s cluster, with three nodes: docker1, docker2, docker3. The node 'docker1' is the master and is currently set to not run user pods.
These tests were submitted to k8s via the NGINX 'ingress' controller that is listening on port 30080 on docker1. NGINX receives the requests and forwards them directly to port 8080 of the 'metadig-controller' container running Apache Tomcat. This single container runs the metadig-webapp, which includes the metadig-controller class. metadig-controller then queues requests to rabbitmq, which runs in a separate container. The metadig-worker containers are registered with this rabbitmq container and so 'consume' the queued requests.
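The controller/queue/worker flow described above can be modeled with a minimal in-process sketch. This is a stand-in for illustration only: an in-memory queue replaces RabbitMQ, threads replace worker pods, and none of the names below come from the metadig code:

```python
import queue
import threading

# Simplified model of the metadig flow: a "controller" enqueues report
# requests; worker threads (standing in for metadig-worker pods) consume
# them. RabbitMQ is replaced by a standard-library in-memory queue.
requests_q = queue.Queue()
reports = []
reports_lock = threading.Lock()

def worker():
    while True:
        doc_id = requests_q.get()
        if doc_id is None:              # shutdown sentinel
            requests_q.task_done()
            return
        with reports_lock:              # "generate" the quality report
            reports.append(f"report for {doc_id}")
        requests_q.task_done()

# "metadig-controller": enqueue 1000 report requests.
for i in range(1000):
    requests_q.put(f"doc-{i}")

n_workers = 10
threads = [threading.Thread(target=worker) for _ in range(n_workers)]
for t in threads:
    t.start()
for _ in threads:
    requests_q.put(None)                # one sentinel per worker
requests_q.join()
for t in threads:
    t.join()

print(len(reports))                     # all 1000 requests processed
```

The design mirrors the production topology: the controller never talks to workers directly, only to the queue, which is why worker pods can be scaled up and down freely.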
The current cluster configuration (NGINX, metadig-controller, metadig-worker, and rabbitmq containers) is the proposed production configuration. The previous tests didn't include an ingress controller or a metadig-controller container, so they were essentially testing only worker performance.
Also for this round of testing, additional metrics were recorded in order to determine the load on the metadig-controller and rabbitmq containers themselves.
This round of tests used 1000 documents obtained from KNB, run against the `knb.suite.1` quality suite.
# requests | # workers | total elapsed time (minutes) | ave. elapsed time metadig-worker (seconds) | load ave docker1 | load ave docker2 (32 CPUs) | load ave docker3 (24 CPUs) | load ave rabbitmq | load ave metadig-controller | mem docker2 (GB) | mem docker3 (GB) | mem rabbitmq (GB) | mem metadig-controller (GB) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1000 | 10 | 28.0 | 14.85 | 0.58 | 26.0 | 19.24 | 27.08 | 41.4 | 24.3 | 24.7 | | |
1000 | 50 | 23.0 | 25.18 | 0.87 | 79.06 | 65.48 | 71.4 | 44.6 | 25.7 | 44.7 | | |
1000 | 100 | 13.0 | 39.48 | 0.79 | 114.10 | 87.85 | 122.20 | 66.5 | 46.1 | 67.2 | | |
1000 | 150 | 15.0 | 45.14 | 0.74 | 129.91 | 103.41 | 139.51 | 132.05 | 94.5 | 69.7 | 92.5 | 93.2 |
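Converting the table above to throughput makes the scaling limit visible: total elapsed time stops improving between 100 and 150 workers even as average per-worker elapsed time keeps growing. A quick calculation from the reported numbers:

```python
# Throughput for the third test round: 1000 requests / elapsed minutes.
runs = {10: 28.0, 50: 23.0, 100: 13.0, 150: 15.0}   # workers -> minutes

for workers, minutes in runs.items():
    print(f"{workers:3d} workers: {1000 / minutes:5.1f} requests/min")
```

Throughput peaks near 77 requests/min at 100 workers and then drops at 150 workers, suggesting contention (load averages well above the CPU counts, longer per-worker times) rather than useful extra parallelism.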
In order to check the feasibility of using Kubernetes (k8s) with the metadig engine, representative workloads will be run on the k8s cluster (docker-ucsb-1.test.dataone.org, docker-ucsb-2).