
Benchmark design proposal #106

Open namliz opened 7 years ago

namliz commented 7 years ago

Note: This is very tentative and is pending discussion.

Egress

HTTP load is generated with wrk pods, scriptable via Lua, and autoscaled in increments of 1000 rps (each pod makes one thousand concurrent requests), up to 100 pods max.

wrk pods are pinned with the affinity mechanism to node set A.

Given that this works out to under 100 MB/s of traffic in aggregate -- not a high bar -- by carefully picking the right instance type we can dial things in so that the load requires a predetermined number of nodes of our choosing.

It's a good idea to have the number of nodes equal to (or be a multiple of) the number of availability zones in the region the benchmark will run in. For 'us-west-2' that would be three.

Instance type selection is done with the intention of picking the smallest/cheapest type that is still beefy enough to generate the required load with just those three nodes.
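For concreteness, a minimal sketch of what the wrk deployment pinned to node set A might look like, assuming those nodes carry a hypothetical node-group=load-gen label and that a wrk container image is available (the label, image name, and target URL are placeholders, not part of this proposal):

```yaml
# Sketch only: the label, image, and URL below are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wrk-load
spec:
  replicas: 1                  # scaled up by the autoscaler, 100 pods max
  selector:
    matchLabels:
      app: wrk-load
  template:
    metadata:
      labels:
        app: wrk-load
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-group
                operator: In
                values: ["load-gen"]       # node set A
      containers:
      - name: wrk
        image: example/wrk:latest          # placeholder image
        # 1000 concurrent connections per pod, Lua script for custom reporting
        args: ["-t4", "-c1000", "-d300s", "-s", "/scripts/report.lua",
               "http://countly-api.default.svc.cluster.local"]
```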

Ingress

Countly API pods are similarly pinned to node set B. Again, as few as three nodes (with one pod per node) are required. This provides redundancy, but also mirrors the Egress setup described above and thus controls for variance in traffic between pods in different availability zones.


The autoscaling custom metric

The wrk pods report summaries with latency statistics and failed requests. It's possible to latch onto this error rate to provide the custom metric.

It seems at the moment this requires a bit of trickery, for example:

Pod 1: 1000 requests made, 12 timeouts, 31 errors
Pod 2: 1000 requests made, 55 timeouts, 14 errors
Pod 3: 1000 requests made, 32 timeouts, 55 errors

The autoscaler is actually provided a target. Assuming we want to tolerate no more than 10% bad requests (errors + timeouts), we'd provide a target of 100.

Based on the above, the autoscaler will keep launching additional pods; the load will increase, and so will the error rates and timeouts, until an equilibrium is reached.
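A rough sketch of what that could look like with the current autoscaling/v2 API, assuming a custom-metrics adapter publishes a per-pod bad_requests metric (errors + timeouts) scraped from the wrk summaries; the metric name and adapter are assumptions, and the HPA scale rule (desired replicas is roughly current replicas * metric / target) is where the trickery mentioned above comes in, since a raw bad-request count only drives scale-up once it exceeds the target:

```yaml
# Sketch only: assumes a custom-metrics adapter exposes "bad_requests"
# per pod, derived from the wrk summaries.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: wrk-load
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wrk-load
  minReplicas: 1
  maxReplicas: 100            # the 100 pod cap from the Egress section
  metrics:
  - type: Pods
    pods:
      metric:
        name: bad_requests    # errors + timeouts, hypothetical metric name
      target:
        type: AverageValue
        averageValue: "100"   # 10% of the 1000 requests each pod makes
```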


The backend (Mongo cluster)

Mongo pods are pinned to node set C. These are CPU, memory, and disk I/O intensive. The number of these pods and nodes is fixed.

The background job (Boinc)

Boinc pods are pinned to node sets A and B, but not C. They are scaled to soak up available CPU on these nodes, which would otherwise be underutilized.
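A hedged sketch of how the Boinc soakers could be confined to node sets A and B, reusing the hypothetical node-group labels from the earlier sketch and a placeholder BOINC client image:

```yaml
# Sketch only: node-group labels and image name are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: boinc-soaker
spec:
  replicas: 6                  # scaled until the idle CPU is absorbed
  selector:
    matchLabels:
      app: boinc-soaker
  template:
    metadata:
      labels:
        app: boinc-soaker
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-group
                operator: In
                values: ["load-gen", "countly"]   # sets A and B, never C
      containers:
      - name: boinc
        image: example/boinc-client:latest        # placeholder image
        # No CPU requests/limits (BestEffort), so these pods only burn
        # CPU left idle by the wrk and Countly pods.
```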

namliz commented 7 years ago

Boinc utilization fingerprint

It takes a minute to warm up; there's a small amount of network activity as it negotiates with the grid and gets a task, after which the pod will basically use all the CPU it is given. Start enough of these in parallel and the jitter smooths out.

leecalcote commented 7 years ago

It'd be good to clarify what we're trying to measure (e.g. Countly's performance/reaction under load, or kube-proxy's ability to keep up) and what we theorize the results will be. Reducing as many variables as possible (or allowing them to be fixed) in the experiment will help. On the positive side, even if the benchmark itself doesn't produce valuable results, the Prometheus node exporter will be valuable.

namliz commented 7 years ago

One of the main points of benchmarking Kubernetes

I think we should be trying to show a cluster that efficiently utilizes all of its resources across the board. Memory, CPU, networking, and disk I/O should all hover around 85%-90% on every node, and the sooner into the benchmark the better. Having idle resources in the cloud is wasting your money.

In theory, recreating this benchmark outside of Kubernetes the "old fashioned way" would require some multiple of the server instances and thus cost more.

If we re-run this benchmark on different clouds we can, in theory, have price/performance numbers for each.

For instance: on cloud X, processing 1 billion requests at 100,000 requests per second will take 3 hours and cost Y, or, with more servers, 30 minutes and cost Y*Z. Here are some measurements of various combinations: sweet spots, points of diminishing return.


Currently the main effort is around spreading the pods around in the cluster in such a way that ensures a fair benchmark.

The autoscaling bit will keep launching load generating pods until Countly is maxed out. That's basically needed functionality in order to run the benchmark against clusters of arbitrary size.

Say we want to benchmark 100,000 requests per second which translates (just for the sake of discussion) to 10MB/s.

If the network itself proves to be maxed out (perhaps we're using micro nodes on AWS and they can't push more than 9 MB/s), we'll simply select bigger instance types so that's not a bottleneck.

It is desirable to max out utilization as much as possible without going over the edge as it were.

Once those combinations are dialed in, the network I/O is efficiently/maximally utilized on the load generating pods. The next step would hopefully be to deploy Boinc pods into that group so the CPU gets maxed out as well (or at least until it starts interfering with the load generation).

I theorize the Countly group will use a moderate to small amount of CPU and a small amount of memory, with networking similarly maxed out -- it's not supposed to be doing much of anything except forwarding traffic to Mongo, which isn't very demanding.

The Mongo/backend group will use a moderate amount of CPU and networking. It will, however, use a high amount of disk I/O and memory (especially if we start tweaking things like indexes).

Just like with networking, if we dial in the combination of expected writes and the types of disks we select, it will demonstrate something akin to maximal utilization (if only the real world were so predictable for capacity planning).

If you're paying attention you can see we still would not be at full utilization; there are still holes to plug, and we'll have to dream up what else to deploy into the cluster to max it out.

jeremyeder commented 7 years ago

Hey all -- I wanted to point you to some similar work we're doing to put load on a Kube cluster. It's a part of our "cluster-loader" utility which was used on the CNCF environment last month. We have developed and extended it further with a feature simply titled the wlg, or workload generator.

It is intended to put any mix of arbitrary load on a cluster. The cluster-loader is designed to use templates as its primary means of defining applications. Workload generators are thus nothing more than templates. So far we've built workload generators for JMeter, syslog, and stress. The JMeter one, which is very similar to what you've got in this issue, scans a Kube cluster for routes and then dynamically builds a JMeter test plan based on those routes plus user input (YAML config). It then graphs the results, and we've got most of the work done to ship results to Elasticsearch.

Other templates would put load on the logging infrastructure by generating random data at certain rates to flow through it (in the case of the RH OpenShift implementation, fluentd and on to Elasticsearch).

Another template is a batch/busy pod using the simple stress utility. This is to simulate a CPU-bound workload that does not do any I/O.
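(Not the cluster-loader template itself, which lives in the repos linked below, but a generic sketch of what such a batch/busy pod tends to look like, with a placeholder image that has the stress utility installed:)

```yaml
# Sketch only: the image name is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: stress-cpu
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: example/stress:latest      # any image with stress installed
    # Burn two CPUs for ten minutes; no network or disk I/O involved.
    command: ["stress", "--cpu", "2", "--timeout", "600s"]
    resources:
      requests:
        cpu: "2"
```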

We have other templates planned as well to put network traffic on the cluster (different than http), as well as I/O to stress out the storage subsystems in various ways that represent apps we expect to run on Kube.

These tests are all wrapped in sophisticated metrics-gathering utilities that do things like use go pprof to trace the daemons, gather/graph all of the typical system metrics, etc.

Taken in the aggregate, cluster-loader is an easy, eminently flexible front-end for creating a beehive of realistic activity on even the largest Kubernetes clusters.

I hope you can at least take a look at what we're building and if you're interested, possibly collaborate together.

cluster-loader https://github.com/openshift/svt/tree/master/openshift_scalability

wlg PR (should be merged soon) https://github.com/openshift/svt/pull/65

namliz commented 7 years ago

@jeremyeder, this is very interesting.

Pretty obvious there's some overlap between our needs and approaches. I like the templates idea.

Currently the tools I've selected are wrk for the HTTP load generation and Falcon (a performant Python microservices framework) for an echo/mocking server to simulate the behaviour of services under various loads.

Falcon was selected by going over this extensive list of framework benchmarks in many languages. Check out how they define a benchmark_config.json for each test/tool. So I definitely agree on the appeal of that and how you're using templates.

I'll mention that such a structure naturally emerges with well organized Kubernetes yaml files, but perhaps a pure 'test definition', split out and templated, is a win. I'm actually already using Jinja elsewhere to generate Kubernetes yaml, so it will be straightforward to try out.
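As a strawman (every field name below is invented purely for illustration), such a split-out test definition could be a small YAML file that a Jinja template turns into the actual manifests:

```yaml
# Hypothetical test definition; the Jinja template consuming it might
# contain fragments like:  replicas: {{ test.initial_pods }}
#                          args: ["-c", "{{ test.connections }}", ...]
test:
  name: countly-http-baseline
  tool: wrk
  initial_pods: 3
  connections: 1000
  duration: 300s
  target_url: http://countly-api.default.svc.cluster.local
```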

These tests are all wrapped in sophisticated metrics-gathering utilities that do things like use go pprof to trace the daemons, gather/graph all of the typical system metrics, etc.

I'd love for you to elaborate on that; visualization is rather critical. So far what I'm doing is a lot of trial and error to get a feel for the cluster and its behavior (even with Grafana in the mix).

I definitely think a generic test runner/wrapper would be useful for the community. How do you prefer I contact you to discuss this further? I'll be going over your stuff and figuring out how to potentially fit this together.

jeremyeder commented 7 years ago

Great! You can get me on CNCF slack in #cluster as @jeremyeder

leecalcote commented 7 years ago

@zilman "Currently the main effort is around spreading the pods around in the cluster in such a way that ensures a fair benchmark." Got it. This is more about testing the scheduler, the behavior of different strategies and performance of kube-scheduler.

leecalcote commented 7 years ago

@jeremyeder thanks for calling out the cluster-loader (enjoyed your post). Is this tethered to OpenShift or Ansible in any way?

jeremyeder commented 7 years ago

Ansible, no. OpenShift, there is a soft dependency there in that we added a "-k" option that flips it to start using kubectl. It's not tested thoroughly at the moment, but it does at least work for some of the density tests that we run on upstream Kube.

Let's please find some time to hash this out. We have a refactor coming which has enumerated (among other things) the need to create a library for API interactions that can work on both Kube and OpenShift. That refactor would be the perfect opportunity to resolve the overlap that @zilman has described. One tool that works everywhere and is as flexible as we can make it... that's what we want to achieve.

gmarek commented 7 years ago

cc @gmarek @wojtek-t @fgrzadkowski

timothysc commented 7 years ago

@zilman This really seems like tooling that should live here: https://github.com/kubernetes/perf-tests

Thoughts?

namliz commented 7 years ago

@timothysc that makes sense to me. We're having a call about this tomorrow, you're welcome to join, feel free to email me.

Let's move the discussion to kubernetes/perf-tests#5.

gmarek commented 7 years ago

FYI: we're also planning on building something similar, though we have a different goal. Namely, we want to have a cluster-load (to reuse the terminology) to find scalability bottlenecks (whether it be the API server(s), kube-proxies, etcd, or some controller).

This means we don't really care about fair benchmarking and comparing different providers, but we certainly do care about tools that can stress the cluster, which in turn should certainly live in https://github.com/kubernetes/perf-tests as @timothysc mentioned.

pskrzyns commented 7 years ago

cc @ss7pro @mzylowski

gmarek commented 7 years ago

You're bringing up a very important issue. The answer strongly depends on what you want to get from the test. Pinning Pods to Nodes takes all scheduling logic out of the picture. Using Pod(Anti)Affinity is a bit better, as long as it's scalable and allows some freedom (e.g. having only one Node on which some Pod can run is probably a bad idea, as that Node can die at any point in time). So if one wants to test the 'whole cluster experience', we need to use as "normal" a configuration as possible, so that improvements in the scheduling algorithm would be visible.

OTOH if you're interested in benchmarking e.g. network throughput, or some other isolated part of the system, then having a very rigid config makes a lot of sense, as it'll make test results more reproducible.

As for how exactly we want to load the cluster - it's probably a subject to discuss at a @kubernetes/sig-scalability meeting, as there are a lot of people with a lot of experience in this topic.

On Tue, Sep 27, 2016 at 5:13 PM, Eugene notifications@github.com wrote:

@gmarek https://github.com/gmarek what's your approach to finding scalability bottlenecks?

You might not care about comparing providers but I think you might want to ensure fair benchmarking. Simplest example I can think of:

Two minion cluster.

Minion A is running an echo server (it just responds 200 OK to everything). Minion B is running load generating pods.

If you don't pin the load generating pods to Minion B, or rather to "not Minion A", your numbers start meaning something very different and the nature of the experiment changes. It's a whole other thing; really two separate tests that could find different kinds of bottlenecks.

So I'm thinking the most basic testfile is just a nice wrapper around a well known benchmarking utility, but we should consider a way to describe a few simple constraints - how the Kubernetes environment should look - to ensure correctness.

BTW, maybe we should pick or start a slack channel for this.


namliz commented 7 years ago

@gmarek you got it, exactly right.

So there are situations where you might want to do both; pinning to nodes acts as a control. For instance, like you said, to simulate network throughput artificially and establish a baseline.

That sort of thing is helpful to design benchmarks that are more "normally" configured.

So since we're using the word 'framework', the aforementioned config files should allow for that, in my opinion.
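For example (purely hypothetical field names), a test definition in such a framework could let each test declare its own placement constraints, covering both the pinned/control case and the "normal" scheduling case:

```yaml
# Hypothetical sketch: expressing the echo-server example's constraints.
test:
  name: echo-throughput-baseline
  workload: wrk
  target: echo-server
  placement:
    echo-server:
      pin_to: { node-group: serve }    # the "Minion A" role
    wrk:
      avoid: { node-group: serve }     # keep load generators off that node
  # omit the placement block entirely to exercise normal scheduling
```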

gmarek commented 7 years ago

Fair enough.