Irritating results - Githubissues

Angi2412 commented 3 years ago

Hello,

I used the application to make some test runs with locust, but the resulting data makes not that much sense for me. Maybe you have an idea to explain that behaviour?

The shown data is from scaling the "WebUI" microservice and consists of seven runs (875 Datapoints). Each run consists of 125 different parameter variations. This means each pod resource (CPU limit, memory limit and the number of pods) has five expressions (5x5x5 = 125):

CPU limit: 100m – 300m
Memory limit: 500Mi – 700Mi
Number of pods: 1 - 5

Now if I look at the correlation matrix, the number of pods does not seem correlated to the average response time?

And then when I, for example, look at the relationship between average response time and CPU limit, which are correlated negatively it makes no sense... Here the memory limit is set to its median: cpu_limi_average _response_time_median

These are just examples.. there are more irritating results...

SimonEismann commented 3 years ago

These results do indeed look weird :)

Before we can dive deeper into what might be going on, could you help me better understand your experiment setup?

As far as I understand, you have deployed all tea store services and now you are investigating the impact of different configuration parameters for the WebUI container (CPU, Memory, #pods) with 5 potential values for each parameter. Here you measured every possible configuration (125) seven times (--> 875 experiments)

Can you describe how each individual experiment is configured:

Do you deploy the services freshly for each experiment
How many requests per second are you using?
How long is each experiment?
Are you using the browse load profile?
Where are the locust master/slave deployed?
What other parameters is locust using? Is there something like virtual users with locust?
As the load profile consists of multiple request types, do you calculate a mean response time over all request classes at the end?

Angi2412 commented 3 years ago

Thanks again for the fast response.

For each run (125 iterations), I deploy the application freshly (delete old namespace -> create new namespace -> deploy application). Between each configuration, I update the parameters for the WebUI microservice with the local Kubernetes replace_namespaced_deployment command. Only the WebUI microservice has limits the others don't -> here.
With locust, I use 300 users with a spawn rate of 2 per second
Every experiment has a duration of 3 minutes.
I am using the buy profile
I start locust locally (not in Kubernetes and not distributed) and make the request over the Node IP
I don't think so. The only parameters I use are users, spawn rate and time
I calculate the average response time independent of the request class. It is calculated with Prometheus metrics gathered by the service mesh linkerd: average response time = response_latency_ms_sum / response_latency_ms_count

SimonEismann commented 3 years ago

A few things of the top of my hat:

Have you tested that you can manually access the teastore with this deployment?
You could try replacing the nodeIP solution with an Ingress e.g., NGINX
The workload of 300 users seems quite high, based on your limits I would expect it to be able to handle less load. How many requests does this configuration results in over the 3-minute experiment?

Overall, your setup looks pretty good, to be honest.

Angi2412 commented 3 years ago

Yes, I tested that. Also between every experiment, there are min. 2 minutes of cooling time and I make a request to the WebUI status API if it's ready. Between every run, there is a cooling time of 5 minutes. This should ensure that all services are up running before the load is executed.
How would that affect the runs?
I will check how many requests are made. How many requests would you expect the service to be able to handle with the maximum configuration of CPU: 300m, Memory: 700Mi and 5 pods?

Could there be any weird dependency on the other services that explains this behaviour?

Angi2412 commented 3 years ago

In a 3-minute experiment with minimum resources, locust counts 306 requests (3.1 per second), from which 117 failed (1.5 per second).

SimonEismann commented 3 years ago

Your experiment setup looks reasonable in terms of exp durations/load/cooldown times.

So my theory right now would be that the failed request are the reason that the reported response times look so weird, so let's try getting rid of those first. Here I would look into two things:

If you run it with half the load, are there no more failed requests?
What error code do the failed requests report?

SimonEismann commented 3 years ago

In terms of expected requests your configuration can handle, I don't have any experience with your exact container sizes, but in this paper: https://doi.org/10.1145/3358960.3379124 we had resource requests of 420m, and limits of 2000m, a setup with 8 pods for each service was able to handle at least 900 requests/seconds. So my gut feeling would be 1.5 req/s seem okay for a 300m instance, but definitely somewhat on the lower side.

Angi2412 commented 3 years ago

I did a run with 27 (3 x 3 x 3) iterations and 50 users with a spawn rate of 1 per second. The parameter variations were as follows: CPU limit: 100m, 200m and 300m Memory limit: 500Mi, 600Mi and 700Mi Number of Pods: 1, 2 and 3

The resulting plots are still irritating. This image shows the relationship for each parameter with the target average response time. The not observed parameters are each set to minimum, median and maximum. Observations

The most occurring error codes are 500, 302 and 404. For example in the minimum variation they happened the following amount:

Error Code	Amount
302	6
404	12
500	14

Angi2412 commented 3 years ago

I did now filter the average response time so that only requests with a response smaller than 300 are taken into account. The overall average response time is now lower, but the course of the plots are still the same except for the number of pods with maximum CPU and memory limit.

filtered status code

Angi2412 commented 3 years ago

Is there maybe any kind of dataset already available?

SimonEismann commented 3 years ago

So there still seem to be some things off, but the results already look better. I think to dive deeper here, we probably need some more details, here are some ideas on what might help:

Does kubectl get pod show any pod restarts?
Would it be possible to post the detailed response time data? So response time per experiment and maybe even per request?
Do you have utilization logs? This would also help to exclude some potential explanations

If you want, we can also schedule a skype call to go over this (would need to switch to e-mail to exchange skype ids).

Angi2412 commented 3 years ago

A skype call would be great. My e-mail address is provided on my profile page.

DescartesResearch / TeaStore

Irritating results #180