These results do indeed look weird :)
Before we can dive deeper into what might be going on, could you help me better understand your experiment setup?
As far as I understand, you have deployed all TeaStore services and are now investigating the impact of different configuration parameters for the WebUI container (CPU, memory, #pods), with 5 potential values for each parameter. Here you measured every possible configuration (125) seven times (--> 875 experiments).
Can you describe how each individual experiment is configured:
Thanks again for the fast response.
I change the deployment configuration with the `replace_namespaced_deployment` command. Only the WebUI microservice has limits, the others don't -> here.

The average response time is calculated as `average response time = response_latency_ms_sum / response_latency_ms_count`.
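For context, here is a minimal sketch of how such a limit update could look with the Python Kubernetes client (the deployment name, namespace and values below are placeholders, not taken from my actual setup):

```python
from kubernetes import client, config

# Load the local kubeconfig (assumes kubectl access to the cluster).
config.load_kube_config()
apps = client.AppsV1Api()

# Placeholder names: adjust to the actual WebUI deployment and namespace.
name, namespace = "teastore-webui", "default"

# Read the current deployment, adjust replicas and resource limits, and replace it.
deployment = apps.read_namespaced_deployment(name, namespace)
deployment.spec.replicas = 2
deployment.spec.template.spec.containers[0].resources = client.V1ResourceRequirements(
    limits={"cpu": "300m", "memory": "700Mi"}
)
apps.replace_namespaced_deployment(name, namespace, deployment)
```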
A few things off the top of my head:
Overall, your setup looks pretty good, to be honest.
Could there be any weird dependency on the other services that explains this behaviour?
In a 3-minute experiment with minimum resources, Locust counts 306 requests (3.1 per second), of which 117 failed (1.5 per second).
Your experiment setup looks reasonable in terms of exp durations/load/cooldown times.
So my theory right now would be that the failed requests are the reason the reported response times look so weird, so let's try getting rid of those first. Here I would look into two things:
In terms of how many requests your configuration can be expected to handle, I don't have any experience with your exact container sizes, but in this paper (https://doi.org/10.1145/3358960.3379124), with resource requests of 420m and limits of 2000m, a setup with 8 pods for each service was able to handle at least 900 requests/second. So my gut feeling is that 1.5 req/s seems okay for a 300m instance, but definitely somewhat on the lower side.
I did a run with 27 (3 x 3 x 3) iterations and 50 users with a spawn rate of 1 per second. The parameter variations were as follows:

- CPU limit: 100m, 200m and 300m
- Memory limit: 500Mi, 600Mi and 700Mi
- Number of pods: 1, 2 and 3
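For reference, a minimal sketch of how this 3 x 3 x 3 grid can be enumerated (the parameter values are the ones above, everything else is illustrative):

```python
from itertools import product

cpu_limits = ["100m", "200m", "300m"]
memory_limits = ["500Mi", "600Mi", "700Mi"]
pod_counts = [1, 2, 3]

# Full factorial design: 3 x 3 x 3 = 27 configurations.
configurations = list(product(cpu_limits, memory_limits, pod_counts))
assert len(configurations) == 27

for cpu, memory, pods in configurations:
    print(f"cpu={cpu}, memory={memory}, pods={pods}")
```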
The resulting plots are still confusing. This image shows the relationship of each parameter with the target average response time. The parameters not shown are each set to their minimum, median and maximum.
The most common error codes are 500, 302 and 404. For example, in the minimum variation they occurred in the following amounts:
| Error Code | Amount |
|---|---|
| 302 | 6 |
| 404 | 12 |
| 500 | 14 |
I have now filtered the average response time so that only requests with a response below 300 are taken into account. The overall average response time is now lower, but the shape of the plots is still the same, except for the number of pods with maximum CPU and memory limits.
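For what it's worth, a sketch of such a filter (assuming the threshold refers to the response status code; the file and column names are placeholders, not the real schema):

```python
import pandas as pd

# Placeholder results file and column names.
df = pd.read_csv("results.csv")

# Keep only requests with a response status below 300 before averaging.
successful = df[df["status_code"] < 300]
avg_response_time = successful["response_time"].mean()
print(f"Average response time over successful requests: {avg_response_time:.1f} ms")
```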
Is there maybe any kind of dataset already available?
So there still seem to be some things off, but the results already look better. I think to dive deeper here, we probably need some more details, here are some ideas on what might help:
Does `kubectl get pod` show any pod restarts?

If you want, we can also schedule a Skype call to go over this (we would need to switch to e-mail to exchange Skype IDs).
A Skype call would be great. My e-mail address is provided on my profile page.
Hello,
I used the application to make some test runs with Locust, but the resulting data does not make much sense to me. Maybe you have an idea that explains this behaviour?
The shown data is from scaling the "WebUI" microservice and consists of seven runs (875 data points). Each run consists of 125 different parameter variations, meaning each pod resource (CPU limit, memory limit and number of pods) has five possible values (5 x 5 x 5 = 125):
Now if I look at the correlation matrix, the number of pods does not seem to be correlated with the average response time?
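A correlation matrix of this kind can be computed with pandas, for example (the file and column names below are placeholders):

```python
import pandas as pd

# Placeholder dataframe: one row per experiment, column names are illustrative.
df = pd.read_csv("experiments.csv")
corr = df[["cpu_limit", "memory_limit", "num_pods", "avg_response_time"]].corr()
print(corr)
```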
And then when I, for example, look at the relationship between average response time and CPU limit, which are negatively correlated, it makes no sense... Here the memory limit is set to its median:
These are just examples... there are more confusing results...