DescartesResearch / TeaStore

A micro-service reference test application for model extraction, cloud management, energy efficiency, power prediction, single- and multi-tier auto-scaling
https://se.informatik.uni-wuerzburg.de
Apache License 2.0

Strange results using HTTP Load Generator on Kubernetes #189

Open Jodao opened 3 years ago

Jodao commented 3 years ago

Hi, I'm testing the maximum req/s TeaStore can handle on Kubernetes, using a deployment with one pod per microservice. I'm running this test with a load model that increases by 200 req/s every 2 minutes until it reaches 4000 req/s. For that, I'm using 3 load generator machines (one of them is the Kubernetes control plane, a.k.a. master; the other 2 are on the same network but not part of the cluster), with 1000 threads each and a timeout of 1000 ms (1 s). The problem is that a deployment where pods are limited to 1000m CPU (1 CPU) and 4 GB RAM produces the same results as deployments limited to 100/50/10m CPU and 4 GB RAM. This is obviously strange, because when decreasing the CPU limit for the pods I expected the number of successful requests to decrease. Any suggestions? Any help would be much appreciated, thank you
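For context, the step load model described above can be written out as an arrival-rate profile for the load generator. Below is a minimal sketch that generates the staircase (+200 req/s every 2 minutes, capped at 4000 req/s) as "time,rate" rows; the two-column CSV layout is an assumption on my side, so check the HTTP-Load-Generator README for the exact profile format your version expects:

```java
import java.util.ArrayList;
import java.util.List;

public class StepProfile {
    // Builds {time, rate} rows for a staircase profile: the rate increases by
    // stepRps every stepSeconds, capped at maxRps, with one row per second.
    public static List<double[]> stepProfile(int stepRps, int stepSeconds, int maxRps) {
        int nSteps = maxRps / stepRps;         // e.g. 4000 / 200 = 20 steps
        int total = nSteps * stepSeconds;      // e.g. 20 * 120 s = 2400 s overall
        List<double[]> rows = new ArrayList<>();
        for (int t = 1; t <= total; t++) {
            int step = (t - 1) / stepSeconds;  // 0-based step index
            int rate = Math.min(stepRps * (step + 1), maxRps);
            rows.add(new double[]{t, rate});
        }
        return rows;
    }

    public static void main(String[] args) {
        // Print the profile in a simple "time,rate" CSV layout (format assumed).
        for (double[] row : stepProfile(200, 120, 4000)) {
            System.out.println(row[0] + "," + (int) row[1]);
        }
    }
}
```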

SimonEismann commented 3 years ago

That's a lot of requests :)

My initial intuition would be that this requires more load driver instances, but to actually figure this out I would need the load driver logs + results. Could you share those? If you don't want to share them publicly, you could also mail them to me :)

Jodao commented 3 years ago

Hi again, thank you for the fast reply. My main problem is, as I said, that the number of successful requests does not drop when I decrease the pods' CPU limits on Kubernetes. There is no problem sharing those files, but since I get a lot of timeouts the log files are quite big, so I had to zip them. I'm attaching 2 zip files, one for an experiment with a 1000m CPU limit and the other for 50m. Each zip contains the logs of the 3 load generation machines, the log of the director machine, and the corresponding output of the experiment.
logs_&_output_50m_CPU.zip logs_&_output_1000m_CPU.zip
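For reference, the only difference between the two experiments is the CPU value in each microservice's pod spec. A minimal sketch of the relevant fragment (the surrounding deployment and container names are omitted as placeholders):

```yaml
# Hypothetical pod spec fragment -- only the cpu values differ between
# the two experiments (1000m in one run, 50m in the other).
resources:
  requests:
    cpu: "50m"
    memory: "4Gi"
  limits:
    cpu: "50m"
    memory: "4Gi"
```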

Thank you for your attention

SimonEismann commented 3 years ago

At first glance, the results look reasonable, with the number of timeouts increasing with load. However, in the output file the load driver reports an average response time of 0, so something seems to be going wrong here.

Could you run two test runs at low load (1 request per second and a thread count of 2), one with a timeout of 1 second and one without a timeout? Depending on whether these setups result in the load driver reporting a response time or not, we'll know where to look further :)

Jodao commented 3 years ago

Hi, I guess I don't need to run those experiments, because this is a known issue with the HTTP Load Generator (https://github.com/joakimkistowski/HTTP-Load-Generator/issues/11). The response-time values of 0 are caused by using the "timeout" flag. I didn't mention it because I'm not analyzing those values; I'm only looking at the successful requests, which should decrease with fewer CPU resources on the pods.

Anyway, I've run the experiments you asked for on a TeaStore deployment where pods are limited to 50m CPU and 4 GB RAM. The load generator was run with only 1 load generator machine besides the master, and the workload generated 1 req/s over 500 s.

I'm attaching 2 zip files containing the results of the two experiments (no timeout and timeout). Each zip has the output of the experiment and the logs of the load generator and director machines. no_timeout.zip timeout.zip

SimonEismann commented 3 years ago

Thanks for running the additional experiments. It's pretty interesting that the version with the timeout set produces way more successful transactions than the version without timeouts (I would expect the opposite). Therefore, I'd assume that something goes wrong as soon as we introduce the timeout.

This means that we need to investigate in two directions: a) Why do we get so few successful responses without a timeout? Here, I would try the same experiment, but with 500-1000m CPU. b) What happens once we introduce the timeout? Here, I'd assume there is a bug in the load generator that accidentally counts the timed-out requests as successes. At first glance, I'd guess the issue in the load driver is here: https://github.com/joakimkistowski/HTTP-Load-Generator/blob/master/tools.descartes.dlim.httploadgenerator/src/main/java/tools/descartes/dlim/httploadgenerator/http/HTTPTransaction.java#L82 For every other type of failed transaction, it throws a TransactionInvalidException; it should probably also do that when a request times out. Could you test whether this makes the timeouts show up as failed requests?
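To make the proposed change concrete, here is a simplified, self-contained sketch. It does not reproduce the real HTTPTransaction class (the class and method names, apart from TransactionInvalidException, are placeholders); it only illustrates the intended control flow: a SocketTimeoutException gets converted into the same TransactionInvalidException used for other failure modes, so timed-out requests are tallied as failures instead of successes:

```java
import java.net.SocketTimeoutException;

public class TimeoutAsFailure {

    // Stand-in for the load generator's TransactionInvalidException.
    static class TransactionInvalidException extends Exception {
        TransactionInvalidException(String msg) { super(msg); }
    }

    // Simulated request: throws SocketTimeoutException whenever the (fake)
    // server takes longer than the configured timeout.
    static String doRequest(long serverDelayMs, long timeoutMs) throws SocketTimeoutException {
        if (serverDelayMs > timeoutMs) {
            throw new SocketTimeoutException("read timed out");
        }
        return "<html>ok</html>";
    }

    // Proposed behavior: convert the timeout into an invalid transaction,
    // exactly like the other failure modes, so it counts as a failed request.
    static String process(long serverDelayMs, long timeoutMs) throws TransactionInvalidException {
        try {
            return doRequest(serverDelayMs, timeoutMs);
        } catch (SocketTimeoutException e) {
            throw new TransactionInvalidException("timeout after " + timeoutMs + " ms");
        }
    }

    public static void main(String[] args) {
        int ok = 0, failed = 0;
        long[] delays = {100, 2500, 300, 4000}; // two requests exceed the 1000 ms timeout
        for (long d : delays) {
            try {
                process(d, 1000);
                ok++;
            } catch (TransactionInvalidException e) {
                failed++;
            }
        }
        System.out.println("successful=" + ok + " failed=" + failed); // prints: successful=2 failed=2
    }
}
```

With this change, the two timed-out requests land in the failed tally, matching how every other invalid transaction is already counted.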

Jodao commented 3 years ago

Hi, I apologize for this late reply, but I've been stuck with other things (still am...). Meanwhile, I've tried the experiments with another generator, Locust, which showed the behavior I expected to see. Regarding your first suggestion: there is another problem with the HTTP Load Generator, namely that the simulation doesn't stop when the load generation stops. So I think your second guess is probably correct: there might be an error counting the timed-out requests as successful ones. I can try that later if the approach with the new generator doesn't work. Anyway, I will come back later to let you know what I've found. Thank you for your time