ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Load testing #1134

Open david4096 opened 8 years ago

david4096 commented 8 years ago

Hi everyone! In order to prepare for our demonstration server with the 1kg VCFs, I've started to perform some rudimentary single-node load testing. Both server and client are in the same region of Microsoft's cloud.

The infrastructure is as follows:

Server (latest source):

Ubuntu 14.04.4 LTS
Apache/2.4.7 (Ubuntu)
Python 2.7.6
MaxKeepAliveRequests 100
KeepAliveTimeout 5
             total       used       free     shared    buffers     cached
Mem:          440G       207G       232G       199G        72M       204G
32 cores of Intel(R) Xeon(R) CPU E5-2698B v3 @ 2.00GHz

Client:

Ubuntu 15.10
Python 2.7.10
             total       used       free     shared    buffers     cached
Mem:           55G        10G        44G       336M       150M       9.8G
8 cores of AMD Opteron(tm) Processor 4171 HE

I'm using the locust load testing tool, which spawns individual processes that repeatedly make requests against a few server endpoints. Our load testing suite will be made more complete, but I wanted to share what I've learned so far.
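For illustration, a minimal locustfile in this style might look roughly like the sketch below. The endpoint paths are the ones from the results table further down, but the host, request bodies, and wait times are placeholders rather than the actual suite.

# Hypothetical locustfile sketch; request bodies and wait times are placeholders.
from locust import HttpLocust, TaskSet, task

class SearchTasks(TaskSet):
    @task
    def search_datasets(self):
        # An empty search body is a placeholder; the real test may page through results.
        self.client.post("/datasets/search", json={})

    @task
    def search_reads(self):
        # The readGroupIds/referenceId values are placeholders for real server IDs.
        self.client.post("/reads/search", json={
            "readGroupIds": ["<readGroupId>"],
            "referenceId": "<referenceId>",
            "start": 0,
            "end": 1000,
        })

class ServerUser(HttpLocust):
    task_set = SearchTasks
    min_wait = 1000  # milliseconds between a simulated user's requests
    max_wait = 5000

Running something like this with locust -f locustfile.py --host=http://<server> and ramping up to 200 users in the locust web UI gives the kind of swarm described below.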

First, Apache must be explicitly told to run the WSGIDaemonProcess with a single thread. When running with the defaults, concurrent requests to the reads API would throw errors like this one:

[Tue Apr 19 21:09:01.459466 2016] [:error] [pid 9804:tid 140138183067392]     for obj, nextPageToken in objectGenerator(request):
[Tue Apr 19 21:09:01.459471 2016] [:error] [pid 9804:tid 140138183067392]   File "/srv/ga4gh/server/newserverenv/lib/python2.7/site-packages/ga4gh-0.2.3.dev4_nge0b07f3-py2.7.egg/ga4gh/backend.py", line 520, in readsGenerator
[Tue Apr 19 21:09:01.459484 2016] [:error] [pid 9804:tid 140138183067392]     return self._readsGeneratorSingle(request)
[Tue Apr 19 21:09:01.459489 2016] [:error] [pid 9804:tid 140138183067392]   File "/srv/ga4gh/server/newserverenv/lib/python2.7/site-packages/ga4gh-0.2.3.dev4_nge0b07f3-py2.7.egg/ga4gh/backend.py", line 536, in _readsGeneratorSingle
[Tue Apr 19 21:09:01.459494 2016] [:error] [pid 9804:tid 140138183067392]     request, readGroup, reference)
[Tue Apr 19 21:09:01.459499 2016] [:error] [pid 9804:tid 140138183067392]   File "/srv/ga4gh/server/newserverenv/lib/python2.7/site-packages/ga4gh-0.2.3.dev4_nge0b07f3-py2.7.egg/ga4gh/backend.py", line 157, in __init__
[Tue Apr 19 21:09:01.459514 2016] [:error] [pid 9804:tid 140138183067392]     super(ReadsIntervalIterator, self).__init__(request, parentContainer)
[Tue Apr 19 21:09:01.459519 2016] [:error] [pid 9804:tid 140138183067392]   File "/srv/ga4gh/server/newserverenv/lib/python2.7/site-packages/ga4gh-0.2.3.dev4_nge0b07f3-py2.7.egg/ga4gh/backend.py", line 69, in __init__
[Tue Apr 19 21:09:01.459525 2016] [:error] [pid 9804:tid 140138183067392]     self._initialiseIteration()
[Tue Apr 19 21:09:01.459530 2016] [:error] [pid 9804:tid 140138183067392]   File "/srv/ga4gh/server/newserverenv/lib/python2.7/site-packages/ga4gh-0.2.3.dev4_nge0b07f3-py2.7.egg/ga4gh/backend.py", line 84, in _initialiseIteration
[Tue Apr 19 21:09:01.459535 2016] [:error] [pid 9804:tid 140138183067392]     self._nextObject = next(self._searchIterator, None)
[Tue Apr 19 21:09:01.459548 2016] [:error] [pid 9804:tid 140138183067392]   File "/srv/ga4gh/server/newserverenv/lib/python2.7/site-packages/ga4gh-0.2.3.dev4_nge0b07f3-py2.7.egg/ga4gh/datamodel/reads.py", line 111, in _getReadAlignments
[Tue Apr 19 21:09:01.459553 2016] [:error] [pid 9804:tid 140138183067392]     for readAlignment in readAlignments:
[Tue Apr 19 21:09:01.459558 2016] [:error] [pid 9804:tid 140138183067392]   File "pysam/calignmentfile.pyx", line 1758, in pysam.calignmentfile.IteratorRowRegion.__next__ (pysam/calignmentfile.c:19488)
[Tue Apr 19 21:09:01.459594 2016] [:error] [pid 9804:tid 140138183067392] IOError: truncated file

Note the IOError: truncated file. I believe pysam reuses the same filehandle across threads, causing .fetch to return unpredictable results. Worst of all, the file can get into an unrecoverable state, causing all future requests for data in that BAM to throw a 500.
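To illustrate the failure mode (this is just a sketch, not the ref server's code): if each worker thread opens its own pysam handle, for example via threading.local, concurrent fetches no longer interleave reads on a shared file descriptor. The paths and region below are placeholders.

# Illustrative sketch only: per-thread pysam handles via threading.local.
import threading

import pysam

_local = threading.local()

def get_bam(path):
    # Each thread lazily opens its own AlignmentFile instead of sharing one handle.
    handles = getattr(_local, "handles", None)
    if handles is None:
        handles = _local.handles = {}
    if path not in handles:
        handles[path] = pysam.AlignmentFile(path, "rb")
    return handles[path]

def count_reads(path, reference, start, end):
    # Placeholder query; safe to call from multiple threads because each
    # thread uses its own handle.
    bam = get_bam(path)
    return sum(1 for _ in bam.fetch(reference, start, end))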

I was able to reliably reproduce this error with a simpler setup: spawn a few processes, each sending a single reads/search request against the ref server running as a WSGI process. When running under the Python HTTP server, concurrent requests are queued and handled one at a time, so this error is not reproduced.
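A reproduction along those lines can be as small as the following sketch, which fires a burst of simultaneous searches from separate processes; the URL and request body are placeholders.

# Hypothetical reproduction sketch (Python 2); URL and request body are placeholders.
import json
import multiprocessing
import urllib2

URL = "http://<server>/reads/search"
BODY = json.dumps({"readGroupIds": ["<readGroupId>"],
                   "referenceId": "<referenceId>",
                   "start": 0, "end": 10000})

def hit(_):
    req = urllib2.Request(URL, BODY, {"Content-Type": "application/json"})
    try:
        return urllib2.urlopen(req).getcode()
    except urllib2.HTTPError as e:
        return e.code

if __name__ == "__main__":
    # A burst of concurrent searches against the same BAM is enough to trigger
    # the truncated file error when the WSGI daemon runs with multiple threads.
    pool = multiprocessing.Pool(8)
    print(pool.map(hit, range(8)))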

I then set the WSGIDaemonProcess threads to 1 and processes to 10 in the site.conf:

WSGIDaemonProcess processes=10 threads=1
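For context, a fuller site.conf fragment along these lines might look roughly like the following; the daemon process name, paths, and WSGI script location are placeholders, not the actual deployment config.

# Hypothetical mod_wsgi fragment; the process name and paths are placeholders.
WSGIDaemonProcess ga4gh processes=10 threads=1 python-path=/srv/ga4gh/server
WSGIScriptAlias / /srv/ga4gh/server/application.wsgi

<Directory /srv/ga4gh/server>
    WSGIProcessGroup ga4gh
    WSGIApplicationGroup %{GLOBAL}
    Require all granted
</Directory>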

With this setting, load testing could continue, and I was overall very pleased with the results. The load testing framework simulates user action by repeatedly making HTTP requests against a few endpoints. I've simulated 200 users maxing out at around 30 requests/second. At this stage I've performed testing of the endpoints listed below.

Request time summary statistics

Method | Name | # requests | # failures | Median resp time | Avg resp time | Min resp time | Max resp time | Avg content size | Req/s
POST | "/datasets/search" | 4254 | 2 | 9 | 14 | 4 | 1030 | 96 | 0.07
POST | "/features/search" | 426379 | 57 | 20 | 26 | 5 | 31123 | 33457 | 7.04
POST | "/featuresets/search" | 424892 | 54 | 9 | 14 | 4 | 30212 | 201 | 7.01
POST | "/readgroupsets/search" | 426304 | 46 | 260 | 274 | 154 | 29388 | 6432351 | 7.03
POST | "/reads/search" | 425657 | 49 | 47 | 51 | 5 | 26744 | 163603 | 7.02
Total | | 1707486 | 208 | 38 | 91 | 4 | 31123 | 1655139 | 28.18

Worth noting: the error rate is well under 1%, and average response time appears to scale with the size of the payload, which is what we expect given that we are not performing any real computation at these endpoints. This is great!

As we continue testing, I will share more on what is needed to reproduce these results. The locustfile used to generate the swarm is available here.

jeromekelleher commented 8 years ago

@david4096, this is just great! This is incredibly useful data, and it's massively encouraging to see the server behaving well under load. I'd love to see more stats on this, and it would be particularly fascinating to see what happens when we do a distributed swarm (locust looks great, by the way; thanks for introducing me to it!).

Regarding the errors, these look to me like someone playing with the Swagger front end, not part of the load test. The second one in particular looks very much like someone sending the default Swagger request for searchFeatures (this causes a 500, which is incorrect; I've opened an issue on this in #1135). Is there a client string in these logs? That might help narrow it down.

I'm not sure why the first one resulted in a stack trace; that should be a 400 InvalidJsonException, which is handled as a normal input error. We should look at our error handling code to make sure we don't pollute the Apache log with this.

david4096 commented 8 years ago

@jeromekelleher I believe you are correct; someone was using the Swagger docs during the night! Future load tests will be against a private server and performed in a more coordinated fashion across nodes, so this kind of confusion won't happen.

I too was very encouraged by our performance without tuning!

I believe we modified the logging code so any exceptions during a request will be logged, not just 500s.

jeromekelleher commented 8 years ago

Some follow up questions:

  1. What are the units for the response times?
  2. Do you have more information on the failures section? This should really be zero; I can't think of any good reason for a datasets/search to fail.

Keep the data coming!

kozbo commented 8 years ago

All times are in ms. I think the failures are due to HTTP requests getting mangled under load, though I will let @david4096 postulate on that. We are currently working on a more complete performance test suite. A more extensive load test will follow once we have that framework in place.

jeromekelleher commented 8 years ago

Thanks @kozbo, that sounds good. We should try to build in some more reporting about the nature of these errors. I think we need more information on what's happening here.