HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Setup for throughput optimization ? #17

Closed: PjEdwards closed this issue 5 years ago

PjEdwards commented 5 years ago

I've set up an instance of HSDS on EC2 running in Docker containers as per these instructions. The S3 bucket is the NREL bucket, configured like so:

BUCKET_NAME=nrel-pds-hsds
AWS_REGION=us-west-2
AWS_S3_GATEWAY=http://s3.us-west-2.amazonaws.com

I'm running some performance tests to see how much throughput I can squeeze out of HSDS. I've got it running on an m5a.12xlarge (48 cores, 192 GB) and I started it with 44 nodes using ./runall.sh 44. Running docker ps confirms that all the expected containers are running. I set up a load test script that hits HSDS with concurrent requests simulating multiple users. As I scale up the load, HSDS starts returning 503 errors. What's interesting is that, monitoring the server during the test, I can see that CPU and memory are barely being touched. I suspect I might be running Docker in a way that is preventing the EC2 instance from really doing its thing, or that there is some other similar setup issue.
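
(For illustration, a minimal version of such a load test might look like the sketch below; the endpoint, domain, dataset name, and slice are placeholders, and the actual test script may differ.)

    import concurrent.futures
    import h5pyd

    ENDPOINT = "http://localhost:5101"   # adjust to wherever your HSDS service listens
    DOMAIN = "/nrel/example.h5"          # placeholder domain in the nrel-pds-hsds bucket
    DATASET = "windspeed_100m"           # placeholder dataset name

    def one_read(i):
        # Each call simulates one user doing a hyperslab selection; every chunk
        # touched by the selection turns into an additional SN-to-DN request.
        try:
            with h5pyd.File(DOMAIN, "r", endpoint=ENDPOINT) as f:
                f[DATASET][i % 100, :]
            return "ok"
        except Exception as e:
            # 503 "server busy" responses surface here; the exact exception type
            # depends on the client library version
            return str(e)

    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        results = list(pool.map(one_read, range(1000)))
    print(sum(r != "ok" for r in results), "of", len(results), "requests failed")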

Given that minimal level of detail, do any suggestions spring to mind for how I could either improve the results or track down where the bottleneck is?

jreadey commented 5 years ago

I've been working directly with @PjEdwards on this issue, so I will just summarize the results of our investigation here.

The server returns 503 ("server busy") when there are more than 100 inflight requests per node. The idea is that the client will throttle back when it sees this so that the workload on the clients and server can be roughly balanced.

In the particular tests that were run, each client request to the SN node (for a hyperslab selection) generated hundreds of SN-to-DN requests (each touched chunk results in an additional SN-to-DN request), so the limit was likely to be hit with even a modest number of client requests.

Regarding CPU usage, in my testing I saw high CPU utilization per node, which indicates that the limit on inflight requests is about right. You can run: $ docker stats $(docker ps --format "{{.Names}}") to see per-container CPU utilization.

To improve throughput there are a few options; the best approach will depend on the particulars of the client and hardware setup:

  1. Reduce the request rate when 503 errors reach a certain threshold; going past a small percentage of 503 responses results in a large amount of wasted effort on the server (see the backoff sketch after this list).
  2. Move to a larger instance and run HSDS with more nodes
  3. Run multiple HSDS servers behind a load balancer
  4. Run multiple servers in a Kubernetes cluster

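As a sketch of option 1 (assuming a client that surfaces a 503 "server busy" response as an IOError with errno set, which depends on the client library; read_fn here is any callable that performs one read):

    import time

    def read_with_backoff(read_fn, max_retries=6, base_delay=0.5):
        # On a 503 "server busy" error, wait and retry, doubling the delay each
        # time so the aggregate request rate drops until the server catches up.
        for attempt in range(max_retries):
            try:
                return read_fn()
            except IOError as e:
                if getattr(e, "errno", None) != 503:
                    raise                                  # not a busy signal
                time.sleep(base_delay * (2 ** attempt))    # exponential backoff
        raise RuntimeError("server still busy after %d retries" % max_retries)
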
Also, for AWS deployments I'll be adding support for using AWS Lambda in place of the DN nodes for read requests. Lambda allows a higher degree of parallelism (up to 1,000 concurrent requests) than would be practical with just SN and DN nodes. Stay tuned!