HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Performance Debugging #130

Closed · bilalshaikh42 closed this issue 2 years ago

bilalshaikh42 commented 2 years ago

Hello, we have been having performance issues when uploading multiple HDF5 files at once and are trying to debug. Here is what the HSDS responses look like:

(screenshot: HSDS responses over time, showing bursts of 503 errors)

To me, this is pretty clearly HSDS working as intended: exceeding the max inflight requests, returning 503s, and then accepting requests again once the cache is cleared. We have quite a few replicas running on Kubernetes, so I suspect that we might not be giving enough resources to the pods, or that the connection to the bucket is somehow slow.

@jreadey do you have any ballpark numbers for the number of requests that a replica should be able to handle? Are there recommended/minimum resources for each pod?

jreadey commented 2 years ago

The number of requests per replica is highly dependent on the type of request coming through. One hyperslab request to an SN node can trigger multiple requests to other pods' DN containers, and all of that counts toward the max_task_count limit. Once that limit is reached, the pod will return 503s until some pending requests have completed.

You can try raising the max_task_count config and see how that goes. If you get close to 100% CPU, you are probably doing about as well as you can with the max_task_count limit. Keep an eye out for OutOfMemory errors and raise the memory limit if they show up.
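A minimal sketch of one way to do that, assuming the admin/config/override.yml convention from the HSDS install docs (settings there take precedence over config.yml); the value shown is a placeholder, not a tuned recommendation:

```sh
# Sketch only: raise max_task_count via an override file rather than editing
# config.yml directly. The value below is a placeholder.
cat > admin/config/override.yml <<'EOF'
max_task_count: 200   # max in-flight tasks per pod before 503s are returned
EOF
# On Kubernetes, propagate the file through whatever ConfigMap your deployment
# mounts for HSDS config, then restart the pods so the new setting is read.
```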

How does throughput change as you increase the number of replicas? Ideal would be linear performance gain, but there may be other factors that prevent that. Can you track total network I/O and storage I/O on the cluster? Once those are saturated that will obviously be the bottleneck.
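On the CPU side, a quick way to watch per-pod usage during a load test is `kubectl top` (assuming metrics-server is installed; the namespace below is a placeholder). Network and storage throughput typically have to come from the cloud provider's monitoring instead:

```sh
# Per-pod CPU and memory while a load test runs (requires metrics-server).
# The "hsds" namespace is a placeholder for whatever your deployment uses.
kubectl top pods -n hsds --containers

# Watch for restarts or evictions during the run:
kubectl get pods -n hsds -w
```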

For HDF5 file uploads, are you converting them to the HSDS schema or just linking to the HDF5 files using hsload with the --link option (metadata is extracted but chunks stay with the file)? The latter will be much less compute intensive. @ajelenak created a Docker image that uses the latest HDF5 lib release to improve performance for getting the chunk location info in an HDF5 file. See: https://github.com/HDFGroup/hdf-docker/tree/master/storage-info.
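For concreteness, a hedged sketch of the two ingest modes being compared here; the file names, target domain, and bucket URI are all placeholders:

```sh
# Full ingest -- data is rechunked and rewritten in the HSDS schema:
hsload mydata.h5 /home/myuser/mydata.h5

# Link mode -- only metadata is ingested; chunk data stays in the source file,
# which must live somewhere the HSDS pods can read it (URI scheme depends on
# your storage backend):
hsload --link s3://mybucket/mydata.h5 /home/myuser/mydata.h5
```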

bilalshaikh42 commented 2 years ago

> You can try raising the max_task_count config and see how that goes. If you get close to 100% CPU, you are probably doing about as well as you can with the max_task_count limit. Keep an eye out for OutOfMemory errors and raise the memory limit if they show up.

This might be the approach we take. Would these errors show up in the logs? Does the memory limit here mean the limit that is defined in the config, or the memory limit set for the pod via Kubernetes?

> Can you track total network I/O and storage I/O on the cluster? Once those are saturated that will obviously be the bottleneck.

Not sure how to do that exactly, but I'll look into it. Everything is running on Google Cloud, and the storage buckets are as well, so I suspect the throughput should be fine. But I will definitely check.

> For HDF5 file uploads, are you converting them to the HSDS schema or just linking to the HDF5 files using hsload with the --link option (metadata is extracted but chunks stay with the file)? The latter will be much less compute intensive. @ajelenak created a Docker image that uses the latest HDF5 lib release to improve performance for getting the chunk location info in an HDF5 file. See: https://github.com/HDFGroup/hdf-docker/tree/master/storage-info.

We are using hsload without the --link parameter, so I guess we are converting. If we use the --link option, does that still allow HSDS to return the values as JSON when queried later? What are the downsides, if any, of using it?

jreadey commented 2 years ago

The OOM errors don't show up in the HSDS log since the container itself is being killed. If you do a `kubectl describe pod` it will show when the container was terminated due to using up the available memory.
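A quick way to check, with the pod name as a placeholder:

```sh
# Look for a Last State of Terminated with Reason: OOMKilled in the
# container status of the pod in question.
kubectl describe pod <hsds-pod-name> | grep -A 5 "Last State"
```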

There's the memory limit that you are probably specifying in the deployment yaml. Then there are memory-related config settings in HSDS, such as "chunk_mem_cache_size". You don't want the Kubernetes memory limit to be less than chunk_mem_cache_size + metadata_mem_cache_size + working memory! If you make changes to any of these, verify with a realistic load that you are not seeing OOM errors.
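A hedged illustration of that arithmetic, using placeholder values rather than recommendations (check your own config.yml for the actual defaults):

```sh
# Example only: if the per-node caches are set like this in override.yml ...
cat >> admin/config/override.yml <<'EOF'
chunk_mem_cache_size: 128m       # per-node chunk cache
metadata_mem_cache_size: 128m    # per-node metadata cache
EOF
# ... then the container memory limit in the deployment yaml needs headroom
# above 128m + 128m for working memory, e.g.:
#   resources:
#     limits:
#       memory: "1Gi"
```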

Re: the link option for hsload - it will not have any effect on how the data can be queried. The major constraint is that a dataset loaded with --link can't be modified, i.e. the source HDF5 files will never be updated.

The other constraint is that if the datasets in your HDF5 files use a small chunk size, you'll be stuck with that when they are loaded with --link. Without the --link option, hsload will use the opportunity to rechunk the datasets to a chunk size between 2-4 MB where possible. Smaller chunk sizes are a bit inefficient with cloud storage since the latency to access a chunk is fairly high. I'd recommend trying out the two options and evaluating how each affects query performance.

bilalshaikh42 commented 2 years ago

> The OOM errors don't show up in the HSDS log since the container itself is being killed. If you do a `kubectl describe pod` it will show when the container was terminated due to using up the available memory.

Oh, sorry I had misunderstood. Yes, the memory is fine in this case.

It does seem that CPU usage was the bottleneck. I was focused on tweaking the memory settings and completely overlooked monitoring the CPU usage.

Disabling the HTTP compression (the service is deployed behind an Nginx load balancer that does this already) seemed to help. I think by increasing the CPU and using --link, we should be good! Thank you as always for the help!

jreadey commented 2 years ago

Yes, I changed the default http compression in https://github.com/HDFGroup/hsds/blob/master/admin/config/config.yml to False. If the workload is sending lots of easily compressible data to/from the server it might help, but in general it seems not to.
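For anyone hitting this later, a sketch of pinning that setting explicitly on deployments whose config.yml still defaults to compression on; it assumes the same override.yml mechanism as above and that the key is named http_compression, as in the linked config.yml:

```sh
# Sketch: keep server-side HTTP compression off when a front-end proxy
# (Nginx here) already compresses responses.
cat >> admin/config/override.yml <<'EOF'
http_compression: false
EOF
```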