HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Clarification of node types for HSDS #120

Closed: bilalshaikh42 closed this issue 2 years ago

bilalshaikh42 commented 2 years ago

Hello,

I am trying to run HSDS locally and am running into some issues regarding the various node types, such as the head nodes, data nodes, and service nodes. In our Kubernetes deployments it seems that head nodes are no longer used, but the local data nodes complain that they are not accessible.

Could you please clarify the role of the various nodes and how they communicate with each other and external users?

jreadey commented 2 years ago

Sure - the different node types and variation between docker, local, and Kubernetes deployments can be confusing.

For Docker there are these types of nodes (containers):

- head_node: singleton; sends health checks to the other containers and maintains the health state of the service (READY vs. WAITING)
- service_node: handles client requests and dispatches them to one or more data nodes
- data_node: receives requests from a service node and reads/writes data to the storage objects within its partition
- rangeget_node: singleton used by the DN nodes to retrieve/cache chunk data from HDF5 files (does read-ahead for efficiency)
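As a rough illustration of the head_node's role, here is a minimal sketch of a health-check loop in Python. The hostnames, ports, route, and poll interval are all assumptions made up for this sketch, not HSDS's actual configuration (HSDS itself is asynchronous and uses its own internal routes).

```python
import time
import requests  # third-party; assumed available for this sketch

# Hypothetical node endpoints -- the hostnames, ports, and route are assumptions,
# not HSDS's real configuration.
NODE_ENDPOINTS = [
    "http://sn1:5101/info",
    "http://dn1:6101/info",
    "http://dn2:6101/info",
]

def all_nodes_healthy():
    """Return True only if every node answers its health check."""
    for url in NODE_ENDPOINTS:
        try:
            requests.get(url, timeout=2).raise_for_status()
        except requests.RequestException:
            return False
    return True

if __name__ == "__main__":
    while True:
        # A head node would flip the cluster state based on checks like this.
        print("cluster state:", "READY" if all_nodes_healthy() else "WAITING")
        time.sleep(10)  # poll interval is arbitrary here
```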

For Kubernetes, there is no head node or rangeget_node. Instead, each pod consists of an SN and a DN container. The pod uses the Kubernetes API to determine which other HSDS pods are running on the system. Client requests get routed to the SN container, which dispatches them either to the DN container in the same pod or to DN containers in other pods.
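For a sense of how a pod might discover its peers, here is a hedged sketch using the official Kubernetes Python client. The namespace, the label selector, and the filtering on Running pods are assumptions for illustration; HSDS's own discovery code may differ.

```python
from kubernetes import client, config  # official Kubernetes Python client

def get_hsds_pod_ips(namespace="default", label_selector="app=hsds"):
    """Return the IPs of running HSDS pods (sketch only).

    The namespace and label selector are assumptions; adjust them to match
    the labels used in your deployment.
    """
    config.load_incluster_config()  # assumes this code runs inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    return [
        pod.status.pod_ip
        for pod in pods.items
        if pod.status.phase == "Running" and pod.status.pod_ip
    ]
```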

On Kubernetes you can also do deployments to a single pod (e.g. https://github.com/HDFGroup/hsds/blob/master/admin/kubernetes/k8s_deployment_aws_singleton.yml), where each pod has 1 head, 1 SN, and multiple DN containers. Here the pods don't communicate with each other, which can lead to race conditions for multiple-writer/multiple-reader applications (but is fine when all clients are readers).

For local deployments (or AWS Lambda invocations), there are no head containers (or any containers at all). Instead, the parent process creates sub-processes for the SN, one or more DNs, and a rangeget process. Communication between the client and the sub-processes is via sockets rather than HTTP/TCP.
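To make the local layout concrete, here is a rough sketch of a parent process launching the sub-processes. The module names and counts below are hypothetical placeholders, not the actual HSDS entry points; runall.sh / the real launcher also wires up the sockets the processes use to talk to each other.

```python
import subprocess
import sys

def start_local_nodes(dn_count=2):
    """Launch SN, DN, and rangeget sub-processes (illustrative only).

    The module names are hypothetical; they stand in for whatever entry
    points the real launcher invokes.
    """
    cmds = [
        [sys.executable, "-m", "my_service_node"],    # hypothetical SN entry point
        [sys.executable, "-m", "my_rangeget_proxy"],  # hypothetical rangeget entry point
    ]
    cmds += [[sys.executable, "-m", "my_data_node"] for _ in range(dn_count)]
    return [subprocess.Popen(cmd) for cmd in cmds]
```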

Does this explanation help? Let me know if anything still seems murky.

For development work, I generally use the local mode on my notebook. Start the server with `./runall.sh --no-docker`. Log files will show up in /tmp/hs/ by default.
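A quick way to check that the local server has come up might look like the snippet below. The port (5101) and the /about route are assumptions based on common defaults; verify them against your own config.

```python
import requests  # third-party

# Port 5101 and the response contents are assumptions; check your hsds config
# if this doesn't match your setup.
resp = requests.get("http://localhost:5101/about", timeout=5)
resp.raise_for_status()
print(resp.json())  # should report the server state (READY once everything is up)
```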

bilalshaikh42 commented 2 years ago

This is very helpful! The difference between the local, Kubernetes, and singleton Kubernetes deployments was really throwing me off previously, but this is now clear. On Kubernetes, I assume the service node must be checking the status of the known data nodes to determine whether it can pass on a request correctly or not? If it finds that all the data nodes are busy, does it change its own status so that the k8s load balancers detect this, or does it return a 429, etc.?

jreadey commented 2 years ago

Glad that was helpful!

The service node checks the state of all data nodes with the intent of establishing a cardinal ordering of the running data nodes. If you look at this diagram: https://www.hdfgroup.org/wp-content/uploads/2020/01/Kita-Architecture-v2-RGB.png, you can see that each data node belongs to one partition of the object store. The idea is that each storage object has one and only one data node "master" -- that is the only entity reading and writing that object. If multiple data nodes were writing to the same object, it's very likely that updates would get overwritten, or at least you wouldn't have read-after-write consistency. Another advantage is that a data node can cache recently accessed objects without worrying that the data has been changed in the storage layer.
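To make the "one master per object" idea concrete, here is a sketch of how an object id could be mapped to a data node. The hash choice here is mine, not HSDS's actual partitioning function; the point is only that the mapping is deterministic given the number of data nodes, so exactly one DN owns each object.

```python
import hashlib

def get_obj_partition(obj_id: str, dn_count: int) -> int:
    """Map a storage object id to the index of the data node that owns it.

    Sketch only: a stable hash of the object id modulo the number of data
    nodes gives exactly one DN "master" per object.
    """
    digest = hashlib.md5(obj_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % dn_count

# Example: with 4 data nodes, the same object id always maps to the same DN.
print(get_obj_partition("g-12345678-abcd-ef00-0000-000000000000", 4))
```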

So now imagine that we add more data nodes to the mix. As these come online, the system will need to update the number of partitions and re-map which objects belong to which data node. Until all the data nodes become aware of their new order and the write queues are flushed, the service nodes won't accept new requests (they will respond with 503 errors).

At the Kubernetes level, the service node container becomes aware of a new pod once it shows up in the getPodIPs request. At that point it will return 503s until it has pinged each data node and seen that each one has the correct partition number (this generally takes about a second).
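Here is a hedged sketch of that readiness test. The /info route, the node_count field, and port 6101 are assumptions used only to illustrate the logic described above; they are not HSDS's actual interfaces.

```python
import requests  # illustrative; HSDS itself is asynchronous and uses its own routes

def cluster_ready(dn_ips, expected_node_count, dn_port=6101):
    """Return True once every DN reports the expected partition count (sketch).

    The route, field name, and port are assumptions; only the overall logic
    mirrors the behavior described in the comment above.
    """
    for ip in dn_ips:
        try:
            info = requests.get(f"http://{ip}:{dn_port}/info", timeout=1).json()
        except requests.RequestException:
            return False  # DN not reachable yet -> keep returning 503
        if info.get("node_count") != expected_node_count:
            return False  # DN hasn't picked up the new partition map yet
    return True
```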

429 errors aren't directly related -- they get sent by the service node when there are too many in-flight requests (the service is too busy). The solution here is to add more pods, which will increase the capacity of the service.

As you can see, there are lots of moving parts, and there's still some work to do to enable HSDS to reliably scale up and down to match the workload.

TManhente commented 2 years ago

I apologise in advance for commenting on an already-closed issue, but I was searching for clarification on the nodes in a Kubernetes deployment, and although the comments here were very helpful, there is one thing that is still unclear to me: I can see that in a Kubernetes deployment the head node can be replaced by the Kubernetes API, but why exactly is the rangeget node also not needed? Wouldn't the read-ahead be useful or needed when running HSDS in Kubernetes?

jreadey commented 2 years ago

Hi @TManhente - no problem!

Some background... For data stored in the native HSDS format, each chunk is stored as a separate object and is typically 2-8 MB in size, so chunks are relatively efficient to read into memory as needed. On the other hand, if HSDS is reading data out of an HDF5 file with a small chunk size (e.g. it's not uncommon to find HDF5 files with 4 KB or smaller chunks), having to do all those small reads can slow things down (particularly for data stored in S3 or Azure blobs).

So the rangeget proxy was introduced to speed things up in this latter case. ("rangeget" refers to reading a byte range rather than the entire object.) When bytes need to be read, the rangeget proxy gets more bytes than requested, on the assumption that data close to the original request is likely to be needed soon. This larger range is stored in memory (in an LRU cache), and if future requests can be served from the cache, the additional storage read can be avoided and the data fetched from memory. In practice, it's quite common to have HDF5 files where chunks that are logically adjacent are stored sequentially in the file space, so often you'll get a good hit rate on the cache and that will speed things up. Exactly how much is hard to say, since it depends on the chunk layout of the file, the application's access patterns, how relatively slow storage access is, etc.
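To illustrate the read-ahead / LRU idea (not the actual proxy implementation, which is asynchronous and storage-backed), here is a toy cache sketch. The block size, cache size, and fetch callback are all assumptions chosen for readability.

```python
from collections import OrderedDict

class RangeGetCache:
    """Toy read-ahead cache illustrating the rangeget idea (not HSDS code)."""

    def __init__(self, fetch_block, block_size=4 * 1024 * 1024, max_blocks=64):
        self._fetch_block = fetch_block   # callable: (key, offset, length) -> bytes
        self._block_size = block_size     # read-ahead granularity (e.g. 4 MB)
        self._cache = OrderedDict()       # (key, block_index) -> bytes, in LRU order
        self._max_blocks = max_blocks

    def read(self, key, offset, length):
        """Serve a byte range, fetching whole blocks so nearby reads hit the cache."""
        data = bytearray()
        first = offset // self._block_size
        last = (offset + length - 1) // self._block_size
        for block_index in range(first, last + 1):
            data.extend(self._get_block(key, block_index))
        start = offset - first * self._block_size
        return bytes(data[start:start + length])

    def _get_block(self, key, block_index):
        cache_key = (key, block_index)
        if cache_key in self._cache:
            self._cache.move_to_end(cache_key)   # mark as recently used
            return self._cache[cache_key]
        block = self._fetch_block(key, block_index * self._block_size, self._block_size)
        self._cache[cache_key] = block
        if len(self._cache) > self._max_blocks:
            self._cache.popitem(last=False)      # evict least recently used
        return block
```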

Unlike the SN or DN nodes, it's not useful to have more than one rangeget node per machine. One node will maximize the chances of a cache hit, and the rangeget proxy isn't doing anything that's CPU intensive. For Docker that's easy enough to set up -- the deployment yaml will have one rangeget container, and the DN containers will know to send rangeget requests to a known port on localhost.

For Kubernetes, it's a bit trickier. You could have one rangeget container per pod, but then data cached on one pod wouldn't be accessible to other pods, and the multiple rangeget caches would end up using a lot of memory. Likely the best approach would be to run the rangeget proxy as a Kubernetes DaemonSet (https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/). That way you could run one rangeget node per cluster node, and HSDS pods would always be able to access a rangeget proxy on the same machine. This would make the deployment a bit more complicated, and for now it seemed best to just leave rangeget as a Docker-only feature. Note that there won't be any functional difference running with or without the proxy, just a performance difference, if that.

If you think your application might benefit from using the rangeget proxy, I'd suggest doing some benchmarking for your use case with and without the proxy running in Docker. If you get a large speed-up with the proxy, it might be worthwhile to set up the proxy to run in Kubernetes and see how it performs. I'd be happy to advise on how to do that.