HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Implement smooth scale downs of a DCOS cluster #51

Closed s004pmg closed 4 years ago

s004pmg commented 4 years ago

In the head node, the health check has some notion of handling the loss of a data or service node, but the logic is focused on replacing that failed node only. I believe that this can be safely expanded to also scale down from a larger to a smaller cluster, e.g. for DCOS.
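A minimal sketch of the decision being proposed, assuming a simplified view of the head node's health sweep. The names (NodeState, reconcile_cluster, target_count) are illustrative and not taken from the HSDS codebase; the point is only that a missing node can be treated as a deliberate shrink when the desired node count has been lowered, rather than always being replaced.

```python
# Hypothetical sketch of the replace-vs-shrink decision; names are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class NodeState:
    node_id: str
    healthy: bool


def reconcile_cluster(nodes: List[NodeState], target_count: int) -> str:
    """Decide what the head node should do after a health sweep.

    Current behavior: any unhealthy node is replaced.
    Proposed behavior: if the target count has been lowered (e.g. a DCOS
    scale-down), drop the failed node and keep the surviving ones.
    """
    healthy = [n for n in nodes if n.healthy]
    if len(healthy) >= target_count:
        # Scale-down case: forget the dead node, keep the survivors.
        return "shrink"
    # Original case: the node really failed; schedule a replacement.
    return "replace"


if __name__ == "__main__":
    cluster = [NodeState("dn-1", True), NodeState("dn-2", True), NodeState("dn-3", False)]
    print(reconcile_cluster(cluster, target_count=2))  # -> "shrink"
    print(reconcile_cluster(cluster, target_count=3))  # -> "replace"
```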

jreadey commented 4 years ago

FYI, I created a "chaos monkey" config that causes any node to randomly die: https://github.com/HDFGroup/hsds/commit/1ba2d3687119e7bcfb1edcb511053e8ab22bbc80. Also a status_check tool that just makes a request to /about and notes whether the server state is "READY" or not.
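A rough approximation of the status_check idea described above: poll /about and report whether the server says it is READY. The endpoint URL and the "state" field name are assumptions based on this comment, not a copy of the actual tool.

```python
# Hedged sketch of a status_check-style probe; URL and field names are assumed.
import sys
import requests


def check_status(endpoint: str = "http://localhost:5101") -> bool:
    try:
        rsp = requests.get(f"{endpoint}/about", timeout=5)
        rsp.raise_for_status()
        state = rsp.json().get("state")
    except requests.RequestException as e:
        print(f"request failed: {e}")
        return False
    print(f"server state: {state}")
    return state == "READY"


if __name__ == "__main__":
    sys.exit(0 if check_status() else 1)
```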

Anyway, trying this out I can occasionally see the server go to a non-recoverable state. Still need to investigate what's triggering this.

jreadey commented 4 years ago

I think this checkin might fix the problem with recovering from a failed node: https://github.com/HDFGroup/hsds/commit/91901e7e906d91889759511c7920d406cc1ce500. I ran the server in "chaos monkey" mode for several hours and never got stuck in the INITIALIZING state.
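The verification above could be automated along these lines: keep polling /about while the chaos-monkey config kills nodes, and flag the run as failed if the server stays out of READY for too long. This is a sketch, not the actual test; the endpoint, the "state" field, and the 300-second threshold are assumptions.

```python
# Illustrative soak-test loop; endpoint, field name, and threshold are assumptions.
import time
import requests

HSDS_ENDPOINT = "http://localhost:5101"  # assumed local deployment
MAX_NOT_READY = 300                      # seconds allowed outside READY before failing


def soak(duration: float = 3600.0, interval: float = 10.0) -> bool:
    not_ready_since = None
    deadline = time.time() + duration
    while time.time() < deadline:
        try:
            state = requests.get(f"{HSDS_ENDPOINT}/about", timeout=5).json().get("state")
        except requests.RequestException:
            state = None
        if state == "READY":
            not_ready_since = None
        else:
            not_ready_since = not_ready_since or time.time()
            if time.time() - not_ready_since > MAX_NOT_READY:
                print(f"stuck in state {state!r} for over {MAX_NOT_READY}s")
                return False
        time.sleep(interval)
    return True
```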

Kubernetes will be a different matter, but I'm planning to put the head node back in the K8s deployment, so I'll do that first before working on scale up/scale down (and failed nodes) in K8s.

jreadey commented 4 years ago

Hopefully the checkin above resolved this issue. Please reopen if it pops up again.