HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

DNs can get stuck in WAITING if there are pending writes that have not been processed while k8s is scaling #104

Closed jananzhu closed 1 year ago

jananzhu commented 2 years ago

This was found with HSDS running on a Kubernetes cluster:

Steps to reproduce:

  1. Initialize HSDS deployment with a single DN.
  2. Make multiple simultaneous write requests (we tested with 4 hsload commands on a 200 MB file)
  3. Scale the deployment up
  4. The hsload commands eventually time out since the DN is stuck in WAITING.

If the node number of the initial DN changes during scaling while the cache is dirty, the DN changes its status to WAITING and sleeps until the cache is no longer dirty. However, there may not be pending write requests in flight for every dirty chunk if the number of dirty chunks exceeds the configured max_pending_write_requests. If this is the case, the DN gets stuck in WAITING after the in-flight write requests complete - further runs of s3syncCheck just go back to sleep, and no tasks are scheduled to write the remaining dirty chunks to S3.

Not sure what the best solution would be, but it seems like we'd want some kind of SCALING status that would indicate that the DN is unavailable for new write requests while still allowing it to continue running pending writes from before the change in node number.
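
To make the failure mode concrete, here is a minimal sketch of a sync loop that sleeps on WAITING. The names (write_chunk_to_s3, the dirty_chunks/pending_writes bookkeeping) are hypothetical stand-ins, not the actual HSDS code:

```python
import asyncio

MAX_PENDING_WRITE_REQUESTS = 8  # stand-in for the configured limit

async def write_chunk_to_s3(app, chunk_id):
    # Placeholder for the real S3 flush; marks the chunk clean when done.
    await asyncio.sleep(0.1)
    app["dirty_chunks"].discard(chunk_id)
    app["pending_writes"].discard(chunk_id)

async def s3sync_check(app):
    while True:
        if app["node_state"] == "WAITING":
            # Node number changed while the cache was dirty: just sleep until clean.
            # Problem: dirty chunks that never got a write task (because the
            # pending-write budget was already full) are never scheduled here,
            # so the cache never becomes clean and the DN is stuck.
            await asyncio.sleep(1)
            continue
        budget = MAX_PENDING_WRITE_REQUESTS - len(app["pending_writes"])
        for chunk_id in list(app["dirty_chunks"])[:max(budget, 0)]:
            if chunk_id not in app["pending_writes"]:
                app["pending_writes"].add(chunk_id)
                asyncio.create_task(write_chunk_to_s3(app, chunk_id))
        await asyncio.sleep(1)
```

A SCALING status along the lines proposed above would let this loop keep scheduling writes for the remaining dirty chunks while a separate request-level check rejects any new incoming writes.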

jananzhu commented 2 years ago

We've tried simply removing the sleep on WAITING status in s3syncCheck but downstream write code errors out since the DN state is WAITING, which is why we think a new status might be needed.

jreadey commented 2 years ago

Yes, I expect a new wait state will be needed. I'll take a look at it. If you can just suspend the writers until the scale is complete that should work of course.

If your main goal is running multiple hsload's, take a look at the bucket loader repo: https://github.com/HDFGroup/hsds-bucket-loader. The idea here is that each pod can be more or less self-contained and you don't need to worry about scaling hsds relative to the number of load jobs.

jananzhu commented 2 years ago

Well, the goal for us is to be able to use a horizontal pod autoscaler to scale the HSDS cluster in response to heavy load, and with this issue, scaling while there is any write load on the cluster can cause it to enter an unrecoverable state - the deployment needs to be restarted to get rid of the DN that's stuck in WAITING.

jreadey commented 2 years ago

Sounds interesting - I'd be curious to learn more about your use case if you'd like to contact me via email.

bilalshaikh42 commented 2 years ago

We just ran into this exact issue. Our use case is that we can have bursts of activity when simulations are submitted in a batch, such as when importing from repositories. For these times, we would want to increase the number of replicas based on some metric, such as the number of 503s being returned by HSDS, rather than having enough capacity at all times.

jreadey commented 2 years ago

@bilalshaikh42 - yes, I think it will need some work to have a robust solution for the autoscaling issue. Likely will not be able to address in the upcoming release.

@jananzhu was able to work around this by using "sidecar" deployments. The idea is to have a loader pod that contains a container that loads the file plus SN and DN containers, i.e. a mini-HSDS in each pod. This way, as you scale up the loader pods you automatically scale the service, since the loader container is just sending requests within the pod.

This was the approach used in the bucket loader project: https://github.com/HDFGroup/hsds-bucket-loader. There's a global HSDS that has a table that is used to coordinate which pod will be loading each file. Take a look and let me know if that setup might work for you as well.

bilalshaikh42 commented 2 years ago

@jreadey Would you be able to help find the locations in the code that are relevant to this issue? I assume any change made to the Kubernetes states would also need to be reflected in the other deployment methods? Or is this issue unique to k8s? So far, I have found the following relevant places:

https://github.com/HDFGroup/hsds/blob/1e7e36c5f43af62595181d0222ed3008b725cb46/hsds/basenode.py#L290
https://github.com/HDFGroup/hsds/blob/1e7e36c5f43af62595181d0222ed3008b725cb46/hsds/datanode.py#L113
https://github.com/HDFGroup/hsds/blob/6e54321ef19679a01389e6816f72053227d1c9b1/hsds/datanode_lib.py#L877

We are running into the issue sometimes during high load, when the other applications in the cluster cause one of the hsds pods to get evicted (definitely not a use case I would expect to be supported), so I figured I would take a stab at it.

jreadey commented 2 years ago

Does your k8s deployment have a preStop notification? Like this: https://github.com/HDFGroup/hsds/blob/fancyindx/admin/kubernetes/k8s_deployment_aws.yml#L58.

And is that getting through to the containers? There's a preStop function in basenode.py. That should set the node_state to TERMINATING.

In any case, line 877 in datanode_lib.py should be changed so pending writes can be drained when a preStop is received. Actually, I think it would be fine just to take the if check out. I took it out just now - let me know if that causes any untoward effects.

If that works out, are you ok with requests getting 503 when the cluster state is not READY? In a scaling event the nodes should eventually sync up after a bit and the state should go back to READY.

Code for the check is here: https://github.com/HDFGroup/hsds/blob/master/hsds/hsds_logger.py#L130. I know hsds_logger is not where you'd expect to find this, but that was the easiest place to put a check for any request processing.
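
For reference, the gist of that check is just a guard at the start of request handling; here is a rough aiohttp-style sketch (not the actual hsds_logger code):

```python
from aiohttp import web

def assert_node_ready(app):
    # Reject requests with 503 while the node is not READY
    # (e.g. WAITING or TERMINATING during a scaling event).
    node_state = app.get("node_state", "INITIALIZING")
    if node_state != "READY":
        # Clients such as hsload are expected to retry on 503.
        raise web.HTTPServiceUnavailable(reason=f"node_state: {node_state}")
```

Since hsload treats 503 as retryable, once the nodes sync back up and the state returns to READY the retried requests should go through.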

bilalshaikh42 commented 2 years ago

Yup, the 503 is fine since we are using hsload with retries, so that should be good. Is there a docker image that includes the relevant change?

bilalshaikh42 commented 2 years ago

@jreadey would you mind releasing this to dockerhub? The latest I see is beta11 which does not include this change

jreadey commented 2 years ago

I've added a build based on the current master branch to dockerhub: hdfgroup/hsds:v0.7beta12. Let me know if this addresses the stuck in waiting issue.

bilalshaikh42 commented 2 years ago

Thank you!

So looking at the logs of the HSDS, it seems like the deployment is able to recover if one of the pods fails or if pods are added. But the hsload commands are failing with "Too many 503 error responses", so I am wondering if the pending writes are not actually being written even though the pod is not in the waiting state. It seems @jananzhu did try a similar approach, so maybe this is the same issue?

> We've tried simply removing the sleep on WAITING status in s3syncCheck but downstream write code errors out since the DN state is WAITING, which is why we think a new status might be needed.

bilalshaikh42 commented 2 years ago

Correction: after setting the logs to include warnings, it seems that the datanode is still getting stuck in the waiting state.

```
REQ> GET: /info [10.20.6.53:6101]
REQ> GET: /domains [biosim-hsds-dev/results/62336067451d4d4e2a8601ba]
WARN> returning 503 - node_state: WAITING
REQ> GET: /info [10.20.2.24:6101]
```

jreadey commented 2 years ago

Ok - I'll see if I can reproduce. As a temp hack, if you set the pod count to 0, wait for the pods to exit, and then scale back up to the desired value, that should work.

To avoid hsload dying, there's a --retries option to hsload that you can use to adjust the number of retry attempts. Default is 3, setting to 10 or so should give enough time for the server to recover.

Re: pending writes... the idea of the preStop handler was to give the DN containers time to flush their queue before the pod shuts down. Are these events getting through?

bilalshaikh42 commented 2 years ago

> As a temp hack, if you set the pod count to 0, wait for the pods to exit, and then scale back up to the desired value, that should work.

Yup, we have been doing this by just restarting the whole thing.

> Re: pending writes... the idea of the preStop handler was to give the DN containers time to flush their queue before the pod shuts down. Are these events getting through?

I can try to watch the logs while it's terminating, but is there a better way to determine this?

jreadey commented 2 years ago

I'm able to reproduce this by scaling the pods up/down while under load. This update: https://github.com/HDFGroup/hsds/commit/57166b7cdbf45c577abdfdcbbe224e4cb2637677, seems to fix it, at least in my test scenario. I wonder if it might help with biosimulations/biosimulations#133 as well... I was seeing writer errors causing the s3sync task to fail, causing the server not to go back into READY after a scaling event, etc.

BTW - I was using this: https://github.com/HDFGroup/hsds/tree/master/tests/perf/write to do dataset writes from multiple pods. Maybe it will be helpful for testing.

I put a container image in docker hub: hdfgroup/hsds:57166b7.

Let me know how it works!

jreadey commented 2 years ago

@bilalshaikh42 and/or @jananzhu - if you could report back on whether image hdfgroup/hsds:36a7c61 helps with this, that would be great. I'm hoping to get release 0.7 out soon and would like to cross this off the list of known bugs.

bilalshaikh42 commented 2 years ago

Will do. Working on this in biosimulations/deployment#69

jreadey commented 2 years ago

Comment on https://github.com/biosimulations/deployment/issues/69 regarding node scaling when there are pending write tasks: in most cases the on_shutdown function in datanode.py should prevent any new write requests from being accepted and flush existing writes. Kubernetes is supposed to give the pod at least 2 seconds to clean up its business, which I think should be sufficient in most cases.

But you need to be sure that your deployment yaml tells Kubernetes what to do when a pod is terminated. At some point the hsds docker build stopped including curl in the image, so the example yamls weren't doing the right thing. I changed the examples to use /usr/sbin/killall5 rather than /usr/sbin/curl (see: https://github.com/HDFGroup/hsds/commit/0248bdd627ad7d222233e4681f877b9faba91e3c) and can now see the on_shutdown function being called at termination.
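
For reference, a simplified sketch of what that shutdown path is expected to do (not the actual datanode.py/basenode.py code; the pending_writes bookkeeping is a hypothetical stand-in): flip the node state so new writes get 503s, then wait for the pending write queue to drain before the process exits.

```python
import asyncio
import logging

async def on_shutdown(app):
    # 1. Flip the node state so request handlers start returning 503s.
    app["node_state"] = "TERMINATING"

    # 2. Give pending write tasks a chance to flush to S3 before exiting.
    #    Bounded so a wedged write can't block pod termination past the
    #    grace period Kubernetes allows.
    for _ in range(30):
        if not app.get("pending_writes"):
            break
        logging.info("on_shutdown - waiting on %d pending writes",
                     len(app["pending_writes"]))
        await asyncio.sleep(1)

    logging.info("on_shutdown - done")
```

The preStop hook in the deployment yaml (or killall5, as in the commit above) is what triggers this handler, so it only has the pod's termination grace period to finish draining.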

bilalshaikh42 commented 2 years ago

This does seem to have partially resolved the problem! If I start a high load, then kill a pod, the service starts returning errors but then recalibrates and starts working again! 😄

However, it seems that hsload is not able to handle this. Previously, when the service was stuck, it would retry; now it seems the upload simply fails. Here are the logs from hsload:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/h5pyd/_hl/group.py", line 138, in _get_link_json
    rsp_json = self.GET(req)
  File "/usr/local/lib/python3.6/site-packages/h5pyd/_hl/base.py", line 893, in GET
    raise IOError(rsp.status_code, rsp.reason)
OSError: [Errno 404] Not Found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "h5py/h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "/home/FCAM/crbmapi/.local/lib/python3.6/site-packages/h5py/_hl/group.py", line 591, in proxy
    return func(name, self[name])
  File "/usr/local/lib/python3.6/site-packages/h5pyd/_apps/utillib.py", line 666, in copy_attribute_helper
    tgt = fout[name]
  File "/usr/local/lib/python3.6/site-packages/h5pyd/_hl/group.py", line 605, in __getitem__
    parent_uuid, link_json = self._get_link_json(name)
  File "/usr/local/lib/python3.6/site-packages/h5pyd/_hl/group.py", line 140, in _get_link_json
    raise KeyError("Unable to open object (Component not found)")
KeyError: 'Unable to open object (Component not found)'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/hsload", line 33, in <module>
    sys.exit(load_entry_point('h5pyd==0.8.4', 'console_scripts', 'hsload')())
  File "/usr/local/lib/python3.6/site-packages/h5pyd/_apps/hsload.py", line 314, in main
    load_file(fin, fout, verbose=verbose, dataload=dataload, s3path=s3path, compression=compression, compression_opts=compression_opts)
  File "/usr/local/lib/python3.6/site-packages/h5pyd/_apps/utillib.py", line 718, in load_file
    fin.visititems(copy_attribute_helper)
  File "/home/FCAM/crbmapi/.local/lib/python3.6/site-packages/h5py/_hl/group.py", line 592, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phi
```

jreadey commented 2 years ago

Do you see the on_shutdown running in the killed pod?

I suspect that the loader is creating a link, the pod returns a 200 for it, and then dies before flushing the data. The loader is then confused that the link it just created is not found.

bilalshaikh42 commented 2 years ago

That sounds reasonable. I am not sure what a good way to confirm that on_shutdown is running would be, but when I send the kill signal, the pod moves to the terminating state for 30s, which is the default grace period; I see this in the logs. Not sure why the timeout error would specifically happen at this point, but I think the pod ends up getting killed before it can flush everything. I'll increase the grace period and see if that helps.


```
REQ> PUT: /datasets/d-5bff5ea5-c5b5f18a-592b-bd3ecd-062f65/attributes/sedmlDataSetShapes [/results/62866a24cc0c14290dd5bbb6]
WARN> ClientError: Cannot connect to host 10.20.2.41:6101 ssl:default [Connect call failed ('10.20.2.41', 6101)]
REQ> GET: /groups/g-fa13d7b3-079bb285-729b-5f3b8f-133a0d/links/simulation.sedml [/results/62866a5284e6fa648e58a423]
WARN> ClientError: Cannot connect to host 10.20.2.41:6101 ssl:default [Connect call failed ('10.20.2.41', 6101)]
REQ> GET: /datasets/d-3049ce91-50eb9aa3-53ff-d0450d-d777ac [/results/62866a43cc0c14290dd5bd72]
REQ> GET: /datasets/d-c3155a16-c7c15aa2-4fa9-cafef4-38191f [/results/62866a4284e6fa648e58a2db]
WARN> ClientError: Cannot connect to host 10.20.2.41:6101 ssl:default [Connect call failed ('10.20.2.41', 6101)]
REQ> PUT: /datasets/d-971f6a9e-61ee041a-f4e6-58f447-2b465e/attributes/uri [/results/62866a45ada7e3df9fa34ae1]
REQ> POST: /datasets [/results/62866a4bada7e3df9fa34ba1]
REQ> GET: /groups/g-e9db5f4f-27323a46-6153-d7c7af-bab2ce/links/simulation.sedml [/results/62866a4784e6fa648e58a339]
REQ> GET: /groups/g-fdf1ca83-2d0a9288-da2f-621a17-a31e2f/links/report_del_Cln2 [/results/62866a2984e6fa648e58a169]
REQ> GET: /groups/g-971f6a9e-61ee041a-b08d-757a74-1f96a2/links/Figure_1_bottom_left [/results/62866a45ada7e3df9fa34ae1]
REQ> GET: /datasets/d-546809ae-ae143038-6324-795071-b45703 [/results/62866a41ada7e3df9fa34a95]
REQ> GET: /groups/g-3049ce91-50eb9aa3-3599-2dde92-2eecb6 [/results/62866a43cc0c14290dd5bd72]
WARN> ClientError: Cannot connect to host 10.20.2.41:6101 ssl:default [Connect call failed ('10.20.2.41', 6101)]
REQ> PUT: /datasets/d-e9db5f4f-27323a46-a7e0-e6f667-99e2eb/value [/results/62866a4784e6fa648e58a339]
REQ> GET: /datasets/d-792cd0ab-b848113e-5e62-cda1f4-892792 [/results/62866a54ada7e3df9fa34c52]
REQ> PUT: /groups/g-e2a75718-1c29a295-b2bc-a7674b-f9b4bc/links/report [/results/62866a1b84e6fa648e58a10c]
WARN> ClientError: Cannot connect to host 10.20.2.41:6101 ssl:default [Connect call failed ('10.20.2.41', 6101)]
Error handling request
ERROR> Unexpected exception <class 'asyncio.exceptions.TimeoutError'> for write_chunk_hyperslab(c-747d7552-6fc933ef-e700-f6e03d-65cd46_0_0):
WARN> ChunkCrawler.doWork - retry: 1, sleeping for 0.27
Error handling request
Error handling request
Error handling request
WARN> POST request error for url: http://10.20.3.117:6101/datasets status: 500
WARN> ClientError: Cannot connect to host 10.20.2.41:6101 ssl:default [Connect call failed ('10.20.2.41', 6101)]
ERROR> Exception: Internal Server Error from /info request
WARN> ClientError: Cannot connect to host 10.20.2.41:6101 ssl:default [Connect call failed ('10.20.2.41', 6101)]
ERROR> Exception: Internal Server Error from /info request
```

bilalshaikh42 commented 2 years ago

Our main issue does seem to be resolved though since the service overall does recover. Previously in this situation, nothing would work until a manual restart of everything. It does end up recovering now, even though some data is lost during the time the pod is down.

jreadey commented 2 years ago

There's an info log in the DN container: https://github.com/HDFGroup/hsds/blob/master/hsds/datanode.py#L320. I saw this show up in the logs when I scaled down the pods. Does your yaml have the command?

jreadey commented 2 years ago

BTW, I think the getPodsIP function in https://github.com/HDFGroup/hsds/blob/master/hsds/util/k8sClient.py needs to be enhanced to be aware of the pod state. As is, it looks like pod IPs are being returned even when a pod is in a terminating state. This leaves HSDS in a non-responsive state for longer, since requests from the SNs (in active pods) to DNs (in terminating pods) will time out.

Similarly, IPs from pods in the Pending state should be ignored (a pod could be pending indefinitely if there are no resources on the cluster to run it).
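
For illustration, filtering along those lines might look like the sketch below using the official kubernetes Python client (a hypothetical helper, not the actual k8sClient.py code):

```python
from kubernetes import client, config

def get_ready_pod_ips(namespace: str, label_selector: str):
    # Return IPs only for pods that are Running and not being deleted.
    # A terminating pod still reports phase == "Running" but has a
    # deletion_timestamp set; a Pending pod has no usable IP yet.
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    ips = []
    for pod in pods.items:
        if pod.status.phase != "Running":
            continue  # skip Pending / Failed / Succeeded pods
        if pod.metadata.deletion_timestamp is not None:
            continue  # skip pods that are terminating
        if pod.status.pod_ip:
            ips.append(pod.status.pod_ip)
    return ips
```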

bilalshaikh42 commented 2 years ago

Ah, I didn't have the logging level set low enough. Yup, I see the method being called. Seems to be working fine:

```
REQ> POST: /roots/g-c31069d6-1e5edf83-4b98-e15e96-d6570b [10.20.0.44:6101]
WARN> returning 503 - node_state: TERMINATING
REQ> POST: /roots/g-c31069d6-1e5edf83-4b98-e15e96-d6570b [10.20.0.44:6101]
WARN> returning 503 - node_state: TERMINATING
INFO> releasing http clients
INFO> on_shutdown - done
INFO> datanode exiting
```

jreadey commented 2 years ago

That leaves me confused as to why hsload would die. There's a warning log if a shutdown is received with pending write tasks, and we don't see that. If there are pending writes, the on_shutdown handler will sleep till they get flushed.

Note that the DN is properly returning 503s once the shutdown is handled, so no new writes should be accepted.

I'll do some experimenting on my k8s and see if I can repro.

jreadey commented 1 year ago

I'm closing this issue as it looks to be at least partially resolved. There are likely some remaining edge cases, but let's handle those in separate issues.