HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

HSDS Pods errors #133

Closed bilalshaikh42 closed 2 years ago

bilalshaikh42 commented 2 years ago

Hello, I have seen the following errors pop up in our error reporting. They all seem to be related, and I am trying to figure out what could be causing them. I suspect a connection to the bucket failed, but rather than being handled gracefully, the failure may have crashed one of the pods, which then causes the lockup issue described in #104. I am not sure this is the actual sequence of events, but it seems likely based on observing the requests (most recent error last).

 Traceback (most recent call last):
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/datanode.py", line 147, in bucketScan
    await scanRoot(app, root_id, update=True, bucket=bucket)
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/async_lib.py", line 436, in scanRoot
    await putStorJSONObj(app, info_key, results, bucket=bucket)
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/util/storUtil.py", line 393, in putStorJSONObj
    rsp = await client.put_object(key, data, bucket=bucket)
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/util/s3Client.py", line 357, in put_object
    raise HTTPInternalServerError()
aiohttp.web_exceptions.HTTPInternalServerError: Internal Server Error

 Traceback (most recent call last):
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/datanode_lib.py", line 884, in s3syncCheck
    update_count = await s3sync(app)
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/datanode_lib.py", line 865, in s3sync
    await notify_root(app, root_id, bucket=bucket)
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/datanode_lib.py", line 85, in notify_root
    await http_post(app, notify_req, data={}, params=params)
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/util/httpUtil.py", line 298, in http_post
    async with client.post(url, **kwargs) as rsp:
  File "/opt/env/hsds/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/opt/env/hsds/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
    await resp.start(conn)
  File "/opt/env/hsds/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 905, in start
    self._continue = None
  File "/opt/env/hsds/lib/python3.8/site-packages/aiohttp/helpers.py", line 656, in __exit__
    raise asyncio.TimeoutError from None

 Traceback (most recent call last):
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/datanode.py", line 147, in bucketScan
    await scanRoot(app, root_id, update=True, bucket=bucket)
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/async_lib.py", line 384, in scanRoot
    await getStorKeys(app, **kwargs)
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/util/storUtil.py", line 473, in getStorKeys
    key_names = await client.list_keys(**kwargs)
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/util/s3Client.py", line 588, in list_keys
    raise HTTPInternalServerError()
aiohttp.web_exceptions.HTTPInternalServerError: Internal Server Error

 Traceback (most recent call last):
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/util/s3Client.py", line 567, in list_keys
    async for page in paginator.paginate(
  File "/opt/env/hsds/lib/python3.8/site-packages/aiobotocore/paginate.py", line 32, in __anext__
    response = await self._make_request(current_kwargs)
  File "/opt/env/hsds/lib/python3.8/site-packages/aiobotocore/client.py", line 211, in _make_api_call
    http, parsed_response = await self._make_request(
  File "/opt/env/hsds/lib/python3.8/site-packages/aiobotocore/client.py", line 231, in _make_request
    return await self._endpoint.make_request(operation_model, request_dict)
  File "/opt/env/hsds/lib/python3.8/site-packages/aiobotocore/endpoint.py", line 81, in _send_request
    while await self._needs_retry(attempts, operation_model,
  File "/opt/env/hsds/lib/python3.8/site-packages/aiobotocore/endpoint.py", line 213, in _needs_retry
    responses = await self._event_emitter.emit(
  File "/opt/env/hsds/lib/python3.8/site-packages/aiobotocore/hooks.py", line 29, in _emit
    response = handler(**kwargs)
  File "/opt/env/hsds/lib/python3.8/site-packages/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/opt/env/hsds/lib/python3.8/site-packages/botocore/retryhandler.py", line 250, in __call__
    should_retry = self._should_retry(attempt_number, response,
  File "/opt/env/hsds/lib/python3.8/site-packages/botocore/retryhandler.py", line 269, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/opt/env/hsds/lib/python3.8/site-packages/botocore/retryhandler.py", line 316, in __call__
    checker_response = checker(attempt_number, response,
  File "/opt/env/hsds/lib/python3.8/site-packages/botocore/retryhandler.py", line 222, in __call__
    return self._check_caught_exception(
  File "/opt/env/hsds/lib/python3.8/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/opt/env/hsds/lib/python3.8/site-packages/aiobotocore/endpoint.py", line 147, in _do_get_response
    http_response = await self._send(request)
  File "/opt/env/hsds/lib/python3.8/site-packages/aiobotocore/endpoint.py", line 229, in _send
    return await self.http_session.send(request)
  File "/opt/env/hsds/lib/python3.8/site-packages/aiobotocore/httpsession.py", line 224, in send
    raise HTTPClientError(error=e)
botocore.exceptions.HTTPClientError: An HTTP Client raised an unhandled exception: [Errno 32] Broken pipe

 Traceback (most recent call last):
  File "/opt/env/hsds/lib/python3.8/site-packages/aiobotocore/httpsession.py", line 172, in send
    resp = await self._session.request(
  File "/opt/env/hsds/lib/python3.8/site-packages/aiohttp/client.py", line 559, in _request
    await resp.start(conn)
  File "/opt/env/hsds/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 898, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/opt/env/hsds/lib/python3.8/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 32] Broken pipe

Traceback (most recent call last):
  File "/opt/env/hsds/lib/python3.8/site-packages/hsds/datanode.py", line 163, in bucketScan
    log.error()
TypeError: error() missing 1 required positional argument: 'msg'

jreadey commented 2 years ago

I've never seen anything like this, but I'm most often using Amazon's S3, which rarely seems to have issues.

Do these errors correlate with a particular type of usage? E.g. only under heavy load?

I put in a fix for datanode.py line 163 - that was just missing the argument for the logger.
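
For context, the last traceback above is just log.error() being called without its required message argument; the fix is to pass a message. A minimal illustration of the mistake and its correction using the standard logging module (the actual message wording in the commit may differ):

    import logging

    log = logging.getLogger("hsds.datanode")

    # Before: calling error() with no arguments raises
    #   TypeError: error() missing 1 required positional argument: 'msg'
    # log.error()

    # After: pass a message (wording here is illustrative only)
    log.error("bucketScan: scanRoot failed for root")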

bilalshaikh42 commented 2 years ago

Yes, this was under high load, so it's possible that the host's networking was saturated and unable to reach the bucket. There may be some missing exception handling to prevent this from bubbling up to the top and crashing the application. I'll see if I can find where that might be needed.

jreadey commented 2 years ago

Yes, likely there needs to be more robust handling for reads/writes to the storage system.

For read errors, we can just return a 500 to the client and have the client retry the request (ideally with some sort of exponential backoff). Errors during the bucketScan operation are non-critical and the server can just retry the scan after a bit. Errors for write operations need to be caught and retried by the s3sync task - it's critical to retain the data in memory until it's successfully written.
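
To make the client-side retry idea concrete, here is a minimal sketch of retrying with exponential backoff; request_fn is a placeholder for whatever issues the HSDS request (for example an h5pyd read) and is not part of any HSDS API:

    import random
    import time

    def retry_with_backoff(request_fn, max_attempts=5, base_delay=0.5):
        """Retry request_fn on failure, waiting longer after each attempt."""
        for attempt in range(max_attempts):
            try:
                return request_fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up after the final attempt
                # exponential backoff with a little jitter: ~0.5s, 1s, 2s, ...
                delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
                time.sleep(delay)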

I'll look into adding some test scaffolding that randomly causes the storage reads/writes to fail - in the style of Netflix's Chaos Monkey: https://github.com/Netflix/chaosmonkey. That will make it easier to reproduce these types of scenarios.
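
A fault injector in that style could be as simple as a wrapper around the existing storage call that fails a configurable fraction of writes. This is only an illustration of the idea, not the scaffolding that was actually added; flaky_put_object and FAILURE_RATE are made-up names:

    import random

    from aiohttp.web_exceptions import HTTPInternalServerError

    FAILURE_RATE = 0.05  # fail roughly 5% of storage writes

    async def flaky_put_object(client, key, data, bucket=None):
        """Randomly fail a write so the server's retry/error paths get exercised."""
        if random.random() < FAILURE_RATE:
            raise HTTPInternalServerError()
        # otherwise delegate to the real storage client (signature as in the
        # traceback above: client.put_object(key, data, bucket=bucket))
        return await client.put_object(key, data, bucket=bucket)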

jreadey commented 2 years ago

Hey @bilalshaikh42, I've checked in some changes to master that should make things more stable under high load. I'm not sure whether they will help your specific setup, but I'd be interested to hear any feedback if you can give it a try.

To simulate high-load workflows, I created the test hsds/tests/perf/write, which runs a set of pods all writing to one dataset.

bilalshaikh42 commented 2 years ago

Sure, I can test this out and see if it helps! Is there a docker image with the changes available?

I have been able to resolve our issues for now by disabling the metadata cache and chunk cache, i.e. setting their sizes to 0. This works for us since we have heavy write loads and only light read loads, but I presume it is not an ideal solution. It might give some hint as to where the issue lies, however.
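
For reference, disabling the caches amounts to overriding the cache-size settings to 0 in the HSDS config, e.g. something like the following in an override file (assuming the cache-size keys are named metadata_mem_cache_size and chunk_mem_cache_size; check config/config.yml for the exact names in your version):

    # override.yml (hypothetical) - disable metadata and chunk caches
    metadata_mem_cache_size: 0
    chunk_mem_cache_size: 0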

jreadey commented 2 years ago

I've pushed the latest image to docker hub as: hdfgroup/hsds:ad8597f

Strange to hear about disabling the caching. I would have thought that wouldn't work - and when I tried setting the cache configs to 0, I do get an exception (a divide-by-zero error).

Anyway, try this image with and without the cache settings. For reading, I do see quite a bit of speedup when an item is in the cache - generally about 2x compared to when it has to be retrieved from S3.

jreadey commented 2 years ago

This latest image: hdfgroup/hsds:36a7c61 might be even better!

bilalshaikh42 commented 2 years ago

This seems to be fully resolved!

jreadey commented 2 years ago

Great - I'll close the issue then.