HDFGroup / hsds

Cloud-native, service-based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Improve resiliency for concurrent requests #369

Closed bengarrett33 closed 3 months ago

bengarrett33 commented 5 months ago

I am trying to set up HSDS to serve data for my application. When I make multiple parallel GETs to the service from my application, I frequently get 503s. I can mitigate this problem by implementing a generous retry policy in my application, but my question is: is this expected? Are there ways to improve the availability/reliability of HSDS? The data owner's documentation mentions the same error I am seeing, but it isn't clear to me exactly what the problem is or how my HSDS deployment could be made more robust to parallel requests.

To reproduce: start an HSDS container using the latest image with the following docker-compose.yml:

version: "3.8"
services:
  hsds:
    image: hdfgroup/hsds
    volumes:
      - ~/.aws:/root/.aws  # local AWS config (region etc.) made available to the container
    ports:
      - "5101:5101"  # expose the HSDS REST endpoint
    environment:
      - LOG_LEVEL=INFO
      - AWS_S3_GATEWAY=http://s3.us-west-2.amazonaws.com
      - AWS_S3_NO_SIGN_REQUEST=1  # anonymous (unsigned) requests to the public bucket
    entrypoint: hsds
docker compose up hsds

Then make parallel requests for data using a client, in this case h5pyd:

from concurrent.futures import ThreadPoolExecutor
import h5pyd

# Datasets to slice from the NSRDB file
keys = ['ghi', 'dni', 'dhi', 'air_temperature', 'surface_albedo']

def data_worker():
    # Open the public domain through the local HSDS endpoint
    with h5pyd.File(
        '/nrel/nsrdb/current/nsrdb_tmy-2022.h5',
        endpoint='http://localhost:5101',
        bucket='nrel-pds-hsds',
        retries=1,
    ) as h5_file:
        for key in keys:
            # Read a small hyperslab from each dataset
            h5_file.get(key)[:, 100:200]

def main():
    # Up to 5 worker threads running 10 tasks, each issuing several GETs,
    # so many requests hit HSDS concurrently
    with ThreadPoolExecutor(5) as executor:
        futures = [executor.submit(data_worker) for _ in range(10)]
        for future in futures:
            future.result()

if __name__ == '__main__':
    main()

This will (usually) fail with:

ERROR:root:got <class 'requests.exceptions.RetryError'> exception: HTTPConnectionPool(host='localhost', port=5101): Max retries exceeded with url: /datasets/d-1d865ecb-1c251681-7021-33f71a-2c6661/value?nonstrict=1&select=%5B0%3A8760%3A1%2C100%3A200%3A1%5D&domain=%2Fnrel%2Fnsrdb%2Fcurrent%2Fnsrdb_tmy-2022.h5&bucket=nrel-pds-hsds (Caused by ResponseError('too many 503 error responses'))
Traceback (most recent call last):
  File ".venv/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 894, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 884, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=5101): Max retries exceeded with url: /datasets/d-1d865ecb-1c251681-7021-33f71a-2c6661/value?nonstrict=1&select=%5B0%3A8760%3A1%2C100%3A200%3A1%5D&domain=%2Fnrel%2Fnsrdb%2Fcurrent%2Fnsrdb_tmy-2022.h5&bucket=nrel-pds-hsds (Caused by ResponseError('too many 503 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".venv/lib/python3.11/site-packages/h5pyd/_hl/httpconn.py", line 474, in GET
    rsp = s.get(
          ^^^^^^
  File ".venv/lib/python3.11/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/requests/adapters.py", line 510, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPConnectionPool(host='localhost', port=5101): Max retries exceeded with url: /datasets/d-1d865ecb-1c251681-7021-33f71a-2c6661/value?nonstrict=1&select=%5B0%3A8760%3A1%2C100%3A200%3A1%5D&domain=%2Fnrel%2Fnsrdb%2Fcurrent%2Fnsrdb_tmy-2022.h5&bucket=nrel-pds-hsds (Caused by ResponseError('too many 503 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".venv/lib/python3.11/site-packages/h5pyd/_hl/dataset.py", line 1160, in __getitem__
    rsp = self.GET(req, params=params, format="binary")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/h5pyd/_hl/base.py", line 962, in GET
    rsp = self.id._http_conn.GET(req, params=params, headers=headers, format=format, use_cache=use_cache)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/h5pyd/_hl/httpconn.py", line 490, in GET
    raise IOError("Unexpected exception")
OSError: Unexpected exception

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/bgarrett-mac/Documents/workspace/hsds/call.py", line 28, in <module>
    main()
  File "/Users/bgarrett-mac/Documents/workspace/hsds/call.py", line 24, in main
    future.result()
  File "/Users/bgarrett-mac/.asdf/installs/python/3.11.4/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/bgarrett-mac/.asdf/installs/python/3.11.4/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Users/bgarrett-mac/.asdf/installs/python/3.11.4/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bgarrett-mac/Documents/workspace/hsds/call.py", line 17, in data_worker
    h5_file.get(key)[:, 100:200]
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/h5pyd/_hl/dataset.py", line 1169, in __getitem__
    raise IOError(f"Error retrieving data: {ioe.errno}")
OSError: Error retrieving data: None

This indicates the client is receiving a 503 from HSDS. The log file hs.log does not indicate that any sort of error occurred in the HSDS container. Adding retries usually allows me to get the data I need, but I am trying to understand why the service is unavailable and whether there is a way to configure HSDS to be more robust to parallel requests.

jreadey commented 5 months ago

Thanks for trying out HSDS!

The 503 responses aren't failures as such; the server is just telling you to lower the request rate and retry the given request after a short delay.
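In other words, the client-side fix is a more generous retry budget. A minimal sketch, assuming h5pyd's retries argument (used with retries=1 in the reproduction script above) controls how many times the HTTP layer will back off and re-issue a request that came back 503:

import h5pyd

# Same open call as the reproduction script, but with a larger retry budget
# so that throttling 503s are retried instead of surfacing as errors.
with h5pyd.File(
    '/nrel/nsrdb/current/nsrdb_tmy-2022.h5',
    endpoint='http://localhost:5101',
    bucket='nrel-pds-hsds',
    retries=10,  # assumption: large enough for this level of concurrency
) as h5_file:
    data = h5_file.get('ghi')[:, 100:200]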

Background: each HSDS container keeps track of the number of in-flight requests. If that number exceeds a certain limit (100 by default), it fails subsequent requests with a 503 response. The idea is to avoid overtaxing the server or running the container out of memory.

The actual effective limit depends greatly on the type of load coming in (basically how memory- or CPU-intensive the requests are on average). You can change the default by creating a file hsds/admin/config/override.yml with the line "max_task_count: 999", where 999 is whatever you'd like max_task_count to be. Restart the server for the change to take effect.
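For example, a minimal override.yml (the 999 is just the placeholder value from above; choose a limit that matches your workload and memory budget):

# hsds/admin/config/override.yml
# Raise the per-container limit on concurrent in-flight requests
# (default is 100); restart the server afterwards so it takes effect.
max_task_count: 999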

If you see the containers regularly hitting 100% CPU, or restarting because of out-of-memory errors, you've probably set max_task_count too high and will want to scale it back a bit.
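One way to watch for that with the compose setup above (assuming you have no other container monitoring in place) is plain docker stats:

# Per-container CPU and memory usage, refreshed live; sustained 100% CPU or
# memory close to the limit suggests max_task_count has been set too high.
docker stats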

Let me know if this helps!

bengarrett33 commented 3 months ago

Very helpful, thank you!