HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Key Error #98

Closed bilalshaikh42 closed 2 years ago

bilalshaikh42 commented 3 years ago

When we fetch 5-10 datasets from the API, we sometimes get a series of 500 responses. It feels like some sort of caching issue, as repeating the call eventually returns the data. The error trace is below. I can provide more logs and debug info if needed.

KeyError: 'd-39d747f4-935cb3a3-92f7-d796ab-022100'
at get_metadata_obj (/usr/local/lib/python3.8/site-packages/hsds/datanode_lib.py:304)
at GET_Attributes (/usr/local/lib/python3.8/site-packages/hsds/attr_dn.py:65)
at _handle (/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py:458)
at start (/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py:418)

jreadey commented 3 years ago

I haven't seen that before... are you running the current master branch? Line 304 doesn't seem to correspond to a dict access: https://github.com/HDFGroup/hsds/blob/master/hsds/datanode_lib.py#L304.

bilalshaikh42 commented 3 years ago

I am using the Docker image with the pull policy set to latest, so it should be the latest code. I think the key error refers to an S3 key. The request for a particular dataset leads to the method on line 304 being called, which I believe looks up the id of the root group, and that is where the 'key' is not found in S3.

Here is the log output from the datanode. It seems to be caused by a problem retrieving the data from the S3 bucket, but I am not sure how to investigate further which end of the connection (HSDS or the bucket) is having trouble.

REQ> GET: /datasets/d-a95583a5-3ed7b37a-5585-de695e-0a4026/attributes [10.20.2.131:6101]
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/domain_dn.py", line 66, in GET_Domain
    domain_json = await get_metadata_obj(app, domain)
  File "/usr/local/lib/python3.8/site-packages/hsds/datanode_lib.py", line 304, in get_metadata_obj
    elapsed_time = time.time() - pending_s3_read[obj_id]
KeyError: 'biosimdev/results/60f1307bc9fb5400db9f0cad'
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/domain_dn.py", line 66, in GET_Domain
    domain_json = await get_metadata_obj(app, domain)
  File "/usr/local/lib/python3.8/site-packages/hsds/datanode_lib.py", line 304, in get_metadata_obj
    elapsed_time = time.time() - pending_s3_read[obj_id]
KeyError: 'biosimdev/results/60f1307bc9fb5400db9f0cad'
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/domain_dn.py", line 66, in GET_Domain
    domain_json = await get_metadata_obj(app, domain)
  File "/usr/local/lib/python3.8/site-packages/hsds/datanode_lib.py", line 304, in get_metadata_obj
    elapsed_time = time.time() - pending_s3_read[obj_id]
KeyError: 'biosimdev/results/60f1307bc9fb5400db9f0cad'
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/domain_dn.py", line 66, in GET_Domain
    domain_json = await get_metadata_obj(app, domain)
  File "/usr/local/lib/python3.8/site-packages/hsds/datanode_lib.py", line 304, in get_metadata_obj
    elapsed_time = time.time() - pending_s3_read[obj_id]
KeyError: 'biosimdev/results/60f1307bc9fb5400db9f0cad'

For this particular key, the next retry worked just fine.

Here is another example. This time, we do have a warning about not being able to read a specific key. But to me, a KeyError implies that there is no such object, not just a network timeout.

WARN> s3 read for object g-2e18bc11-3201fa3a-bf2c-abc090-0d3b83 timed-out, initiaiting a new read
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/link_dn.py", line 67, in GET_Links
    group_json = await get_metadata_obj(app, group_id, bucket=bucket)
  File "/usr/local/lib/python3.8/site-packages/hsds/datanode_lib.py", line 304, in get_metadata_obj
    elapsed_time = time.time() - pending_s3_read[obj_id]
KeyError: 'g-2e18bc11-3201fa3a-bf2c-abc090-0d3b83'
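
One possible reading of the traceback is that the KeyError comes from a Python dict lookup (pending_s3_read[obj_id]) rather than from a missing S3 object. The sketch below is not the HSDS code. It is a minimal illustration, under the assumption that an entry in a table of in-flight reads can be removed by another task while the current task is suspended. The names pending_reads, finish_read, wait_unsafe, and wait_safe are hypothetical.

```python
import asyncio
import time

# Hypothetical stand-in for a table of in-flight reads: obj_id -> start time.
pending_reads = {}


async def finish_read(obj_id):
    """Simulate the task that owns the read: it removes the entry when done."""
    pending_reads.pop(obj_id, None)


async def wait_unsafe(obj_id):
    """Check, suspend, then index: the entry can vanish during the await."""
    if obj_id in pending_reads:
        await asyncio.sleep(0)  # yield to the event loop; other tasks run here
        return time.time() - pending_reads[obj_id]  # raises KeyError if removed
    return None


async def wait_safe(obj_id):
    """Look the entry up once after the await and tolerate its absence."""
    if obj_id in pending_reads:
        await asyncio.sleep(0)
        start = pending_reads.get(obj_id)
        if start is None:
            return None  # the read completed in the meantime
        return time.time() - start
    return None


async def demo(waiter):
    obj_id = "g-example"
    pending_reads[obj_id] = time.time()
    finisher = asyncio.create_task(finish_read(obj_id))
    try:
        return await waiter(obj_id)
    except KeyError as exc:
        return f"KeyError: {exc}"
    finally:
        await finisher


unsafe_result = asyncio.run(demo(wait_unsafe))
safe_result = asyncio.run(demo(wait_safe))
print(unsafe_result)
print(safe_result)
```

If this assumption holds, the error would be intermittent and would disappear on retry, which matches what we see. The object exists in S3 the whole time; only the bookkeeping entry is gone.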

Unfortunately, I cannot reproduce the error consistently. Is there some error handling/retry that could be added?
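
In the meantime, a client-side workaround could look like the sketch below: retry a request a few times with exponential backoff when the server answers with a 5xx status. This is only an illustration, not part of HSDS or any client library. ServerError, call_with_retry, and the flaky_fetch stand-in are hypothetical names, and the real request call would replace flaky_fetch.

```python
import random
import time


class ServerError(Exception):
    """Hypothetical exception representing an HTTP error response."""

    def __init__(self, status):
        super().__init__(f"server returned {status}")
        self.status = status


def call_with_retry(fetch, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch(); on a 5xx ServerError, wait and try again.

    The delay doubles on each attempt and gets a little random jitter so
    that concurrent clients do not all retry at the same instant.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ServerError as exc:
            # Do not retry client errors, and give up after the last attempt.
            if exc.status < 500 or attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))


# Stand-in for a real request: fails twice with a 500, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ServerError(500)
    return {"attributes": []}


# A no-op sleep keeps the demonstration fast; omit it in real use.
result = call_with_retry(flaky_fetch, sleep=lambda seconds: None)
print(result, "after", calls["count"], "calls")
```

This only hides the symptom on the client side. It would not help if the server kept failing for longer than the retry window.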

jreadey commented 3 years ago

I think this issue got fixed sometime after the last push to Docker Hub. Could you try this image: hdfgroup/hsds:v0.7.0beta3 and let me know how that works? BTW, I think there are some issues with using the latest tag and pulling the right image from Docker Hub. See: https://stackoverflow.com/questions/37565507/pulling-the-latest-image-from-dockerhub. It's safer to just use an explicit tag.

bilalshaikh42 commented 3 years ago

I'll update the tag and see if the error pops up again. Thank you!

jreadey commented 2 years ago

Closing this - please re-open if you see the error again.