Closed bilalshaikh42 closed 3 years ago
I haven't seen that before... are you running the current master branch? Line 304 doesn't seem to correspond to a dict access: https://github.com/HDFGroup/hsds/blob/master/hsds/datanode_lib.py#L304.
I am using the Docker image with the pull policy set to latest, so it should be the latest code. I think the key error refers to an S3 key. The request for a particular dataset leads to the method on line 304 getting called, which I believe looks up the id of the root group, and that is where the key is not found in S3.
Here is the log output from the datanode. It seems to be caused by some issue retrieving the data from the s3 bucket, but I am not sure how to further investigate which end of the connection (hsds or the bucket) is having trouble.
REQ> GET: /datasets/d-a95583a5-3ed7b37a-5585-de695e-0a4026/attributes [10.20.2.131:6101]
Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
resp = await task
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
resp = await handler(request)
File "/usr/local/lib/python3.8/site-packages/hsds/domain_dn.py", line 66, in GET_Domain
domain_json = await get_metadata_obj(app, domain)
File "/usr/local/lib/python3.8/site-packages/hsds/datanode_lib.py", line 304, in get_metadata_obj
elapsed_time = time.time() - pending_s3_read[obj_id]
KeyError: 'biosimdev/results/60f1307bc9fb5400db9f0cad'
[the identical traceback repeats three more times for the same key]
For this particular key, the next retry worked just fine.
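The fact that a retry succeeds suggests the KeyError is a race on the pending_s3_read dict rather than a missing S3 object. A minimal sketch of how such a race could arise under asyncio (this is a hypothetical illustration, not the actual HSDS code; the object id and delays are made up):

```python
import asyncio
import time

# Hypothetical in-memory map of object id -> time the S3 read started,
# mimicking the pending_s3_read dict in the traceback.
pending_s3_read = {}

async def reader(obj_id, delay):
    if obj_id in pending_s3_read:
        # Another task is already reading this object; wait for it.
        await asyncio.sleep(delay)
        # By now the first task may have finished and deleted the key,
        # so this lookup can raise KeyError even though the object exists.
        return time.time() - pending_s3_read[obj_id]
    pending_s3_read[obj_id] = time.time()
    await asyncio.sleep(delay)  # simulate the S3 fetch
    del pending_s3_read[obj_id]  # read done, entry removed

async def main():
    try:
        await asyncio.gather(
            reader("g-2e18bc11", 0.02),  # starts the "read"
            reader("g-2e18bc11", 0.05),  # waits, then hits the stale lookup
        )
        return None
    except KeyError as exc:
        return exc

err = asyncio.run(main())
print("KeyError:", err)  # KeyError: 'g-2e18bc11'
```

A defensive fix in this pattern would be `pending_s3_read.get(obj_id)` with a None check before computing the elapsed time.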
Here is another example. This time, we do have a warning about not being able to read a specific key. But "KeyError" suggests to me that there is no such object, not just a network timeout.
WARN> s3 read for object g-2e18bc11-3201fa3a-bf2c-abc090-0d3b83 timed-out, initiaiting a new read
Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
resp = await task
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
resp = await handler(request)
File "/usr/local/lib/python3.8/site-packages/hsds/link_dn.py", line 67, in GET_Links
group_json = await get_metadata_obj(app, group_id, bucket=bucket)
File "/usr/local/lib/python3.8/site-packages/hsds/datanode_lib.py", line 304, in get_metadata_obj
elapsed_time = time.time() - pending_s3_read[obj_id]
KeyError: 'g-2e18bc11-3201fa3a-bf2c-abc090-0d3b83'
Unfortunately, I cannot reproduce the error consistently. Is there some error handling/retry that could be added?
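In the meantime, a generic client-side retry with exponential backoff can paper over the transient 500s. This is a sketch, not part of HSDS or h5pyd; the function name and parameters are illustrative:

```python
import time

def get_with_retry(fetch, attempts=4, base_delay=0.5):
    """Call `fetch` (a zero-argument callable, e.g. a lambda wrapping
    an HSDS request) and retry on failure with exponential backoff.

    Retries attempts-1 times, sleeping base_delay * 2**attempt between
    tries, then re-raises the last error if all attempts fail."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Usage would be something like `get_with_retry(lambda: f["/results"].attrs["units"])`. In practice you would narrow the `except` clause to whatever exception your client raises for a 500 response, so genuine errors (bad path, auth failure) fail fast instead of being retried.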
I think this issue got fixed sometime after the last push to docker hub. Could you try this image: hdfgroup/hsds:v0.7.0beta3 and let me know how that works? BTW, I think there are some issues with using latest tag and pulling the right image from Docker hub. See: https://stackoverflow.com/questions/37565507/pulling-the-latest-image-from-dockerhub. It's safer to just use an explicit tag.
I'll update the tag, and see if the error pops up again. Thank you!
Closing this - please re-open if you see the error again.
When we fetch 5-10 datasets from the API, we sometimes get a series of 500 responses. It feels like some sort of caching issue, since repeating the call eventually returns the data. The error trace is below. I can provide more logs and debug info if needed.