Closed bilalshaikh42 closed 2 years ago

Hello, I have seen the following errors pop up in our error reporting. They all seem to be related, and I am trying to figure out what could be causing them. I suspect a connection to the bucket failed, but rather than the failure being handled gracefully, one of the pods may have crashed. This then causes the lockup issue described in #104. I am not sure this is the actual sequence of events, but it seems likely based on observing the requests (most recent error last).
I've never seen anything like this, but I mostly use Amazon's S3, which rarely seems to have issues.
Do these errors correlate with a particular type of usage? E.g. only under heavy load?
I put in a fix for datanode.py line 163 - that call was just missing an argument for the logger.
Yes, this was under high load, so it's possible that the host's networking was saturated and unable to reach the bucket. There may be some missing catch statements that would prevent this from bubbling up to the top and crashing the application. I'll see if I can find where those might be needed.
Yes, likely there needs to be more robust handling for reads/writes to the storage system.
For read errors, we can just return a 500 to the client and have the client retry the request (ideally with some sort of exponential backoff). Errors during the bucketScan operation are non-critical, and the server can simply retry the scan after a bit. Errors during write operations need to be caught and retried by the s3sync task - it's critical to retain the data in memory until it has been successfully written.
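As a rough illustration, the client-side retry could look something like this (a minimal sketch; `get_with_retry` and its parameters are hypothetical, not an existing HSDS client API):

```python
import random
import time

import requests

def get_with_retry(url, max_retries=5, base_delay=0.5):
    """GET with exponential backoff on 5xx responses or connection errors."""
    for attempt in range(max_retries):
        try:
            rsp = requests.get(url, timeout=30)
            if rsp.status_code < 500:
                return rsp  # success, or a client error that retrying won't fix
        except requests.ConnectionError:
            pass  # transient network failure: fall through and retry
        # back off exponentially (0.5s, 1s, 2s, ...) with a little jitter
        time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")
```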
I'll look into adding some test scaffolding that randomly causes storage reads/writes to fail - in the style of Netflix's Chaos Monkey: https://github.com/Netflix/chaosmonkey. That will make it easier to reproduce these types of scenarios.
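Something along these lines, where a wrapper randomly injects failures around the storage calls (a sketch; the class and method names are illustrative, not HSDS's actual storage layer):

```python
import random

class FlakyStorage:
    """Wraps a storage client and randomly fails reads/writes."""

    def __init__(self, client, fail_rate=0.05):
        self._client = client
        self._fail_rate = fail_rate

    def _maybe_fail(self, op):
        # fail roughly fail_rate of the time to simulate a flaky bucket
        if random.random() < self._fail_rate:
            raise IOError(f"chaos: injected {op} failure")

    def read(self, key):
        self._maybe_fail("read")
        return self._client.read(key)

    def write(self, key, data):
        self._maybe_fail("write")
        return self._client.write(key, data)
```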
Hey @bilalshaikh42, I've checked in some changes to master that should make things more stable under high load. I'm not sure whether it will help your specific setup, but I'd be interested to hear any feedback if you can give it a try.
To simulate high-load workflows, I created the test hsds/tests/perf/write, which runs a set of pods all writing to one dataset.
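From the client side, the load pattern is roughly a set of concurrent writers each updating a slice of the same dataset. Here is a minimal sketch using h5pyd (the endpoint, domain path, and sizes are placeholders, not the actual test's values):

```python
import multiprocessing

import numpy as np
import h5pyd  # HSDS client with an h5py-compatible API

ENDPOINT = "http://hsds.local"        # placeholder endpoint
FILEPATH = "/home/test_user/perf.h5"  # placeholder domain path
NUM_WRITERS = 4
ROWS_PER_WRITER = 1000

def writer(rank):
    """Each 'pod' writes its own slice of the shared dataset."""
    f = h5pyd.File(FILEPATH, "a", endpoint=ENDPOINT)
    dset = f["data"]
    start = rank * ROWS_PER_WRITER
    dset[start:start + ROWS_PER_WRITER] = np.random.rand(ROWS_PER_WRITER)
    f.close()

if __name__ == "__main__":
    # create the shared dataset once
    with h5pyd.File(FILEPATH, "w", endpoint=ENDPOINT) as f:
        f.create_dataset("data", (NUM_WRITERS * ROWS_PER_WRITER,), dtype="f8")
    procs = [multiprocessing.Process(target=writer, args=(i,))
             for i in range(NUM_WRITERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```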
Sure, I can test this out and see if it helps! Is there a docker image with the changes available?
I have been able to resolve our issues for now by simply disabling the metadata cache and chunk cache (setting their sizes to 0). This works for us since we only have heavy write loads and light read loads, but I presume it is not an ideal solution. It might give some hint as to where the issue lies, however.
I've pushed the latest image to docker hub as: hdfgroup/hsds:ad8597f
Strange to hear that disabling the caching helped. I would have thought that wouldn't work - and when I tried setting the cache configs to 0, I do get an exception (a divide-by-zero error)!
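One way to avoid that exception would be to treat a size of 0 as "cache disabled" and bypass the cache entirely, e.g. (a sketch; these names are illustrative, not the actual HSDS internals):

```python
def get_chunk(chunk_id, cache, fetch_from_s3):
    """Read-through lookup that tolerates a disabled (size-0) cache.

    `cache` is None when the configured cache size is 0, so no code
    path ever divides by the cache size.
    """
    if cache is not None and chunk_id in cache:
        return cache[chunk_id]       # cache hit
    data = fetch_from_s3(chunk_id)   # cache miss, or caching disabled
    if cache is not None:
        cache[chunk_id] = data
    return data
```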
Anyway, try this image with and without the cache settings. For reading, I do see quite a bit of speedup when an item is in the cache - generally about 2x compared to when it has to be retrieved from S3.
This latest image, hdfgroup/hsds:36a7c61, might be even better!
This seems to be fully resolved!
great - I'll close the issue then.