IBM / ibm-cos-sdk-python


timeout error reported on client disconnect - retry doesn't work #32

Closed gilv closed 4 years ago

gilv commented 4 years ago

From time to time, seemingly at random, we hit a "timeout on reading" error from COS while reading data in many parallel threads. Working with the COS team, we identified that the real cause of the problem is a "client disconnect": the SDK then hangs until the timeout fires and reports a timeout error instead of a client disconnect error.

We experimented with all kinds of retry parameters in cos-sdk, but none of them help with client disconnect issues.

It looks like the retries config parameter for ibm-cos-sdk is only used by the high-level download_file API function, not by the get_object function we use.
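
For reference, the retries parameter in question is the one passed through ibm_botocore's Config when constructing the client; a minimal sketch (credentials and endpoint omitted, max_attempts value arbitrary):

    import ibm_boto3
    from ibm_botocore.client import Config

    # Credentials and endpoint_url omitted; the relevant part is the
    # retries dict, which plain get_object reads do not appear to honor.
    ibm_cos = ibm_boto3.client(
        's3',
        config=Config(retries={'max_attempts': 5}),
    )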

To resolve the issue we added a manual try/except retry. This code worked for us:

    try:
        data_stream = ibm_cos.get_object(Bucket=ds_bucket, Key=ds_segm_key)['Body']
        data = data_stream.read()
    except Exception:
        # Retry by re-issuing the request; read() runs on a fresh StreamingBody
        data_stream = ibm_cos.get_object(Bucket=ds_bucket, Key=ds_segm_key)['Body']
        data = data_stream.read()
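
A slightly more defensive variant of the same workaround, with a bounded number of attempts and a short backoff, would look something like this (the helper name, attempt count, and sleep interval are just illustrative):

    import time

    def read_object_with_retry(ibm_cos, bucket, key, attempts=3, backoff=1.0):
        # Each retry re-issues get_object, so read() runs on a fresh StreamingBody
        for attempt in range(attempts):
            try:
                data_stream = ibm_cos.get_object(Bucket=bucket, Key=key)['Body']
                return data_stream.read()
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(backoff * (attempt + 1))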

This code did not solve the problem:

    data_stream = ibm_cos.get_object(Bucket=ds_bucket, Key=ds_segm_key)['Body']
    try:
        data = data_stream.read()
    except Exception:
        # Retrying read() on the same StreamingBody fails again:
        # the underlying socket is already closed
        data = data_stream.read()

We believe the right approach is for cos-sdk to retry on client disconnect internally.

This is a link to the code where we configure the COS SDK. This is a link to the error log generated by the COS SDK reporting a timeout when the client disconnect happened.

gilv commented 4 years ago

Just to be sure there is no confusion here: "Added max req. attempts in COS client" #257 is not related to the issue I reported.

rtveitch commented 4 years ago

After looking into this a bit, the following read() method in the StreamingBody class appears to be where this exception is thrown: https://github.com/IBM/ibm-cos-sdk-python-core/blob/master/ibm_botocore/response.py#L79

    def read(self, amt=None):
        """Read at most amt bytes from the stream.
        If the amt argument is omitted, read all data.
        """
        try:
            chunk = self._raw_stream.read(amt)
        except URLLib3ReadTimeoutError as e:
            # TODO: the url will be None as urllib3 isn't setting it yet
            raise ReadTimeoutError(endpoint_url=e.url, error=e)
        self._amount_read += len(chunk)
        if amt is None or (not chunk and amt > 0):
            # If the server sends empty contents or
            # we ask to read all of the contents, then we know
            # we need to verify the content length.
            self._verify_content_length()
        return chunk

It's not clear to me how we would improve on this in the case of a basic get_object call on the underlying S3 client. StreamingBody is essentially a thin wrapper over the underlying socket, so when a timeout occurs during the read() call the socket is likely closed, and at this layer of the SDK there is no knowledge of S3 requests, so there is no way to retry. This explains why your first example works (you retry the request yourself and call read() on the new instance of StreamingBody) but the second doesn't (it attempts to call read() again on the same instance of StreamingBody).
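
To make that concrete, a working retry would catch the specific ReadTimeoutError surfaced by the code above and re-issue the request, reading from the new StreamingBody rather than the old one; roughly (the exception import follows the snippet above, the rest is a sketch using your variable names):

    from ibm_botocore.exceptions import ReadTimeoutError

    body = ibm_cos.get_object(Bucket=ds_bucket, Key=ds_segm_key)['Body']
    try:
        data = body.read()
    except ReadTimeoutError:
        # The old body is backed by a dead socket, so body.read() would
        # fail again; the retry must go through get_object once more.
        body = ibm_cos.get_object(Bucket=ds_bucket, Key=ds_segm_key)['Body']
        data = body.read()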

You mentioned that retries are only available in the high-level download_file method; is there a reason your integration cannot use it? If it is because you want to download the data into memory rather than to a file on disk, it is true that this method requires a filename as a string, which it then explicitly opens for writing. However, there is a variant called download_fileobj which takes a file-like object. You may be able to pass an instance of io.BytesIO as your file object to get the benefits of retries while still keeping the resulting data in memory (the object data is bytes, so BytesIO rather than StringIO is the buffer to use).
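
An untested sketch of that approach, reusing the bucket/key names from your example:

    import io

    buf = io.BytesIO()
    # download_fileobj goes through the transfer manager, which is where
    # the retry support discussed above lives.
    ibm_cos.download_fileobj(Bucket=ds_bucket, Key=ds_segm_key, Fileobj=buf)
    data = buf.getvalue()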

rtveitch commented 4 years ago

Based on internal discussion we think that this timeout error and associated enhancements are no longer in scope for the SDK:

  1. The SDK provides methods that are basically wrappers for the base API (e.g. a method to call PUT Object), and methods that implement higher-level logic to accomplish higher-level tasks, using those wrappers and the API underneath (e.g. the transfer manager).

  2. The base API wrapper methods have NO retry logic in them; the expectation is that the client either catches exceptions and retries when using these methods, or uses the higher-level SDK methods that include more logic, such as retries.

@gilv I will close this issue for now, please reach out if you have additional concerns or ideas.