ReadTimeoutError in DefaultTokenManager

LachlanStuart commented 3 years ago

I see ReadTimeoutErrors logged once or twice a week. Usually they seem to be just a warning emitted by the background refresh thread, but it has been raised at least once during a call to DefaultTokenManager.get_token, which has caused my application to fail.

Here's the stack trace of the error during a call to DefaultTokenManager.get_token. This code was running on a relatively low-traffic EC2 instance on AWS (Ireland region), so it's extremely unlikely to be caused by network issues on our end.

And here is one from the background refresh thread running on my local development computer. This one was likely due to a network lag spike. I was not using any IBM APIs at the time, so getting my Python console overrun by a warning message about a non-vital background process was quite unwelcome.

I'm using version 2.9.0 which was bundled with ibm-cos-sdk==2.9.0

DefaultTokenManager seems to deliberately set the timeout to 5 seconds. This seems quite low if only one attempt is made before failure.

I haven't seen any cases in the logs where this error has occurred multiple times in quick succession. Based on this, I think this could be resolved by wrapping this function with retry logic, possibly with increasing timeouts on each attempt, and preferably only logging an error if it fails after multiple retries.

nglange commented 3 years ago

Yeah, it's really weird that is hardcoded like that. Timeouts with the token manager are a known thorn, and I believe that it's being worked on. I'm not sure if the best fix is to have a configurable timeout, or to have a number of retried with exponential backoff, like other transient 5xx errors. Thanks for raising the issue - I'll check with the team and someone will post an update here.

roryodonnell commented 3 years ago

I'm also experiencing a issue that seems related to this with ibm-cos-sdk==2.6.3

Jun 3 21:36:24 staging-serv-upload-v2-56c69fc465-wk5s9 serv-upload-v2 WARNING WARNING:ibm_botocore.credentials:Refreshing temporary credentials failed during advisory refresh period.
Jun 3 21:36:24 staging-serv-upload-v2-56c69fc465-wk5s9 serv-upload-v2 Traceback (most recent call last):
  File "/home/fd4b/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/fd4b/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/http/client.py", line 1332, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 303, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 264, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
Jun 3 21:36:24 staging-serv-upload-v2-56c69fc465-wk5s9 serv-upload-v2 socket.timeout: The read operation timed out
Jun 3 21:36:24 staging-serv-upload-v2-56c69fc465-wk5s9 serv-upload-v2 During handling of the above exception, another exception occurred:
Jun 3 21:36:24 staging-serv-upload-v2-56c69fc465-wk5s9 serv-upload-v2 Traceback (most recent call last):
  File "/home/fd4b/.local/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/home/fd4b/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/home/fd4b/.local/lib/python3.8/site-packages/urllib3/util/retry.py", line 410, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/fd4b/.local/lib/python3.8/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/home/fd4b/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/home/fd4b/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/fd4b/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
    raise ReadTimeoutError(

IBMeric commented 3 years ago

FYI, this issue is related to https://github.com/IBM/ibm-cos-sdk-python-core/issues/15.

IBMeric commented 3 years ago

We have implemented an override in 2.11.0, though the changelog mistakenly omits this. Can you verify and close this ticket?

LachlanStuart commented 3 years ago

Thanks for the response @IBMeric .

I can't really test it on demand as it's a sporadic issue, but I checked the code changes and unless I'm missing something... I mean, technically, yes, the new DefaultTokenManager.set_default_auth_function_timeout function allows me to set a more reasonable timeout now, but the default is still an unreasonably short 5 seconds, and it looks like it will still spam my development PC with warning log messages from a background thread if I go offline...

I guess this can be closed as it technically resolves it from my side.... it's just... really not great API design when the library is unstable by default and the onus is on the developer to call an undocumented function just to reinstate sane defaults...

IBM / ibm-cos-sdk-python-core

ReadTimeoutError in DefaultTokenManager #12