IBM / ibm-cos-sdk-python-core

Apache License 2.0

Refreshing temporary credentials failed during advisory refresh period #16

Closed gaskinner84 closed 1 year ago

gaskinner84 commented 2 years ago

At around 2022-08-28T09:10:03-0400 we noticed that our live page on usopen.org, which shows our match insights for tennis, went blank. This data comes from our CDN, whose origin is Cloud Object Storage (COS). We have a publisher that runs every 10 minutes and updates data in the buckets.

However, I saw the following errors from IAM, which appear in our Code Engine logs. It seems as though the IAM service periodically fails. It recovered fully around 10am EST.

This has affected our other projects too.

Could we understand why this disruption of service happens and how we can mitigate it?

Here is the error:

WARNING:ibm_botocore.credentials:Refreshing temporary credentials failed during advisory refresh period.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 371, in connect
    sslcontext=context,
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/ssl.py", line 386, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/local/lib/python3.7/ssl.py", line 423, in wrap_socket
    session=session
  File "/usr/local/lib/python3.7/ssl.py", line 870, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.7/ssl.py", line 1139, in do_handshake
    self._sslobj.do_handshake()
OSError: [Errno 0] Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 410, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 371, in connect
    sslcontext=context,
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/ssl.py", line 386, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/local/lib/python3.7/ssl.py", line 423, in wrap_socket
    session=session
  File "/usr/local/lib/python3.7/ssl.py", line 870, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.7/ssl.py", line 1139, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', OSError(0, 'Error'))

1) COS 'Bucket name' where these requests are being sent to a: 'fact-sheet-output' and 'fact-sheet-output-dev'

2) 'Source IP' address of machine making requests to COS with boto application a: The application is running in IBM Code Engine: wpipublisherprod.95zyfbovnzk.us-south.codeengine.appdomain.cloud

3) Can you give us more insight into the frequency of these errors? How often do these happen? a: 100 times per hour

IBM support found the following GitHub issue, which reports the same errors using boto3:

https://github.com/IBM/ibm-cos-sdk-python-core/issues/15

We reached out to #cos-sdk-support (https://ibm-cloudplatform.slack.com/archives/C01CSELQ2G2/p1661772831078759) and confirmed this is related to https://github.com/IBM/ibm-cos-sdk-python-core/issues/15 and https://github.com/IBM/ibm-cos-sdk-python-core/issues/12.

The solution proposed in the above Slack conversation was:


The default timeout value is 5 sec, and it seems that this is not sufficient for the customer's use case. While creating the client, the customer can change the value of this timeout with the following code: DefaultTokenManager.set_default_auth_function_timeout = <more time than 5 sec>

The client has additional questions regarding this:


I am creating a new COS instance with:

self.cos = ibm_boto3.client(service_name='s3',
                            ibm_api_key_id=kwargs['api_key'],
                            ibm_service_instance_id=kwargs['instance_id'],
                            config=Config(signature_version='oauth'),
                            endpoint_url=kwargs['endpoint_url'])

Regarding the code you referenced, DefaultTokenManager.set_default_auth_function_timeout = x: which package is DefaultTokenManager in?

As of now, I am using:

from ibm_botocore.client import Config, ClientError

Would I need to import another package?

We are live for the US Open now for day 1, so I am hesitant to make any code changes in production. We will have other events where this might need to be implemented, such as ESPN Fantasy Football.

Alternatively, are there any changes or scale-outs you can make on your end to resolve this from a Code Engine perspective?

Regards,

Gary S. ACS-Storage Support Lead IBM Cloud Support

avinash1IBM commented 2 years ago

The default timeout value is 5 sec. In your case, it seems that the default of 5 sec is not sufficient. To resolve this problem, you can increase the timeout to a value greater than 5 sec, according to your use case.

A client with a custom default timeout can be created as shown below:

# Imports needed
import ibm_boto3
from ibm_botocore.client import Config
from ibm_botocore.credentials import DefaultTokenManager

# Create the token manager, set the timeout, then create the client
token_manager = DefaultTokenManager(api_key_id=<api-key>, service_instance_id=<service-instance-id>)
DefaultTokenManager.set_default_auth_function_timeout = <timeout value for your use case>
client = ibm_boto3.client(service_name='s3', token_manager=token_manager,
                          config=Config(signature_version='oauth'), endpoint_url=<end-point>)

If this resolves your issue, please close it. Thank you.
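Separately from the timeout setting, a publisher that runs on a schedule could absorb short-lived connection failures like the ProtocolError in the traceback above by retrying the failed call with backoff. The following is only a minimal sketch under stated assumptions: `call_with_retries` is a hypothetical helper, not part of ibm_boto3 or ibm_botocore, and the default of retrying on `OSError` is an assumption, since the thread does not show which exception type actually reaches the calling code.

```python
import random
import time


def call_with_retries(func, attempts=4, base_delay=1.0, max_delay=30.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call func(), retrying with capped exponential backoff and jitter.

    Hypothetical helper for illustration; not part of the IBM COS SDK.
    retry_on is an assumption: pass the exception types your code
    actually observes.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt, capped at max_delay; the random
            # jitter avoids many workers retrying at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

A caller would wrap a single upload, for example `call_with_retries(lambda: upload_one(key, body))`, where `upload_one` stands in for whatever function the publisher already uses. This does not address why the token refresh is slow; it only limits how far a brief failure propagates.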

gaskinner84 commented 2 years ago

Hello - Is this a problem with the SDK or the version that the customer is using (i.e., a bug)? Or is the SDK team making a recommendation based on just the error?

Is there any sort of debugging we could do to help better identify the issue and why it may be occurring?

Regards,

Gary S. ACS-Storage Support Lead IBM Cloud Support

avinash1IBM commented 2 years ago

Hello, this is not a problem with the SDK or its version. The default timeout that the SDK sets is 5 sec, and judging by the error you posted, 5 sec is not sufficient in your case. That can happen for various reasons, one of which may be the client's environment settings. Providing a way for users to change the default values to suit their use case and environment is a standard approach, and we followed it here. It is not a bug.
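On the debugging question raised earlier in the thread, one low-risk option is to raise the log verbosity of the logger that emitted the warning, so that each failure carries a timestamp that can be compared against IAM-side records. The sketch below uses only the Python standard library. The logger name `ibm_botocore.credentials` is taken from the warning line in the report, and `urllib3` is the HTTP library seen in the traceback; the function name `enable_sdk_debug_logging` is hypothetical, and how much extra detail the SDK logs at DEBUG level is not confirmed in this thread.

```python
import logging


def enable_sdk_debug_logging(level=logging.DEBUG):
    """Raise verbosity for the loggers seen in the failing token refresh.

    Hypothetical helper for illustration; uses only the standard library.
    """
    # Timestamped format; basicConfig does nothing if the root logger
    # already has handlers configured by the application.
    logging.basicConfig(
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        level=logging.WARNING,
    )
    # Only these two logger trees become verbose; everything else
    # keeps the application's existing level.
    for name in ("ibm_botocore.credentials", "urllib3"):
        logging.getLogger(name).setLevel(level)
```

Calling `enable_sdk_debug_logging()` once at startup, before the client is created, is enough. Debug output from `urllib3` can be noisy, so this is better suited to a development bucket than to a live event.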

arnabm28 commented 1 year ago

@gaskinner84 If the above explanation is satisfactory, can we close this issue?

Thanks

avinash1IBM commented 1 year ago

Closing this ticket as resolved.