IBM / ibm-cos-sdk-python

Apache License 2.0

Getting 400 Bad Request From Server While Fetching Token Against API #63

Open alidurraniwanclouds opened 3 months ago

alidurraniwanclouds commented 3 months ago

```python
import ibm_boto3
from ibm_botocore.client import Config

api_key = 'XXXXXXX'
service_instance_id = 'XXXXXXXXXXX'
auth_endpoint = 'https://iam.cloud.ibm.com/identity/token/'
endpoint_url = 's3.XXXXXXXXXXXXXX'

s3 = ibm_boto3.resource(
    's3',
    ibm_api_key_id=api_key,
    ibm_service_instance_id=service_instance_id,
    ibm_auth_endpoint=auth_endpoint,
    config=Config(signature_version='oauth'),
    endpoint_url=endpoint_url,
)
```

This is pseudocode. I'm using multithreading to copy objects from one bucket to another, and one of my tasks is stuck with this error:

ibm_botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from https://iam.cloud.ibm.com/identity/token: HttpCode(400) - Retrieval of tokens from server failed.

avinash1IBM commented 3 months ago

@alidurraniwanclouds The code that you have is correct. Given the error information you shared, it is a 400, so it is possible that your API key has expired or been deleted. Can you please double-check that you are using the right API key?

alidurraniwanclouds commented 3 months ago

@avinash1IBM I'll double-check the above, but meanwhile I want to share my use case. Suppose there are 400k objects in a bucket and I'm copying them to another bucket. For optimization and fast execution I've been using multiprocessing and threading, so there are concurrent calls. Whenever I copy a huge number of objects from one bucket to another, it gives:

ibm_botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "https://s3.XXXX/bucket_name/file-197948.txt"

My configuration is:

```python
ibm_boto3.client(
    's3',
    config=Config(
        max_pool_connections=100,
        connect_timeout=3600,
        read_timeout=3600,
    ),
    **target_cos_credentials,
)
```

Note: it works fine with a smaller number of objects, e.g. 50-60.

I can send the complete traceback for debugging if you want.
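For illustration, here is a minimal sketch of the kind of bounded-concurrency copy loop I mean; the bucket names, worker count, and client setup are placeholders, not my actual code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import ibm_boto3

SOURCE_BUCKET = 'source-bucket'  # hypothetical name
TARGET_BUCKET = 'target-bucket'  # hypothetical name

client = ibm_boto3.client('s3')  # credentials and Config omitted here

def copy_one(key):
    # copy_object is a server-side copy; object data does not pass
    # through the client, but the call still holds an HTTP connection.
    client.copy_object(
        CopySource={'Bucket': SOURCE_BUCKET, 'Key': key},
        Bucket=TARGET_BUCKET,
        Key=key,
    )
    return key

paginator = client.get_paginator('list_objects_v2')
keys = [obj['Key']
        for page in paginator.paginate(Bucket=SOURCE_BUCKET)
        for obj in page.get('Contents', [])]

# Keep the worker count below max_pool_connections so threads don't
# queue up waiting for free connections and hit read timeouts.
with ThreadPoolExecutor(max_workers=50) as pool:
    futures = [pool.submit(copy_one, k) for k in keys]
    for future in as_completed(futures):
        future.result()  # re-raises any exception from the worker
```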

avinash1IBM commented 3 months ago

Hey @alidurraniwanclouds, for your use case you can configure a replication rule on the bucket, and COS takes care of copying your objects from one bucket to the other; you can try that. Coming to the error you shared: are the objects large? That could cause the timeout issues, or it could also be due to the concurrent connections. It would be helpful if you shared the stack trace for debugging.
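For illustration, setting this up through the SDK would look roughly like the sketch below. Treat the field values as placeholders; the exact `ReplicationConfiguration` shape, especially how the destination bucket is identified, should be taken from the COS replication docs:

```python
import ibm_boto3

client = ibm_boto3.client('s3')  # credentials and Config omitted

# Replication requires versioning on both source and target buckets.
for bucket in ('source-bucket', 'target-bucket'):  # hypothetical names
    client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={'Status': 'Enabled'},
    )

client.put_bucket_replication(
    Bucket='source-bucket',
    ReplicationConfiguration={
        'Rules': [{
            'ID': 'copy-everything',
            'Status': 'Enabled',
            'Priority': 1,
            'Filter': {'Prefix': ''},  # replicate all objects
            'DeleteMarkerReplication': {'Status': 'Disabled'},
            'Destination': {'Bucket': 'TARGET_BUCKET_CRN'},  # placeholder
        }],
    },
)
```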

alidurraniwanclouds commented 3 months ago

Sure @avinash1IBM, here is the traceback:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/local/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/local/lib/python3.10/ssl.py", line 1104, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.10/ssl.py", line 1375, in do_handshake
    self._sslobj.do_handshake()
TimeoutError: [Errno 110] Connection timed out
```

During handling of the above exception, another exception occurred:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/ibm_botocore/httpsession.py", line 455, in send
    urllib_response = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.10/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 407, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 358, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: AWSHTTPSConnectionPool(host='', port=443): Read timed out. (read timeout=3600)
```

During handling of the above exception, another exception occurred:

```
Traceback (most recent call last):
  File "/vpcplus-ibm-be/ibm/tasks/draas_tasks/utils.py", line 342, in copy_bucket_object
    copied_response = target_bucket_client.copy_object(CopySource=copy_source, Bucket=target_bucket_name,
  File "/usr/local/lib/python3.10/site-packages/ibm_botocore/client.py", line 531, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.10/site-packages/ibm_botocore/client.py", line 947, in _make_api_call
    http, parsed_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/ibm_botocore/client.py", line 970, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.10/site-packages/ibm_botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.10/site-packages/ibm_botocore/endpoint.py", line 231, in _send_request
    raise exception
  File "/usr/local/lib/python3.10/site-packages/ibm_botocore/endpoint.py", line 281, in _do_get_response
    http_response = self._send(request)
  File "/usr/local/lib/python3.10/site-packages/ibm_botocore/endpoint.py", line 377, in _send
    return self.http_session.send(request)
  File "/usr/local/lib/python3.10/site-packages/ibm_botocore/httpsession.py", line 492, in send
    raise ReadTimeoutError(endpoint_url=request.url, error=e)
ibm_botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "https:////file-145746.txt"
```
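For reference, one way to guard this copy_object call against such timeouts is an application-level retry; a minimal sketch, with an illustrative helper name and backoff values:

```python
import time

from ibm_botocore.exceptions import ConnectTimeoutError, ReadTimeoutError

def copy_with_retry(client, copy_source, bucket, key, attempts=3):
    # Retry only on timeout errors; other failures propagate immediately.
    for attempt in range(1, attempts + 1):
        try:
            return client.copy_object(
                CopySource=copy_source, Bucket=bucket, Key=key,
            )
        except (ReadTimeoutError, ConnectTimeoutError):
            if attempt == attempts:
                raise
            time.sleep(2 ** attempt)  # back off: 2s, 4s, ...
```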

avinash1IBM commented 2 months ago

@alidurraniwanclouds What API call are you making? Is it just copy_object, or are other API calls being made as well? API calls can take longer than expected when network connection issues occur. If you make an API call using the SDK and the call fails, the SDK automatically retries it; the default is 3 retries. Have you tried updating the ibm_boto3 timeout and retry configurations? Here is the ibm_boto3 documentation on configuration.

alidurraniwanclouds commented 2 months ago

Hello @avinash1IBM. Yes, I'm only using copy_object. Here is the configuration of my copy client:

```python
config = Config(
    tcp_keepalive=True,
    max_pool_connections=200,
    connect_timeout=3600,
    read_timeout=3600,  # set to 1 hour
    retries={
        'max_attempts': 0,
        'mode': 'standard',
    },
)
target_bucket_client = ibm_boto3.client('s3', config=config, **target_cos_credentials)
```

Note: the reason I have set max_attempts = 0 is that the SDK creates multiple versions of objects when max_attempts is 3, 4, etc., and I don't want that, so it's set to 0. Also, when there are 400k objects in the bucket it copies around 39990 of them; some objects give an error, and after retrying it copies the remaining ones. The objects are roughly 30 KB in size.
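For reference, one way to make re-runs idempotent without relying on max_attempts is to check the target before copying; a minimal sketch with a hypothetical helper name:

```python
from ibm_botocore.exceptions import ClientError

def copy_if_missing(client, source_bucket, target_bucket, key):
    src = client.head_object(Bucket=source_bucket, Key=key)
    try:
        dst = client.head_object(Bucket=target_bucket, Key=key)
        if dst['ContentLength'] == src['ContentLength']:
            return  # same size already present; skip to avoid a new version
    except ClientError as e:
        # A 404 just means the target object doesn't exist yet.
        if e.response['Error']['Code'] != '404':
            raise
    client.copy_object(
        CopySource={'Bucket': source_bucket, 'Key': key},
        Bucket=target_bucket,
        Key=key,
    )
```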

avinash1IBM commented 2 months ago

> The reason I have set max_attempts = 0 it creates multiple versions of objects if max_retries are 3,4 etc

It will not be a problem, since the retry only happens on a failure, not on a success.

alidurraniwanclouds commented 2 months ago

I mean that even in the case of a failure, if there is only one (current) version in source_bucket, there should not be two versions of the same object in target_bucket. And it's not that I'm facing this issue in only one bucket; I face the same issue in any bucket with a large number of objects. So in my use case failures will be frequent, and a failure means another version.

avinash1IBM commented 2 months ago

> failure means another version.

This means that your code is making that request more than once, possibly from different threads. Only when there is an object with the same key name and you upload another object with that name does it create a new version of that object.

alidurraniwanclouds commented 2 months ago

No, there is only one call for each object. The issue is that whenever I assign a value to max_attempts and, say, the first attempt fails and the SDK retries, the second attempt creates a version. On my side there is only one call for each object to be copied.

avinash1IBM commented 2 months ago

The SDK retries only when a request fails due to certain retryable errors such as network issues, throttling, or server-side problems. Successful requests are not retried. So in your case, if the first request failed, the object was not copied to the destination bucket. The SDK doesn't maintain any state, so it can't create a version.

alidurraniwanclouds commented 2 months ago

@avinash1IBM I get your point, but the SDK is not behaving the way you're describing: it creates a version in the case of a retry. Let me elaborate. Suppose there is an object of 30 MB in source_bucket. When retries happen, the versions in the target bucket are 18 MB, 15 MB, and so on (the object does not get copied completely); after the retries, if the success case happens, it creates the current version with the original size of the object in source_bucket.

avinash1IBM commented 2 months ago

@alidurraniwanclouds So are you seeing a failure case where objects are not completely copied? What status code did you get during those failures? The copy object operation is atomic, meaning the object is either fully copied or not copied at all. But you are saying the object is copied partially, so there is a chance your code is not doing exactly what you think it is doing.
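One way to check this is to list the stored versions of a key in the target bucket and compare their sizes against the source object; a minimal sketch with a hypothetical helper name:

```python
def inspect_versions(client, source_bucket, target_bucket, key):
    # Compare each stored version's size in the target bucket against
    # the source object's current size.
    src_size = client.head_object(Bucket=source_bucket, Key=key)['ContentLength']
    resp = client.list_object_versions(Bucket=target_bucket, Prefix=key)
    for version in resp.get('Versions', []):
        if version['Key'] != key:
            continue
        status = 'matches source' if version['Size'] == src_size else 'size differs'
        print(version['VersionId'], version['Size'], status,
              '(latest)' if version['IsLatest'] else '')
```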

alidurraniwanclouds commented 2 months ago

@avinash1IBM It raises the same exception I mentioned above regarding ReadTimeout.

Note: sometimes it copies the whole object as a version, sometimes it copies it partially.

avinash1IBM commented 2 months ago

@alidurraniwanclouds If you are receiving a ReadTimeout error, that means the request failed, in which case no object should have been copied. So please clarify: are you getting a ReadTimeout error and then observing an object partially copied to the destination bucket?

alidurraniwanclouds commented 2 months ago

@avinash1IBM Yes, that's exactly the case.

alidurraniwanclouds commented 2 months ago

@avinash1IBM I've also confirmed that the API keys are valid, but sometimes we still face: ibm_botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from https://iam.cloud.ibm.com/identity/token: HttpCode(400) - Retrieval of tokens from server failed.

moisesrc13 commented 2 months ago

Hi, I also have this error. On my local machine it runs fine, but I get HttpCode(400) when running the service on the IBM Cloud network. I noticed it neither retries nor waits the time from my config:

```python
config = Config(
    signature_version="oauth",
    max_pool_connections=5,
    connect_timeout=60,
    read_timeout=60,
    retries={
        "max_attempts": 3,
        "mode": "standard"
    }
)
```

avinash1IBM commented 2 months ago

@alidurraniwanclouds For the credential retrieval error, a 400 indicates a bad request, and a common cause is an expired API key; but you said the API key is correct. So can you enable logging and share with me the request ID of the failed request from the debug logs? You can enable debug logs like below:

```python
import logging
logging.basicConfig(filename='debug_python.log', filemode='w', level=logging.DEBUG)
```

PS: Don't share the entire debug logs, as they contain IAM token information; if you do share them, double-check that you are not exposing any sensitive information.

avinash1IBM commented 2 months ago

@moisesrc13 400 errors will not be retried, since 400s are client-side issues. Can you share the error log for the 400 that you got?

alidurraniwanclouds commented 2 months ago

@avinash1IBM I can't enable logging, as this is in our prod environment. Is there any other way to debug this issue?
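One lower-impact option is to scope debug logging to the credentials module only, rather than root-level DEBUG; a minimal sketch, assuming the `ibm_botocore.credentials` logger name:

```python
import logging

# Log only credential/token-refresh activity, not every request,
# to keep prod log volume and sensitive output small.
handler = logging.FileHandler('cos_credentials_debug.log')
handler.setLevel(logging.DEBUG)

cred_logger = logging.getLogger('ibm_botocore.credentials')
cred_logger.setLevel(logging.DEBUG)
cred_logger.addHandler(handler)
```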

faizansiddique11 commented 1 month ago

Hi there, I am also getting a similar error and some background thread errors as well.

ibm_botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from https://iam.cloud.ibm.com/identity/token: HttpCode(400) - Retrieval of tokens from server failed.

ERROR Exiting background refresh thread: Error when retrieving credentials from https://iam.cloud.ibm.com/identity/token: HttpCode(400) - Retrieval of tokens from server failed.

Can someone explain this code: https://github.com/IBM/ibm-cos-sdk-python-core/blob/master/ibm_botocore/credentials.py#L2624? Why would it cause an issue?

avinash1IBM commented 1 month ago

@faizansiddique11 HTTP 400 is a client-side issue, so it might occur because the API key you are using is expired or incorrect. The Python SDK handles token management, and when the token is about to expire, a background thread tries to refresh it using the same API key. Is the issue intermittent?

faizansiddique11 commented 1 month ago

@avinash1IBM Yes, the key is not expired or incorrect, and my tasks also complete successfully. I am only seeing this error log, and I'm not sure what to do and how to fix it.

faizansiddique11 commented 1 month ago

@avinash1IBM I am using COS HMAC keys; could that be an issue?

avinash1IBM commented 1 month ago

@faizansiddique11 If you are using HMAC keys, you will not encounter the token retrieval error you mentioned; only with an (IAM) API key might you encounter it. So there might be some other client initialization happening in your code with an API key that is causing the error. The SDK retries fetching the token up to 3 times on failure, which is likely what you saw in your logs.
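For illustration, here is a hedged side-by-side of the two initializations in question; endpoint and key values are placeholders:

```python
import ibm_boto3
from ibm_botocore.client import Config

ENDPOINT = 'https://s3.us-south.cloud-object-storage.appdomain.cloud'  # placeholder

# IAM API key: the SDK fetches and refreshes IAM tokens in the background,
# which is where a CredentialRetrievalError (HTTP 400) can originate.
iam_client = ibm_boto3.client(
    's3',
    ibm_api_key_id='API_KEY',                # placeholder
    ibm_service_instance_id='INSTANCE_CRN',  # placeholder
    config=Config(signature_version='oauth'),
    endpoint_url=ENDPOINT,
)

# HMAC keys: requests are signed directly; the IAM token endpoint is
# never contacted, so this client cannot raise that 400.
hmac_client = ibm_boto3.client(
    's3',
    aws_access_key_id='HMAC_ACCESS_KEY_ID',          # placeholder
    aws_secret_access_key='HMAC_SECRET_ACCESS_KEY',  # placeholder
    endpoint_url=ENDPOINT,
)
```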

faizansiddique11 commented 1 month ago

@avinash1IBM Yes, I am also using an API key for some calls: for listing buckets I use the API key, but for copying objects between buckets I use HMAC keys, and I perform these copy operations with multiprocessing and multithreading. So is it possible that IBM throws a 400 sometimes but the real error is something else? The API key is correct, and it works after these issues too.

avinash1IBM commented 1 month ago

@faizansiddique11 The issue then is not related to the copy operations at all, since you are using HMAC keys for those. The issue with the list API might be due to the account settings; you can refer to this similar issue.

faizansiddique11 commented 1 month ago

@avinash1IBM I did change the account settings to a lower number but still got this issue. This issue is for a customer, and I couldn't reproduce it on my end.

Also, it would be extremely helpful if I could get the exact reason from the token fetch API call as to why it returned a 400.