IBM / ibm-cos-sdk-python

ibm-cos-sdk-python
Apache License 2.0
44 stars 26 forks source link

COS Python SDK (ibm_boto3_core) memory leak #52

Closed KyungmoIBM closed 12 months ago

KyungmoIBM commented 1 year ago

A customer reported a long running python application that uses ibm_boto3_core that the application is leaking threads / memory. Cusomer traced the issue to ibm_boto3_core. The below code illustrates the leak. Each iteration of the for loop leaks a thread at:

File "/Users/<user>/.local/share/virtualenvs/datalake-hive-table-manager-1Zni2EJ6/lib/python3.8/site-packages/ibm_botocore/credentials.py", line 2472, in _background_refresher
    self._background_thread_wakeup_event.wait(new_remaining)

When s3_client goes out of scope, the DefaultTokenManager's background_refresher thread stays alive.

I also notice that ibm_boto3_core registers an an exit handler to stop the thread (atexit.register(self._cleanup)). But, this doesn't help for long running applications.

api_key = os.environ.get('API_KEY')

bucket = "<bucket name>"
key = "<object key name>"

for i in range(100):
    print(f"\n\nIteration: {i}")
    s3_client = ibm_boto3.client('s3', endpoint_url=endpoint,
                                 config=Config(signature_version="oauth"), region_name='ignored',
                                 ibm_api_key_id=api_key
                                 )

    s3_client.head_object(Bucket=bucket, Key=key)

    for th in threading.enumerate():
        stack = "".join(traceback.format_stack(sys._current_frames()[th.ident]))
        print(f"Thread: {th}\n{stack}")

    time.sleep(2)

Example Output

Iteration: 1
Thread: <_MainThread(MainThread, started 4562834944)>
  File "/Users/<user>/workspace/aiops/datalake-hive-table-manager/src/private/real_test.py", line 39, in <module>
    stack = "".join(traceback.format_stack(sys._current_frames()[th.ident]))

Thread: <Thread(Thread-1, started daemon 123145454903296)>
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/<user>/.local/share/virtualenvs/datalake-hive-table-manager-1Zni2EJ6/lib/python3.8/site-packages/ibm_botocore/credentials.py", line 2472, in _background_refresher
    self._background_thread_wakeup_event.wait(new_remaining)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 558, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)

Thread: <Thread(Thread-2, started daemon 123145471692800)>
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/<user>/.local/share/virtualenvs/datalake-hive-table-manager-1Zni2EJ6/lib/python3.8/site-packages/ibm_botocore/credentials.py", line 2472, in _background_refresher
    self._background_thread_wakeup_event.wait(new_remaining)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 558, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
KyungmoIBM commented 1 year ago

I also test the code with Python 3.9.14 with ibm-cos-sdk-for-python latest version(2.12.0) and saw same result.

ibm-cos-sdk            2.12.0
ibm-cos-sdk-core       2.12.0
ibm-cos-sdk-s3transfer 2.12.0```

Output


Thread: <_MainThread(MainThread, started 140704422986944)>
  File "/Users/kyungmo/workspace/ibm-cos/cos-python/CS3089342-memory-leak.py", line 29, in <module>
    stack = "".join(traceback.format_stack(

Thread: <Thread(Thread-1, started daemon 123145534840832)>
  File "/Users/kyungmo/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/Users/kyungmo/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/Users/kyungmo/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/kyungmo/.pyenv/versions/3.9.14/lib/python3.9/site-packages/ibm_botocore/credentials.py", line 2472, in _background_refresher
    self._background_thread_wakeup_event.wait(new_remaining)
  File "/Users/kyungmo/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 581, in wait
    signaled = self._cond.wait(timeout)
  File "/Users/kyungmo/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)

Iteration: 1
Thread: <_MainThread(MainThread, started 140704422986944)>
  File "/Users/<user>/workspace/ibm-cos/cos-python/CS3089342-memory-leak.py", line 29, in <module>
    stack = "".join(traceback.format_stack(

Thread: <Thread(Thread-1, started daemon 123145534840832)>
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/site-packages/ibm_botocore/credentials.py", line 2472, in _background_refresher
    self._background_thread_wakeup_event.wait(new_remaining)
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 581, in wait
    signaled = self._cond.wait(timeout)
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)

Thread: <Thread(Thread-2, started daemon 123145551630336)>
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/site-packages/ibm_botocore/credentials.py", line 2472, in _background_refresher
    self._background_thread_wakeup_event.wait(new_remaining)
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 581, in wait
    signaled = self._cond.wait(timeout)
  File "/Users/<user>/.pyenv/versions/3.9.14/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)```
arnabm28 commented 1 year ago

@KyungmoIBM

Can we have a bit more details on what exactly the thread is doing in this case. Breaking down the sample code I can see that after the credentials are been read and client is created, the tread begins to execute.

It is this statement traceback.format_stack(sys._current_frames()[th.ident]) thats throws the error which we see in every iteration. Could you help us here in order to check what exactly is been attempted in here.

Breaking down the code I see the following output after running for two iterations

for th in threading.enumerate():
        print(th)
        print(th.ident)
        print(sys._current_frames)

OUTPUT

Iteration : 0

<_MainThread(MainThread, started 4440643072)>
4440643072
<built-in function _current_frames>
<Thread(Thread-1, started daemon 123145463848960)>
123145463848960
<built-in function _current_frames>
Iteration : 1

<_MainThread(MainThread, started 4440643072)>
4440643072
<built-in function _current_frames>
<Thread(Thread-1, started daemon 123145463848960)>
123145463848960
<built-in function _current_frames>
<Thread(Thread-2, started daemon 123145480638464)>
123145480638464
<built-in function _current_frames>
arnabm28 commented 1 year ago

There are few more findings from our end. The above pointed file (credentials.py) from ibm_botocore is used when generating token. If a generated token expired and new token is generated without waiting the token to expire then background token-refresh-thread will not finish processing it.

For Example :

def wait_till_token_refreshed(token_manager, wait_timeout=60):
    token_manager.wakeup_refresh_thread()
    for _ in range(ceil(wait_timeout / 2.0)):
        if token_manager._background_thread_wakeup_event.isSet():
            # Sleep and give a chance to background_thread to fetch the token.
            # Once the token is fetched, the background_thread clears the event.
            # Don't sleep for too long (>5 secs), because _advisory_refresh_timeout is still close
            # to the new token expiry time, the bg thread might wakeup and fetch a new token again.
            time.sleep(2)
        else:
            break
    else:
        raise Exception("Background token-refresh-thread did not finish processing in %d secs." % wait_timeout)

This is example of a method that can be used after a token is expired.

However in this regard there are few question that we need some clarity on.

  1. What is exactly meant when the user says "When s3_client goes out of scope, the DefaultTokenManager's background_refresher thread stays alive." What is out of scope here. Can you please explain this a bit.
  2. What kind of operation is being done in the thread. The example is not very clean of what is being done.

Can we have a bit more clarity on the what the ask is here if possible a better example to illustrate the memory leak scenario.

Thanks.

arnabm28 commented 1 year ago

@KyungmoIBM Can we have some update on this. This is open for some time and we had some conversation on the same. Will it be possible to have some clarification on this so that we can move forward with this.

Thanks

avinash1IBM commented 12 months ago

We are closing this issue as resolved as there is no further communication. Please create a new issue if you are facing the same problem again.