boto / botocore

The low-level, core functionality of boto3 and the AWS CLI.
Apache License 2.0

very rare ReferenceError #2970

Open tsah-alike opened 1 year ago

tsah-alike commented 1 year ago

Describe the bug

We are running python 3.8 on AWS lambda. We use boto3. The code is patched by aws-xray-sdk and lumigo tracer. Very rarely (every few months) we encounter a ReferenceError. This will happen again and again as long as the same instance of lambda is reused.

We could not find a way to reproduce it. All we have is the stack trace.

Expected Behavior

Should not raise ReferenceError.

Current Behavior

ReferenceError is raised

This particular one happened during PutItem on a DynamoDB Table object.

[ERROR] ReferenceError: weakly-referenced object no longer exists
application part of stack trace
    table_handler.update_item(
  File "/var/runtime/boto3/resources/factory.py", line 580, in do_action
    response = action(self, *args, **kwargs)
  File "/var/runtime/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/opt/python/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/python/wrapt/wrappers.py", line 644, in __call__
    return self._self_wrapper(self.__wrapped__, self._self_instance,
  File "/opt/python/aws_xray_sdk/ext/botocore/patch.py", line 38, in _xray_traced_botocore
    return xray_recorder.record_subsegment(
  File "/opt/python/aws_xray_sdk/core/recorder.py", line 462, in record_subsegment
    six.raise_from(exc, exc)
  File "<string>", line 3, in raise_from
  File "/opt/python/aws_xray_sdk/core/recorder.py", line 457, in record_subsegment
    return_value = wrapped(*args, **kwargs)
  File "/opt/python/botocore/client.py", line 943, in _make_api_call
    http, parsed_response = self._make_request(
  File "/opt/python/botocore/client.py", line 966, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/opt/python/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/opt/python/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/opt/python/botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "/opt/python/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/opt/python/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/opt/python/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/opt/python/botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "/opt/python/botocore/signers.py", line 149, in sign
    signature_version = self._choose_signer(
  File "/opt/python/botocore/signers.py", line 219, in _choose_signer
    handler, response = self._event_emitter.emit_until_response(

Reproduction Steps

I'm sorry, we did not manage to reproduce this.

Possible Solution

It seems that the RequestSigner class holds a weak reference to some object, but the case where that object has already been garbage collected is not handled. A possible fix: wrap the expression at botocore/signers.py, line 219, in a try/except block and handle ReferenceError.
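To illustrate the failure mode and the kind of guard proposed above, here is a minimal standard-library sketch. The `Emitter` class and the `emit_safely` helper are hypothetical stand-ins written for this example; they are not botocore code, and what a real fix should do after catching the error is left open.

```python
import gc
import weakref


class Emitter:
    """Hypothetical stand-in for the object behind the weak reference."""

    def emit_until_response(self, event_name):
        return None, f"handled {event_name}"


def emit_safely(proxy, event_name):
    """Call through a weak proxy; report a dead referent instead of raising."""
    try:
        return proxy.emit_until_response(event_name)
    except ReferenceError:
        # The referent has been garbage collected. The caller decides how to
        # recover (rebuild the object, retry, or surface a clearer error).
        return None, None


emitter = Emitter()
proxy = weakref.proxy(emitter)
alive_result = emit_safely(proxy, "choose-signer")

del emitter   # drop the only strong reference
gc.collect()  # CPython frees the object on `del` already; this makes the intent explicit
dead_result = emit_safely(proxy, "choose-signer")
```

Catching the error only hides the symptom: once the referent is gone the proxy never becomes usable again, which would be consistent with the error repeating on every call for the lifetime of the same Lambda instance.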

Additional Information/Context

We are running python 3.8 on AWS lambda. We use the official runtime. We use boto3. The code is patched by aws-xray-sdk and lumigo tracer.

SDK version used

unknown, included with python3.8 AWS Lambda runtime

Environment details (OS name and version, etc.)

python3.8 lambda runtime, intel processor

tim-finnigan commented 1 year ago

Thanks @tsah-alike for reaching out. Which version of botocore are you using? Can you share any code snippets that resulted in this error?

I think it would be worth opening an issue directly with the aws-xray-sdk-python repository for this.

tsah-alike commented 1 year ago

Thanks for responding @tim-finnigan, I'll open an issue there as well. The version is unknown since it's coming from the AWS Lambda runtime. My guess is it's pretty recent but not the most recent.

tsah-alike commented 1 year ago

We first noticed this bug about a year ago.

tim-finnigan commented 1 year ago

Hi @tsah-alike thanks for following up. Per the documentation on Lambda runtimes the packaged botocore version would be botocore-1.29.90. And it looks like aws-xray-sdk-python accepts versions going back as far as 1.11.3. You could confirm your version by checking the logs (adding boto3.set_stream_logger('') to your script) or just importing and printing it:

import botocore
print(botocore.__version__)

I'll link the related issue you created in the other repository: https://github.com/aws/aws-xray-sdk-python/issues/394

If you can share any other details such as code snippets or steps to reproduce then that may help narrow down the issue.

tsah-alike commented 1 year ago

The version is 1.29.156. We couldn't create a minimal working example. The last failure was something like this (simplified):

    import boto3
    from boto3.dynamodb.conditions import Key
    from botocore.config import Config

    config = Config(connect_timeout=1, read_timeout=5, retries={'max_attempts': 3})
    session = boto3.Session()
    resource = session.resource('dynamodb', config=config)
    # ... business logic ...
    res = resource.query(KeyConditionExpression=Key('post_id').eq(post_id), IndexName='post_id')
    # ... business logic ...
    resource.update_item(
        Key=key,
        UpdateExpression='set is_deleted = :is_deleted',
        ExpressionAttributeValues={':is_deleted': True}
    )
    # ^^^ ReferenceError is thrown here

This worked perfectly for months, but once the ReferenceError was thrown, the same Lambda instance failed in exactly the same way 266 times, even though the botocore session and the DDB resource are recreated on each invocation. Once the Lambda instance was replaced, the error stopped, and it has worked fine ever since (that was last week).

tsah-alike commented 1 year ago

I know it's not a lot of information. I did try my best to find steps to reproduce.

StickStack commented 4 months ago

Encountered same issue implementing https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/LambdaRedis.step2.html and running a Python3.10 lambda. Locally on my mac I do not get this issue unless it is an async function.

l0x commented 1 month ago

I have encountered this issue also. Also using the steps above for signing requests to auth on ElastiCache.

Digging a tiny bit into the source, I notice this, in the RequestSigner init method:

# We need weakref to prevent leaking memory in Python 2.6 on Linux 2.6
self._event_emitter = weakref.proxy(event_emitter)

Which looks like it could be the culprit (though I suspect there is more subtlety going on here that I am not aware of from my extremely cursory reading).

I am wondering how safe it would be to simply replace this weakref with a normal reference, since I am not using Python 2.6 on Linux 2.6. I am going to give this a shot, see whether it leads to a memory leak in my use case, and report back.

Hopefully the above can be a jumping off point for a more qualified boto developer to pick this up and have a look what's going on.
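On the memory-leak question, a small standard-library sketch may help. It assumes CPython 3 and uses toy `Emitter` and `Signer` classes (hypothetical, not botocore's) that reference each other strongly. The point is only that such a reference cycle is reclaimed by the cyclic garbage collector, so a strong reference by itself does not have to mean a permanent leak; botocore's real object graph may differ.

```python
import gc
import weakref


class Emitter:
    """Toy emitter that keeps strong references to its handlers."""

    def __init__(self):
        self.handlers = []


class Signer:
    """Toy signer holding a strong reference instead of weakref.proxy(emitter)."""

    def __init__(self, emitter):
        self._event_emitter = emitter


def build_cycle():
    emitter = Emitter()
    signer = Signer(emitter)
    emitter.handlers.append(signer)  # cycle: emitter -> signer -> emitter
    # Return only weak references so nothing outside the cycle keeps it alive.
    return weakref.ref(emitter), weakref.ref(signer)


gc.disable()  # keep the automatic collector out of the way for the demonstration
emitter_ref, signer_ref = build_cycle()
cycle_alive_before = emitter_ref() is not None  # reference counting alone cannot free a cycle

gc.collect()  # an explicit collection still works while automatic GC is disabled
cycle_alive_after = emitter_ref() is not None or signer_ref() is not None
gc.enable()
```

The trade-off is timing: objects in a cycle are released only when the collector runs, not as soon as the last outside reference disappears, so peak memory can be higher than with the weak-reference design.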

foo-up commented 2 weeks ago

> Encountered same issue implementing https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/LambdaRedis.step2.html and running a Python3.10 lambda. Locally on my mac I do not get this issue unless it is an async function.

I'm using adapted sample code in a Python 3.12 Lambda deployed with chalice 1.31.2. I set the redis_client outside the handler, and when I try to use it in the deployed environment, I get this error:

ReferenceError: weakly-referenced object no longer exists
Traceback (most recent call last):
  File "/var/task/chalice/app.py", line 1762, in __call__
    return self.handler(event_obj)
  File "/var/task/app.py", line 119, in periodic_task
    some_func1()
  File "/var/task/app.py", line 113, in some_func1
    upsert_elasticache(my_list)
  File "/var/task/app.py", line 105, in upsert_elasticache
    redis_client.set(my_key, my_value)
  File "/var/task/redis/commands/core.py", line 2333, in set
    return self.execute_command("SET", *pieces, **options)
  File "/var/task/redis/client.py", line 545, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/var/task/redis/connection.py", line 1074, in get_connection
    connection.connect()
  File "/var/task/redis/connection.py", line 289, in connect
    self.on_connect()
  File "/var/task/redis/connection.py", line 330, in on_connect
    auth_args = cred_provider.get_credentials()
  File "/var/task/cachetools/__init__.py", line 741, in wrapper
    v = func(*args, **kwargs)
  File "/var/task/app.py", line 52, in get_credentials
    signed_url = self.request_signer.generate_presigned_url(
  File "/var/task/botocore/signers.py", line 349, in generate_presigned_url
    self.sign(
  File "/var/task/botocore/signers.py", line 149, in sign
    signature_version = self._choose_signer(
  File "/var/task/botocore/signers.py", line 231, in _choose_signer
    handler, response = self._event_emitter.emit_until_response(
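One plausible reading of this trace, not confirmed anywhere in this thread, is that the signer outlives the only object holding a strong reference to the event emitter. The sketch below models that with toy classes (`Emitter`, `Session`, `Signer`, `CredentialProvider` are all hypothetical names for this example, not botocore's or the sample's) and shows a workaround that keeps the owner alive for as long as the signer is used.

```python
import weakref


class Emitter:
    def emit_until_response(self, event_name):
        return None, event_name


class Session:
    """Hypothetical owner that holds the only strong reference to the emitter."""

    def __init__(self):
        self.emitter = Emitter()


class Signer:
    """Toy signer that, like the snippet quoted earlier, stores only a weak proxy."""

    def __init__(self, emitter):
        self._event_emitter = weakref.proxy(emitter)

    def sign(self):
        return self._event_emitter.emit_until_response("choose-signer")


def make_signer_fragile():
    session = Session()             # local variable: released when the function returns
    return Signer(session.emitter)  # the signer now points at a soon-to-be-dead emitter


class CredentialProvider:
    """Workaround sketch: keep the session on the same object as the signer."""

    def __init__(self):
        self._session = Session()   # lives as long as the provider does
        self.signer = Signer(self._session.emitter)


fragile = make_signer_fragile()
try:
    fragile.sign()
    fragile_failed = False
except ReferenceError:
    fragile_failed = True

provider = CredentialProvider()
robust_result = provider.signer.sign()
```

If this reading applies, storing the session (or whatever owns the emitter) next to the signer, for example as an attribute of the credential provider, would be a low-risk thing to try before patching botocore itself.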