MustaphaU opened this issue 8 months ago
I am suddenly facing this maximum recursion depth issue as well when trying to check whether an object exists in the S3 bucket using
s3_client.head_object(Bucket=bucket_name, Key=key)
It used to work before, but I am not sure if something changed. The S3 client is created with:
```python
boto3.client(
    service_name='s3',
    use_ssl=False,
    region_name=region,
    endpoint_url=endpoint_url,
    aws_access_key_id=key_id,
    aws_secret_access_key=access_key,
    config=Config(
        s3={'addressing_style': 'path'},
        signature_version='s3v4'))
```
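For context, a minimal sketch of the existence check described above (the bucket and key names are whatever your application passes in; the 404 handling is the usual pattern, not something from the original report):

```python
from botocore.exceptions import ClientError

def object_exists(s3_client, bucket_name, key):
    # head_object raises ClientError with a "404" error code when the key is absent
    try:
        s3_client.head_object(Bucket=bucket_name, Key=key)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise
```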
Hi @MustaphaU, thanks for reaching out. If you limit the script to only be initializing a client (no actual operations), do you still have this behavior? In other words, what is the minimum reproducible code snippet that produces this recursion depth error? Thanks!
Hi @RyanFitzSimmonsAK, just initializing the S3 client in my inference script like below is enough to reproduce the error:
s3_client = boto3.client('s3')
Thank you.
Edit: The error persists. Apologies for the back and forth. Yes, s3_client=boto3.client('s3')
should produce the error. I just tested now and got the error.
@RyanFitzSimmonsAK
Please see the attachment below from the CloudWatch logs:
Also, see the relevant part of the inference script:
You can see from the log that execution failed at the point of initializing the S3 client.
Thanks.
I also have this bug. Here is one message I got while running some tests to fix it:
Hope it helps.
As a workaround, I used the AWS CLI already present in the container:
```python
import subprocess
subprocess.run(["/usr/local/bin/aws", "s3", "cp", "s3://bucket/file", "/local/file"], check=True)
```
I am also getting the same error. It was working fine a few weeks ago.
So,
```python
s3.Bucket(settings.S3_BUCKET).put_object(Key=key, Body=file_data)
```
works, but the following code doesn't. This is a nightmare :)
```python
res = self.s3.put_object(Bucket=settings.S3_BUCKET,
                         Key=key,
                         Body=file_data)
```
The same probably goes for get_object.
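If the resource-based path also works for reads, a minimal sketch of the equivalent fallback for get_object (the bucket and key values below are placeholders, not from the original comment):

```python
import boto3

BUCKET = "my-bucket"       # placeholder for settings.S3_BUCKET
KEY = "path/to/object"     # placeholder key

s3 = boto3.resource('s3')

# Resource-based read mirroring the resource-based put_object above;
# obj.get() returns the same response shape as client.get_object.
obj = s3.Object(BUCKET, KEY)
body = obj.get()['Body'].read()
```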
Given that you're only seeing this behavior in SageMaker inference scripts, it's likely not purely a Boto3 problem. I've reached out to the SageMaker team for more information, and will update this issue whenever I learn more.
Ticket # for internal use : P133939124
Neither I nor the service team were able to reproduce this issue. Could you provide the following information? Are you following an example notebook, or deploying in a VPC? Could you share the inference.py that produces this behavior?
@RyanFitzSimmonsAK Thanks. I am not following an example notebook or deploying in a VPC. I have created a repo with instructions to reproduce the issue here: https://github.com/MustaphaU/rerror
Seeing this issue as well, except when creating clients for Secrets Manager with boto3.
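For reference, a minimal sketch of that case (the boto3 service name for Secrets Manager is "secretsmanager"; per the comment above, client creation alone is enough to hit the same error):

```python
import boto3

# Initializing the Secrets Manager client, with no subsequent API call.
sm_client = boto3.client("secretsmanager")
```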
Hi, just an update. The service team was able to reproduce this behavior, and is working on determining the root cause.
This is fantastic news! Thank you, team! :)
Just for external planning and orientation, is there a rough sense of whether this is a high-priority issue or at some other level? The bug is showing up in one of our critical paths. We have a temporary bypass for it, but would really like to get back to using boto3 fully.
Appreciate the help, and very happy you can reproduce the issue :)
I was facing the same issue when trying to build a SageMaker TensorFlow Serving image. Adding the monkey patch at the very top of python_service.py helped me:
```python
import gevent.monkey
gevent.monkey.patch_all()
```
This was suggested in the Stack Overflow thread here: https://stackoverflow.com/questions/45425236/gunicorn-recursionerror-with-gevent-and-requests-in-python-3-6-2
Thanks for the suggestion. I had tried this fix but it didn't resolve the issue. I mentioned it here on stackoverflow
You don't clarify it, but did you add it to your model inference code, or did you build the SageMaker image with it? It didn't work for me when I tried it in the inference code. It has to happen before any other Python import.
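To illustrate the ordering constraint, a sketch of what "the very top" means in practice (the module name and imports below are illustrative; the point is only the ordering):

```python
# These two lines must execute before anything that imports ssl/socket
# (requests, urllib3, boto3, ...), otherwise the patching comes too late.
import gevent.monkey
gevent.monkey.patch_all()

# Network-facing imports happen only after patching.
import boto3  # noqa: E402

s3_client = boto3.client('s3')
```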
@deepblue-phoenix Would you mind sharing your workaround for this issue?
Does anyone have a solution or a timeline on this?
You could try the suggestions by @shresthapradip or this workaround by @pmaoui if it applies to your case.
Hello, I am having the same issue with a SageMaker custom inference.py script (attached).
I tried both gevent.monkey.patch_all() and gevent.monkey.patch_all(ssl=False), but the issue persists. I hope there will be a solution soon.
My inference.py:
```python
import gevent.monkey
gevent.monkey.patch_all(ssl=False)

import json
import numpy as np
from PIL import Image
import io
import logging
import tempfile
import boto3

# Configure logger
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

s3_client = boto3.client('s3')


def open_image(image_data):
    try:
        return Image.open(io.BytesIO(image_data))  # Supports every type of image extension
    except Exception as e:
        logger.error(f"Error opening image: {str(e)}")
        raise


def read_image_from_s3(s3_uri):
    """Load image file from S3.

    Parameters
    ----------
    s3_uri : string
        S3 URI in the form s3://bucket/key

    Returns
    -------
    np.array
        Image array
    """
    try:
        bucket, key = s3_uri.replace("s3://", "").split("/", 1)
        logger.info(f"Parsed bucket: {bucket}, key: {key}")
        logger.info(f"Reading image from bucket: {bucket}, key: {key}")
        s3 = boto3.resource('s3')
        bucket = s3.Bucket(bucket)
        object = bucket.Object(key)
        response = object.get()
        file_stream = response['Body']
        im = Image.open(file_stream)
        image_array = np.array(im)
        logger.info(f"Successfully read image from S3 bucket: {bucket}, key: {key}")
        return image_array
    except Exception as e:
        logger.error(f"Error reading image from S3 bucket: {bucket}, key: {key}, error: {str(e)}")
        raise


def input_handler(data, context):
    """Pre-process request input before it is sent to the TensorFlow Serving REST API.

    Args:
        data (obj): the request data stream if images, dict or string if text.
        context (Context): an object containing request and configuration details

    Returns:
        (dict): a JSON-serializable dict that contains request body and headers
    """
    try:
        logger.info(f"Request content type: {context.request_content_type}")
        with tempfile.TemporaryDirectory() as temp_dir:
            logger.info(f"Created temporary directory at {temp_dir}")
            if "image" in context.request_content_type:
                payload = data.read()
                image = open_image(payload)
                image_array = np.array(image)
                image_with_batch_dim = np.expand_dims(image_array, axis=0)  # Add batch dimension
                # Input format is the same as TF Serving API: https://www.tensorflow.org/tfx/serving/api_rest
                response_payload = json.dumps({"instances": image_with_batch_dim.tolist()})  # tolist preserves the shape [1, 224, 224, 3]
                return response_payload
            elif "json" in context.request_content_type:
                payload = data.read().decode('utf-8')
                json_data = json.loads(payload)
                # Assuming the structure of json_data is {"s3_uris": ["s3://bucket/key1", "s3://bucket/key2", ...]}
                s3_uris = json_data.get("s3_uris", [])
                logger.info(f"Received S3 URIs: {s3_uris}")
                images = []
                for s3_uri in s3_uris:
                    try:
                        image_array = read_image_from_s3(s3_uri)
                        images.append(image_array)
                    except Exception as e:
                        logger.error(f"Failed to process image from S3 URI {s3_uri}: {str(e)}")
                if not images:
                    raise ValueError("No valid images found in the provided S3 URIs.\n Please, provide a json stream with key 's3_uris' and a list of uris as value.")
                images_with_batch_dim = np.stack(images, axis=0)  # Stack images to create a batch
                response_payload = json.dumps({"instances": images_with_batch_dim.tolist()})
                return response_payload
            content_type = context.request_content_type or "unknown"
            raise ValueError(f'{{"error": "unsupported content type {content_type}"}}')
    except Exception as e:
        logger.error(f"Error in input_handler: {str(e)}")
        raise


def output_handler(data, context):
    """Post-process TensorFlow Serving output before it is returned to the client.

    Args:
        data (obj): the TensorFlow Serving response as described here: https://www.tensorflow.org/tfx/serving/api_rest#response_format_4
        context (Context): an object containing request and configuration details

    Returns:
        (bytes/json, string): data to return to client, response content type
    """
    try:
        if data.status_code != 200:
            raise ValueError(data.content.decode('utf-8'))
        response_content_type = context.accept_header
        prediction = data.content
        return prediction, response_content_type
    except Exception as e:
        logger.error(f"Error in output_handler: {str(e)}")
        raise
```
For those using gevent, there is an issue here being tracked on their side for that: https://github.com/gevent/gevent/issues/1826. This issue appears to be specific to RHEL-based systems. Please note that we do not provide or officially support gevent with our networking setup. Any issues related to gevent will need to be addressed by the gevent team.
I also faced the same issue, but it can be fixed using:
```python
import gevent.monkey
gevent.monkey.patch_all()
```
Thanks, everyone.
This thread was helpful for debugging this issue, so I'm posting my team's context and solution to this problem.
We encountered this issue after updating packages in a Flask application that uses gunicorn to launch gevent workers on Python 3.10.
The issue appears to have been caused by gevent monkey patching occurring too late after the application's Python process started. Gunicorn itself has a built-in warning log for this that looks like:
/usr/local/lib/python3.10/site-packages/gunicorn/workers/ggevent.py:38: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. It may also silently lead to incorrect behaviour on Python 3.7. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016. Modules that had direct imports (NOT patched): ['urllib3.util.ssl_ (/usr/local/lib/python3.10/site-packages/urllib3/util/ssl_.py)'
We'd seen this warning in the past without it causing problems, but with the newly updated packages we ran into this issue when downloading files from S3 using boto3.
There are two ways to fix this. One was to follow the advice in this closed gunicorn GitHub issue and NOT use a gunicorn.py config file, instead passing configs as params to the gunicorn process in our entrypoint script. The solution we ended up going with was to monkey patch gevent at the start of our config script, which we hadn't previously realized ran in the same Python process as the workers.
```python
import gevent.monkey
gevent.monkey.patch_all()
# Monkey patching needs to happen here before anything else.
# Gunicorn automatically monkey patches the worker processes when using gevent workers,
# but the way it does this does not strongly guarantee that the monkey patching will
# happen before this file loads, which can cause issues with core libraries like SSL.
import multiprocessing  # noqa: E402
```
Outside of gunicorn, I think there are two paths to try to debug this:
Path 1) You know you're already using gevent to monkey patch
```
$ python
Python 3.10.13 (main, May 16 2024, 15:17:11) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> dir()
['__annotations__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__']
>>> import multiprocessing
>>> dir()
['__annotations__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', 'multiprocessing']
```
Path 2) You don't think you're using gevent at all.
- Try running the snippet of code below to verify that some 3rd-party application isn't using gevent without your knowledge. If something is, search your libs for whatever is causing the problem and either replace the problematic library, try putting it at the very top of your imports, or run gevent monkey patching yourself before importing *ANYTHING*.
- If you're definitely not using gevent at all, then some other bug entirely is causing this issue with boto3
```python
import logging

from gevent.monkey import is_module_patched

...
# Place this where it makes sense for your application
if is_module_patched("socket"):  # Socket will VERY LIKELY be patched by any lib using gevent
    raise RuntimeError("Gevent was already monkey patched")
else:
    logging.info("Gevent was NOT monkeypatched")
```
Describe the bug
I need help with this recursion error from boto3:
maximum recursion depth exceeded
It occurs when I initialize an S3 client in my inference script to read S3 objects. Your insights will be deeply appreciated! A similar issue was posted on Stack Overflow 2 months ago here: https://stackoverflow.com/questions/77786275/aws-sagemaker-endpoint-maximum-recursion-depth-exceeded-error-when-calling-boto
Here is the relevant code block responsible for the error:
Expected Behavior
The S3 client is created, enabling access to the S3 objects.
Current Behavior
Here is the full error log:
Reproduction Steps
Simply initializing an S3 client within an inference script like so:
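Presumably the same one-liner quoted earlier in the thread:

```python
import boto3

# The bare client initialization that triggers the RecursionError on the endpoint.
s3_client = boto3.client('s3')
```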
Possible Solution
No response
Additional Information/Context
No response
SDK version used
1.34.55
Environment details (OS name and version, etc.)
SageMaker endpoint for TensorFlow Serving