aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Timeout when using add_parquet_partitions() #2899

Closed nelyanne-v closed 1 month ago

nelyanne-v commented 1 month ago

Describe the bug

Hi all,

I have an AWS Lambda function that calls the add_parquet_partitions() function to add new partitions to tables in my Glue catalog on a daily basis. I am currently trying to upgrade the Lambda from Python 3.8 + awswrangler 1.6 to Python 3.12 + awswrangler 3.9.0. I was able to test the upgraded Lambda with a local invocation, and it worked without issues.

However, after I deployed the Lambda on AWS, it always gets stuck while adding a partition to a table. It normally took ~70s to process all my tables, but now even 15 minutes is not enough. I'm not getting an explicit error message, and I can't capture one because of the Lambda maximum execution time limit. I observed that this happens when adding a partition to the 3rd table in a loop, but it doesn't seem to be an issue with one particular table. For example, if the first run fails on table 3 and I force the second run to start from table 3, it will initially succeed for tables 3 & 4, but then fail on table 5.

I'm calling the function using only the required arguments. I checked the input parameters and I can't see anything wrong with the values.

wr.catalog.add_parquet_partitions(
    database=event['database'],
    table=table['Name'],
    partitions_values=partitions,
)

I already have a Lambda layer dedicated to awswrangler 3.9.0, and it's used in my other upgraded Lambdas without issues. Nothing else has changed about my deployment process. The Lambda doesn't have any other dependencies (only awswrangler).

Any idea how I can investigate it further? I'd be grateful for any pointers.

How to Reproduce

  1. Write a loop that iterates through the tables in the AWS Glue Data Catalog.
  2. For each table, try to add a partition for today's date using add_parquet_partitions().
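For reference, the repro steps above could be sketched roughly like this. The `get_tables` iteration, the bucket path, and the single `dt` partition column are assumptions for illustration, not taken from the report:

```python
import datetime


def today_partition_values() -> list[str]:
    # Partition value for today's date, e.g. ["2024-06-01"].
    # A single "dt=YYYY-MM-DD" partition column is an assumption.
    return [datetime.date.today().isoformat()]


def add_todays_partitions(database: str) -> None:
    # Deferred import so the helper above stays stdlib-only.
    import awswrangler as wr

    values = today_partition_values()
    for table in wr.catalog.get_tables(database=database):
        # add_parquet_partitions expects {s3_prefix: [partition values]};
        # the bucket/prefix layout here is a guess.
        partitions = {
            f"s3://my-bucket/{table['Name']}/dt={values[0]}/": values
        }
        wr.catalog.add_parquet_partitions(
            database=database,
            table=table["Name"],
            partitions_values=partitions,
        )
```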

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

AWS Lambda, x86_64

Python version

3.12.4

AWS SDK for pandas version

3.9.0

Additional context

No response

jaidisido commented 1 month ago

I already have a Lambda layer dedicated for awsdatawrangler 3.9.0

Just to clarify, does this mean you have created your own layer or are you using the one we publish?

The only pointer that springs to mind for me is to increase the memory of the Lambda function. Newer versions require more memory due to their larger dependency size.

nelyanne-v commented 1 month ago

Sorry, that wasn't clear: I've been using the published layer.

I managed to fix the issue by doing these 2 things, maybe it will help someone:

* I'm creating the boto3 session inside the handler function:

def handler(event, context): ...

* I'm passing it as the value of the `boto3_session` parameter:

wr.catalog.add_parquet_partitions(
    database=event['database'],
    table=table['Name'],
    partitions_values=partitions,
    boto3_session=session,
)
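Putting the two fixes together, a minimal sketch of the repaired handler might look like this; the `todays_partitions` helper, bucket name, and S3 path layout are invented for illustration, and only the handler-scoped session plus the explicit `boto3_session` argument come from the comment above:

```python
import datetime


def todays_partitions(table_name: str) -> dict[str, list[str]]:
    # Hypothetical helper: maps an S3 prefix to its partition values.
    # The bucket name and "dt=" layout are assumptions.
    d = datetime.date.today().isoformat()
    return {f"s3://my-bucket/{table_name}/dt={d}/": [d]}


def handler(event, context):
    # Fix 1: create the boto3 session inside the handler,
    # not at module import time.
    import boto3
    import awswrangler as wr

    session = boto3.session.Session()
    for table in wr.catalog.get_tables(
        database=event["database"], boto3_session=session
    ):
        # Fix 2: pass the session explicitly via boto3_session.
        wr.catalog.add_parquet_partitions(
            database=event["database"],
            table=table["Name"],
            partitions_values=todays_partitions(table["Name"]),
            boto3_session=session,
        )
```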