aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Intermittent NETWORK_CONNECTION Error During s3.read_parquet_table Operation #2847

Open DimitarSirakov opened 3 weeks ago

DimitarSirakov commented 3 weeks ago

Describe the bug

Hi,

I'm encountering an intermittent issue when using the s3.read_parquet_table function in my ETL pipeline. The pipeline (built on modin, ray, and awswrangler) reads Parquet files from S3 every 5 minutes. Occasionally, I receive the following error:

AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

How to Reproduce

I am unable to reproduce this error consistently, and it seems to resolve itself after some time.

import awswrangler as wr

df = wr.s3.read_parquet_table(table=table, database=database, partition_filter=partition_filter, filename_suffix=filename_suffix)

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.10.13

AWS SDK for pandas version

3.7.2

Additional context

No response

jaidisido commented 3 weeks ago

There is a long-standing open issue on this subject in the Ray repository: https://github.com/ray-project/ray/issues/43803
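
Since the error is intermittent and clears up on its own, one possible stopgap is to retry the read with exponential backoff. The following is only a sketch, not a confirmed fix or anything provided by awswrangler or Ray: the helper name retry_on_network_error and its parameters are hypothetical, and matching on the "NETWORK_CONNECTION" text in the exception message is an assumption based on the error quoted above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_network_error(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    marker: str = "NETWORK_CONNECTION",
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying only when the error message contains `marker`.

    Hypothetical helper: the marker string is an assumption taken from the
    error text reported in this issue.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Re-raise unrelated errors, and give up after the last attempt.
            if marker not in str(exc) or attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Usage (assumes awswrangler is installed and table/database/partition_filter
# are defined as in the report above):
#
# import awswrangler as wr
# df = retry_on_network_error(
#     lambda: wr.s3.read_parquet_table(
#         table=table, database=database, partition_filter=partition_filter
#     )
# )
```

The read is passed in as a zero-argument callable so that the retry logic stays independent of awswrangler. Errors whose message does not contain the marker are re-raised immediately, so genuine failures are not masked.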