apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 6, Couldn't resolve host name #40539

Open anjali-chadha opened 8 months ago

anjali-chadha commented 8 months ago

Describe the bug, including details regarding any error messages, version, and platform.

Hi there!

We are using the PyArrow library to read files from an S3 bucket, and we're encountering an intermittent error:

OSError: When reading information for key '<REDACTED>' in bucket '<REDACTED>': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 6, Couldn't resolve host name

Please note that this error doesn't occur consistently, and the S3 bucket path is valid.

The reference code we're using is as follows:

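import pyarrow as pa
import pyarrow.fs  # import the submodule explicitly so that pa.fs is defined
import pyarrow.json as pj

uri = "s3://my-bucket/my-prefix/foo.json"

# Infer the filesystem and the path within it from the URI.
fs, path = pa.fs.FileSystem.from_uri(uri)

with fs.open_input_file(path) as f:
    tbl = pj.read_json(f)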

Error Details:

2024-03-12T13:06:05.616-07:00    [7]: with fs.open_input_file(path) as f:
2024-03-12T13:06:05.616-07:00    [7]: File "pyarrow/_fs.pyx", line 780, in pyarrow._fs.FileSystem.open_input_file
2024-03-12T13:06:05.616-07:00    [7]: File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
2024-03-12T13:06:05.616-07:00    [7]: File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
2024-03-12T13:06:05.616-07:00    [7]: OSError: When reading information for key '<REDACTED>' in bucket '<REDACTED>': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 6, Couldn't resolve host name

Could you please provide any suggestions on how to handle such intermittent network connectivity errors while reading from S3?

Component(s)

Parquet, Python

pitrou commented 8 months ago

Are you using AWS or do you set endpoint_override? "Couldn't resolve host name" probably means you're having issues with DNS resolution of the S3 server(s) hostnames, which is quite unexpected with AWS...
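To check whether DNS resolution is flaky on the affected host, independently of PyArrow, something like the following could be run from the same machine or container. This is only a rough sketch: the helper names are made up for illustration, the endpoint pattern assumes a standard AWS commercial region, and Python's resolver path may differ from the one libcurl uses inside the AWS SDK.

```python
import socket


def s3_regional_endpoint(region):
    """Build the usual regional S3 hostname (assumes a standard AWS commercial region)."""
    return f"s3.{region}.amazonaws.com"


def resolves(hostname, port=443):
    """Return True if the local resolver can map hostname to at least one address."""
    try:
        socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror:
        # Name resolution failed; this is the class of failure that curl
        # reports as error code 6 (CURLE_COULDNT_RESOLVE_HOST).
        return False


def count_resolution_failures(hostname, tries=20):
    """Resolve hostname repeatedly and count failures to spot intermittent DNS issues."""
    return sum(0 if resolves(hostname) else 1 for _ in range(tries))
```

Calling `count_resolution_failures(s3_regional_endpoint("us-east-1"))` in a loop over a longer period (the region here is just an example) may show whether the resolver occasionally fails. A non-zero count would point at the host's DNS setup rather than at Arrow.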

anjali-chadha commented 8 months ago

@pitrou Yes, we are using AWS, and we are not setting endpoint_override.

> you're having issues with DNS resolution of the S3 server(s) hostnames, which is quite unexpected with AWS...

We've only encountered this problem once across our numerous runs, so it's not a frequent occurrence. Currently, our approach is to increase the number of S3 retry attempts from the default of 3 to a higher value.

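from pyarrow.fs import AwsStandardS3RetryStrategy, S3FileSystem

# Rebuild the S3 filesystem with more retry attempts than the default of 3.
if isinstance(fs, S3FileSystem):
    fs = S3FileSystem(
        region=fs.region,
        retry_strategy=AwsStandardS3RetryStrategy(max_attempts=6),
    )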

However, we're uncertain if this is the most effective approach.

Do you have any recommendations on how we can better handle this on the client side?

pitrou commented 8 months ago

Sorry, I don't have any recommendation. If increasing max_attempts works, then it seems ok to me.
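If raising max_attempts alone turns out not to be enough, one possible complement is an application-level retry with exponential backoff around the whole read. The sketch below is not a PyArrow API: the helper name, its parameters, and the choice to match on the "NETWORK_CONNECTION" text in the error message are all assumptions made for illustration.

```python
import random
import time


def retry_on_network_error(func, *, attempts=5, base_delay=0.5, max_delay=8.0,
                           marker="NETWORK_CONNECTION", sleep=time.sleep):
    """Call func() and retry when it raises an OSError whose message contains marker.

    Hypothetical helper: names and defaults are illustrative, not part of PyArrow.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError as exc:
            # Re-raise unrelated errors immediately, and give up after the last attempt.
            if marker not in str(exc) or attempt == attempts:
                raise
            # Exponential backoff with full jitter, capped at max_delay seconds.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Hypothetical usage with the snippet from this issue, where read_table is a
# user-defined function that opens the file and calls pj.read_json itself,
# so that each retry starts from a freshly opened file:
#     tbl = retry_on_network_error(lambda: read_table(fs, path))
```

Matching on the message text is brittle, since the wording could change between Arrow or AWS SDK versions, so this is best treated as a stopgap while the underlying DNS problem is investigated.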

antonioalegria commented 2 weeks ago

I'm facing this issue when doing a scan_iceberg operation in Polars. It only happens with certain objects:

When reading information for key '' in bucket '': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 6, Couldn't resolve host name
Traceback (most recent call last):
  File "", line 351, in scan_table
    return pl.scan_iceberg(location)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/polars/io/iceberg.py", line 143, in scan_iceberg
    source = StaticTable.from_metadata(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/pyiceberg/table/__init__.py", line 1693, in from_metadata
    metadata = FromInputFile.table_metadata(file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/pyiceberg/serializers.py", line 113, in table_metadata
    with input_file.open() as input_stream:
         ^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py", line 270, in open
    input_file = self._filesystem.open_input_file(self._path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 789, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When reading information for key '' in bucket '': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 6, Couldn't resolve host name