apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials #37888

Open maubarsom opened 11 months ago

maubarsom commented 11 months ago

Describe the bug, including details regarding any error messages, version, and platform.

Bug reproduced in pyarrow versions 10.0.1, 12.0.0, and 13.0.0 on macOS Ventura 13.6 (Apple M1 Pro). I did not test 11.0.0.

Apparently, the bug was not present in 9.0.0, based on the behaviour of an old environment, but I can no longer install that version from pip to confirm.

Description

The error was originally detected in pandas, but traced to pyarrow, as shown in the screenshot. Basically, if I try to read an existing file from S3 when my credentials are stored in the ~/.aws/credentials and ~/.aws/config files, pyarrow returns the following error:

OSError: When getting information for key 'XXX/YYY.parquet' in bucket 'ZZZZZ': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

Expected result: the file is successfully read.
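For reference, a minimal reproduction sketch (the bucket and key below are placeholders):

import pyarrow.parquet as pq

# With credentials present only in ~/.aws/credentials, this raises the
# OSError quoted above instead of returning the table.
table = pq.read_table("s3://my-bucket/path/to/my_file.parquet")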

Note: This error DOES NOT occur if the credentials are set as environment variables (instead of being read from ~/.aws/credentials). If they are set as env variables, pyarrow successfully reads the parquet file.

Note 2: As shown in the screenshot, I managed to circumvent the issue in pandas by passing storage_options={"anon": False} explicitly. However, trying a similar approach in pyarrow, by explicitly setting filesystem=S3FileSystem(anonymous=False), did not succeed and resulted in the same error.
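In case it helps, a sketch of an alternative workaround to try (untested here): resolve the credentials with boto3, which does read ~/.aws/credentials, and pass them to S3FileSystem explicitly. The bucket, key, and region below are placeholders.

import boto3
import pyarrow.parquet as pq
from pyarrow.fs import S3FileSystem

# boto3 resolves the default profile from ~/.aws/credentials, so the keys
# can be handed to S3FileSystem directly instead of relying on pyarrow's
# own credential chain.
creds = boto3.Session().get_credentials().get_frozen_credentials()
fs = S3FileSystem(
    access_key=creds.access_key,
    secret_key=creds.secret_key,
    session_token=creds.token,
    region="eu-west-1",  # placeholder: set to the bucket's region
)
# With an explicit filesystem, the path is given without the s3:// scheme.
table = pq.read_table("my-bucket/path/to/my_file.parquet", filesystem=fs)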

Additional info

The ~/.aws/credentials file contains three keys:

AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN
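For reference, the file follows the standard AWS credentials layout, with placeholder values (the credentials file conventionally uses lowercase key names):

[default]
aws_access_key_id = ...
aws_secret_access_key = ...
aws_session_token = ...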

Screenshot

[screenshot: pyarrow_bug_report]

The traceback:

File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/parquet/core.py:2939, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
   2932     raise ValueError( 
   2933         "The 'metadata' keyword is no longer supported with the new "
   2934         "datasets-based implementation. Specify "
   2935         "'use_legacy_dataset=True' to temporarily recover the old "
   2936         "behaviour."
   2937     )
   2938 try:
-> 2939     dataset = _ParquetDatasetV2(
   2940         source,
   2941         schema=schema,
   2942         filesystem=filesystem,
   2943         partitioning=partitioning,
   2944         memory_map=memory_map,
   2945         read_dictionary=read_dictionary,
   2946         buffer_size=buffer_size,
   2947         filters=filters,
   2948         ignore_prefixes=ignore_prefixes,
   2949         pre_buffer=pre_buffer,
   2950         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   2951         thrift_string_size_limit=thrift_string_size_limit,
   2952         thrift_container_size_limit=thrift_container_size_limit,
   2953     )
   2954 except ImportError:
   2955     # fall back on ParquetFile for simple cases when pyarrow.dataset
   2956     # module is not available
   2957     if filters is not None:

File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/parquet/core.py:2465, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
   2463     except ValueError:
   2464         filesystem = LocalFileSystem(use_mmap=memory_map)
-> 2465 finfo = filesystem.get_file_info(path_or_paths)
   2466 if finfo.is_file:
   2467     single_file = path_or_paths

File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/_fs.pyx:571, in pyarrow._fs.FileSystem.get_file_info()

File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()

Component(s)

Parquet, Python

maubarsom commented 11 months ago

Update: I asked a colleague to run this on Linux with 13.0.0; the same error occurs under the same conditions.

rdbisme commented 10 months ago

As a workaround, wrapping the read_parquet call with fsspec works:

import fsspec
import pandas as pd
from io import BytesIO

# Keep a reference to the original implementation before monkeypatching.
_native_read_parquet = pd.read_parquet

def read_parquet(f, *args, **kwargs):
    # In-memory buffers need no filesystem handling.
    if isinstance(f, BytesIO):
        return _native_read_parquet(f, *args, **kwargs)

    # Let fsspec resolve the filesystem (and credentials) from the URL
    # instead of going through pyarrow's own S3 credential chain.
    kwargs.pop("filesystem", None)
    fs = fsspec.open(f).fs
    return _native_read_parquet(f, *args, filesystem=fs, **kwargs)

pd.read_parquet = read_parquet

but it's probably slower and more memory hungry.

maubarsom commented 10 months ago

Hi! Thanks for the reply. Maybe it wasn't so clear from my description above, but for pandas I did find a workaround, which is to supply storage_options={"anon": False} to the pandas.read_parquet() call (taken from the s3fs documentation, by the way). So something like

import pandas as pd

df = pd.read_parquet("s3://my-bucket/path/to/my_file.parquet", storage_options={"anon":False})

was enough for it to use the credentials successfully :). I'm guessing this workaround performs the same as a call without the parameter.

afonso-stuart commented 4 months ago

I reproduced the same bug on pyarrow versions 12.0.0 through 15.0.2. I'm on macOS Sonoma 14.4 with an Apple M1 Max chip. Rolling pyarrow back to version 11.0.0 fixes it for me, as does the solution suggested by @maubarsom.