maubarsom opened 11 months ago
Update: I asked a colleague to run this on Linux with pyarrow 13.0.0; the same error occurs under the same conditions.
As a workaround, wrapping the `read_parquet` call with `fsspec` works:
```python
import io

import fsspec
import pandas as pd

# Keep a reference to the original implementation before monkeypatching.
_native_read_parquet = pd.read_parquet

def read_parquet(f, *args, **kwargs):
    # File-like objects can go straight to the native reader.
    if isinstance(f, io.BytesIO):
        return _native_read_parquet(f, *args, **kwargs)
    # Resolve the filesystem through fsspec and pass it explicitly,
    # bypassing pyarrow's own credential resolution.
    kwargs.pop("filesystem", None)
    fs = fsspec.open(f).fs
    return _native_read_parquet(f, *args, filesystem=fs, **kwargs)

pd.read_parquet = read_parquet
```
but it is probably slower and more memory-hungry.
Hi! Thanks for the reply. Maybe it wasn't clear from my description above, but for pandas I did find a workaround, which is to supply `storage_options={"anon": False}` to the `pandas.read_parquet()` call (which I took from the `s3fs` documentation, by the way). So something like
```python
import pandas as pd

df = pd.read_parquet(
    "s3://my-bucket/path/to/my_file.parquet",
    storage_options={"anon": False},
)
```
was enough for it to pick up the credentials successfully :). I'd guess this workaround performs the same as a call without the parameter.
I reproduced the same bug on pyarrow versions 12.0.0 and above, all the way to 15.0.2. I'm on macOS Sonoma 14.4 with an Apple M1 Max chip. Rolling pyarrow back to version 11.0.0 fixes it for me, as does the workaround suggested by @maubarsom.
Describe the bug, including details regarding any error messages, version, and platform.
Bug reproduced in pyarrow versions 10.0.1, 12.0.0 and 13.0.0 on macOS Ventura 13.6, Apple M1 Pro. Did not test 11.0.0.
Apparently, the bug was not present in 9.0.0, based on the behaviour of an old environment, but I can't install that version from pip anymore to confirm.
Description
The error was detected in `pandas` originally, but traced to `pyarrow`, as described in the screenshot. Basically, if I try to read an existing file from S3 when my credentials are stored in the `~/.aws/credentials` and config files, pyarrow returns the error shown in the traceback below.
Expected result: the file is successfully read.
Note: This error DOES NOT occur if the credentials are set as environment variables (instead of being read from `~/.aws/credentials`). If they are set as environment variables, pyarrow successfully reads the parquet file.
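Since the environment-variable path works, one stopgap is to load the profile from `~/.aws/credentials` into environment variables before pyarrow touches S3. This is my own sketch, not part of the original report: `export_aws_credentials` is a hypothetical helper, and the stdlib `configparser` handles the INI-style file:

```python
import configparser
import os
from pathlib import Path

def export_aws_credentials(profile="default",
                           path=Path.home() / ".aws" / "credentials"):
    """Copy credential keys from the AWS credentials file into os.environ.

    configparser lowercases option names by default, so both
    aws_access_key_id and AWS_ACCESS_KEY_ID spellings in the file match.
    Missing files or keys are silently skipped.
    """
    cfg = configparser.ConfigParser()
    cfg.read(path)
    for key in ("aws_access_key_id",
                "aws_secret_access_key",
                "aws_session_token"):
        if cfg.has_option(profile, key):
            os.environ[key.upper()] = cfg[profile][key]

# Call this before reading from S3 so pyarrow sees the env variables.
```

Calling this early in the program reproduces the "credentials as env variables" condition under which the read succeeds.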
Note 2: As shown in the screenshot, I managed to circumvent the issue in pandas by passing `storage_options={"anon": False}` explicitly. However, trying a similar approach in `pyarrow`, by explicitly setting `filesystem=S3FileSystem(anonymous=False)`, did not succeed and resulted in the same error.
Additional info
The `~/.aws/credentials` file is composed of three keys: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_SESSION_TOKEN`.
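For reference, AWS tooling normally writes these as lowercase keys under a profile header; a typical layout (all values are placeholders) looks like:

```ini
[default]
aws_access_key_id = AKIAEXAMPLEKEYID
aws_secret_access_key = exampleSecretAccessKey
aws_session_token = exampleTemporarySessionToken
```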
Screenshot
The traceback:
Component(s)
Parquet, Python