fsspec / universal_pathlib

pathlib api extended to use fsspec backends
MIT License

Cannot use UPath on S3 with pandas: PermissionError/Access Denied #241

Open ba1dr opened 3 months ago

ba1dr commented 3 months ago
import pandas as pd
from upath import UPath

AWS_KEY = "AKIAxxxxxxx"
AWS_SECRET = "xxxxxxxxxxxxxxx"

bucket = 'upathtest'
fkey = "folder1/folder2/test1.xlsx"
s3base = UPath(f"s3://{bucket}", key=AWS_KEY, secret=AWS_SECRET)
s3path = s3base / fkey

print(list(s3base.iterdir()))      # THIS WORKS!
with s3path.open('w') as ff:
    ff.write("test1,test2")        # THIS ALSO WORKS!

df = pd.DataFrame()
df.to_excel(s3path)           # !! This fails
Traceback

```
Traceback (most recent call last):
  File "/mypath/venv/lib/python3.11/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/aiobotocore/client.py", line 411, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mypath/try_fss.py", line 57, in
    main()
  File "/mypath/try_fss.py", line 53, in main
    test03()
  File "/mypath/try_fss.py", line 47, in test03
    pd.read_csv(s3path)
  File "/mypath/venv/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/pandas/io/common.py", line 728, in get_handle
    ioargs = _get_filepath_or_buffer(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/pandas/io/common.py", line 443, in _get_filepath_or_buffer
    ).open()
    ^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/fsspec/core.py", line 147, in open
    return self.__enter__()
           ^^^^^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/fsspec/core.py", line 105, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/fsspec/spec.py", line 1303, in open
    f = self._open(
        ^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/s3fs/core.py", line 689, in _open
    return S3File(
           ^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/s3fs/core.py", line 2183, in __init__
    super().__init__(
  File "/mypath/venv/lib/python3.11/site-packages/fsspec/spec.py", line 1742, in __init__
    self.size = self.details["size"]
                ^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/fsspec/spec.py", line 1755, in details
    self._details = self.fs.info(self.path)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/mypath/venv/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/s3fs/core.py", line 1375, in _info
    out = await self._call_s3(
          ^^^^^^^^^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/s3fs/core.py", line 366, in _call_s3
    return await _error_wrapper(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mypath/venv/lib/python3.11/site-packages/s3fs/core.py", line 145, in _error_wrapper
    raise err
PermissionError: Forbidden
```

I tried to use client_kwargs - this does not work either.

aioboto_client_kwargs = {
    'aws_access_key_id': AWS_KEY,
    'aws_secret_access_key': AWS_SECRET,
}
s3base = UPath(f"s3://{bucket}", client_kwargs=aioboto_client_kwargs)
...
# same error

The AWS user has the AmazonS3FullAccess policy attached.

ap-- commented 3 months ago

Thank you for opening the issue.

The implementation of _get_filepath_or_buffer in pandas.io.common essentially converts the provided UPath instance into a string and drops the storage_options. pandas then tries to open the resulting S3 URI without those options, so the credentials passed to UPath are never used.

The reason this happens is that UPath incorrectly pretends to be a local path. This will be fixed when we move to the correct base class, PathBase, which will no longer provide a __fspath__ dunder for non-local paths.
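To illustrate the effect, here is a minimal sketch using a hypothetical stand-in class (`FakeRemotePath`) and a hypothetical `stringify` helper. Neither is UPath or pandas code; they only model what happens when a consumer reduces a path-like object to the string returned by `__fspath__`.

```python
import os


class FakeRemotePath:
    """Hypothetical stand-in for a path object that carries credentials."""

    def __init__(self, url, **storage_options):
        self.url = url
        self.storage_options = storage_options

    def __fspath__(self):
        # Any os.PathLike consumer only ever sees this plain string.
        return self.url


def stringify(path):
    """Simplified model of a consumer that accepts any os.PathLike object."""
    return os.fspath(path) if isinstance(path, os.PathLike) else path


p = FakeRemotePath(
    "s3://upathtest/folder1/folder2/test1.xlsx",
    key="AKIAxxxxxxx",
    secret="xxxxxxxxxxxxxxx",
)
as_string = stringify(p)

print(as_string)                               # only the URI is left
print(hasattr(as_string, "storage_options"))   # the credentials did not travel along
```

Once the object has been reduced to a plain `str`, nothing downstream can recover the `key` and `secret` that were attached to it.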

In the future we could also try to add support for arbitrary PathBase subclasses in pandas. But at least for universal_pathlib the mentioned changes in UPath should happen first.

All that being said, you can either provide the buffer as you've done in the with context directly to .to_excel() or provide the storage_options explicitly as shown here:

import pandas as pd
from upath import UPath

pth = UPath("s3://some-bucket/some-file", key=..., secret=...)

df = pd.DataFrame()
df.to_excel(pth, storage_options=pth.storage_options)   
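If the same call has to accept both plain `pathlib.Path` and `UPath` objects, the explicit `storage_options` can be wrapped in a small helper. This is only a sketch: `storage_kwargs` is a hypothetical name, and the `SimpleNamespace` object below merely stands in for a `UPath` so the snippet runs without S3 access.

```python
from pathlib import Path
from types import SimpleNamespace


def storage_kwargs(path):
    """Return pandas keyword arguments carrying the path's storage options.

    Hypothetical helper: objects without a non-empty ``storage_options``
    attribute (for example a plain pathlib.Path) yield no extra arguments.
    """
    opts = getattr(path, "storage_options", None)
    return {"storage_options": dict(opts)} if opts else {}


local = Path("report.xlsx")
# Stand-in for a UPath; a real UPath exposes storage_options in the same way.
remote = SimpleNamespace(
    storage_options={"key": "AKIAxxxxxxx", "secret": "xxxxxxxxxxxxxxx"}
)

print(storage_kwargs(local))
print(storage_kwargs(remote))

# Intended use, not executed here:
#   df.to_excel(pth, **storage_kwargs(pth))
```

This still requires touching each pandas call once, but the call then reads the same for local and remote paths.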

Let me know if that helps, Andreas

ba1dr commented 3 months ago

Thank you for the answer. However, this does not help much: the idea was simply to replace the Path objects with UPath without changing every call site. I am refactoring a large piece of code and was hoping this would let it work transparently with any path object.

ap-- commented 3 months ago

Given the current implementation in pandas, and the current implementation in universal_pathlib, what you can do to achieve what you're asking for is to not provide credentials explicitly, but set the credentials via any of the supported methods for s3fs described here: https://s3fs.readthedocs.io/en/latest/#credentials
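One of those methods is the standard AWS environment variables, `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`. The following is a rough sketch, assuming the variables are set before the first S3 access; `export_s3_credentials` is a hypothetical helper name, and in practice the variables would more commonly be set outside the Python process.

```python
import os


def export_s3_credentials(key, secret, environ=os.environ):
    """Publish credentials through the standard AWS environment variables.

    Hypothetical helper; `environ` is injectable so it can be tried on a
    plain dict without modifying the real process environment.
    """
    environ["AWS_ACCESS_KEY_ID"] = key
    environ["AWS_SECRET_ACCESS_KEY"] = secret
    return environ


# Demonstrated on a plain dict; call without `environ` to set os.environ.
demo_env = export_s3_credentials("AKIAxxxxxxx", "xxxxxxxxxxxxxxx", environ={})
print(sorted(demo_env))

# Intended use once the real environment is set, not executed here:
#   s3base = UPath("s3://upathtest")      # no key= / secret= needed
#   df.to_excel(s3base / "folder1/folder2/test1.xlsx")
```

Because the credentials no longer live on the path object, it does not matter that pandas only receives the URI string.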

I also recommend subscribing to #193 to be notified once work starts on moving UPath to its correct base class, which will be available in future versions of stdlib pathlib (and is backported in pathlib-abc).