coiled / benchmarks

Use scan_parquet over scan_dataset for polars #1394

Closed phofl closed 7 months ago

phofl commented 7 months ago

Yikes...

hendrikmakait commented 7 months ago

When I run this from my machine, the test fails: FAILED tests/tpch/test_polars.py::test_query_1 - polars.exceptions.ComputeError: Generic S3 error: Client error with status 403 Forbidden: No Body.

phofl commented 7 months ago

@ritchie46

We are a bit confused. The following doesn't work for us:

import polars as pl
import boto3

session = boto3.session.Session()
credentials = session.get_credentials()

pl.scan_parquet(
    "s3://coiled-runtime-ci/tpc-h/snappy/scale-1000/lineitem/*.parquet",
    storage_options={
        "aws_access_key_id": credentials.access_key,
        "aws_secret_access_key": credentials.secret_key,
        "aws_region": "us-east-2",
    },
)
Traceback (most recent call last):
  File "/Users/patrick/Library/Application Support/JetBrains/PyCharm2023.3/scratches/dask_expr_scratch.py", line 186, in <module>
    pl.scan_parquet(
  File "/Users/patrick/mambaforge/envs/dask-expr/lib/python3.12/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/patrick/mambaforge/envs/dask-expr/lib/python3.12/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/patrick/mambaforge/envs/dask-expr/lib/python3.12/site-packages/polars/io/parquet/functions.py", line 311, in scan_parquet
    return pl.LazyFrame._scan_parquet(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/patrick/mambaforge/envs/dask-expr/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 464, in _scan_parquet
    self._ldf = PyLazyFrame.new_from_parquet(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: Generic S3 error: Error performing list request: Client error with status 403 Forbidden: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message><AWSAccessKeyId>***</AWSAccessKeyId><RequestId>NZEXF3ABC6078JG2</RequestId><HostId>***</HostId></Error>

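2 problems:

1. The Polars error message prints the access key and secret (I replaced it with * here), that's not great from a security perspective
2. Am I doing anything wrong, e.g. missing a variable in storage options or something similar?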

ritchie46 commented 7 months ago

> The Polars error message prints the access key and secret (I replaced it with * here), that's not great from a security perspective

Hmm.. No, it isn't. I'll see if this can be fixed upstream in object_store (which is what we use for S3 access).

That seems strange. It must be the credentials though. I can access private s3 repos.

These are the config keys we support: https://docs.rs/object_store/0.9.0/object_store/aws/enum.AmazonS3ConfigKey.html

Could you also set POLARS_VERBOSE=1? That might show a bit more.
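For reference, a minimal sketch of one way to turn this on from Python rather than from the shell. It assumes the variable is set before the query runs, so the verbose output covers the scan:

```python
import os

# Equivalent to launching the script with POLARS_VERBOSE=1 in the shell.
# Set it before the scan is built and executed.
os.environ["POLARS_VERBOSE"] = "1"
```

With Polars already imported, `pl.Config.set_verbose(True)` should have the same effect.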

ntabris commented 7 months ago

@phofl do you need to pass the AWS session token as well? (If you're using your standard Coiled employee AWS creds, I think it's likely you do.)

phofl commented 7 months ago

Yep adding session token solved this problem, thx!
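A minimal sketch of what the fix probably looks like, assuming the token is passed under the `aws_session_token` key (one of the object_store config keys linked above) and read from the `token` attribute of the boto3 credentials. The helper names `build_storage_options` and `scan_lineitem` are made up for illustration and are not part of the benchmarks repo:

```python
from types import SimpleNamespace


def build_storage_options(credentials, region):
    """Map boto3-style credentials onto the storage_options keys used above.

    `credentials` is any object with `access_key`, `secret_key`, and `token`
    attributes, such as the result of `session.get_credentials()`.
    """
    options = {
        "aws_access_key_id": credentials.access_key,
        "aws_secret_access_key": credentials.secret_key,
        "aws_region": region,
    }
    # Temporary credentials carry a session token. Long-lived keys have
    # token=None, in which case the key is left out.
    if credentials.token:
        options["aws_session_token"] = credentials.token
    return options


def scan_lineitem():
    """Not called here: needs boto3, polars, and access to the bucket."""
    import boto3
    import polars as pl

    credentials = boto3.session.Session().get_credentials()
    return pl.scan_parquet(
        "s3://coiled-runtime-ci/tpc-h/snappy/scale-1000/lineitem/*.parquet",
        storage_options=build_storage_options(credentials, "us-east-2"),
    )


# Offline check with placeholder values (not real credentials).
placeholder = SimpleNamespace(
    access_key="EXAMPLE_KEY_ID", secret_key="EXAMPLE_SECRET", token="EXAMPLE_TOKEN"
)
print(sorted(build_storage_options(placeholder, "us-east-2")))
```

Leaving the token out when it is `None` keeps the same helper usable with long-lived access keys, which have no session token.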

Sorry for the noise @ritchie46, I can now access the files, so it seems to work.

Then there is only the secret issue, but that should be covered by the issue that you've opened, thx for that

ritchie46 commented 7 months ago

Yeah, looking into that. I believe the client id isn't really secret.