aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Allow support for paths other than s3:// #558

Closed SeanBarry closed 3 years ago

SeanBarry commented 3 years ago

aws-data-wrangler version: 2.4.0 (with no modifications)

As part of my local development and testing, and also my CI development and testing, I'm using LocalStack to mock AWS S3. This lets me simulate putting, listing, and getting objects from S3, for example.

My codebase is a mix of Node.js and Python. The Node.js code that interacts with LocalStack works fine, as I can specify an endpoint when I create the s3 client. This endpoint is an env var, so locally and in CI it points to LocalStack, but in prod/dev clusters it points to the real s3:// endpoint.

Unfortunately, it seems there's no way to override the s3:// path in AWS data-wrangler.

For example, when I call wr.s3.read_parquet with the path pointing to my Localstack s3 bucket, I get the following error:

raise exceptions.InvalidArgumentValue(f"'{path}' is not a valid path. It MUST start with 's3://'")
awswrangler.exceptions.InvalidArgumentValue: 'http://localhost:4566/<redacted>' is not a valid path. It MUST start with 's3://'

I've had a quick check of the source code of data-wrangler to see if there's an override, but haven't found one. The util that throws this error, parse_path(), strictly checks that the path begins with s3:// and doesn't account for any override.
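For illustration, here is a sketch (not awswrangler's actual source) of the strict check described above, alongside a more permissive variant that accepts a configurable set of URI schemes (which would also cover the s3a:// case raised later in this thread). Both function names are hypothetical.

```python
from typing import Tuple

def parse_path_strict(path: str) -> Tuple[str, str]:
    # Mirrors the behaviour described above: anything not starting
    # with s3:// is rejected outright.
    if not path.startswith("s3://"):
        raise ValueError(f"'{path}' is not a valid path. It MUST start with 's3://'")
    bucket, _, key = path[len("s3://"):].partition("/")
    return bucket, key

def parse_path_permissive(path: str, schemes=("s3://", "s3a://")) -> Tuple[str, str]:
    # A possible relaxation: accept any scheme from a configurable allow-list.
    for scheme in schemes:
        if path.startswith(scheme):
            bucket, _, key = path[len(scheme):].partition("/")
            return bucket, key
    raise ValueError(f"'{path}' does not start with any of: {', '.join(schemes)}")
```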

Describe the solution you'd like

It would be incredibly useful if this check either didn't exist or if there were a way to pass an override when creating the data wrangler client. That way I could continue to reliably mock AWS infrastructure locally.

Reproduce

df = wr.s3.read_parquet(
    path="http://localhost:4566/my-bucket/",
    path_suffix="data.parquet",
)

> raise exceptions.InvalidArgumentValue(f"'{path}' is not a valid path. It MUST start with 's3://'")
awswrangler.exceptions.InvalidArgumentValue: 'http://localhost:4566/my-bucket/' is not a valid path. It MUST start with 's3://'
igorborgest commented 3 years ago

Hi @SeanBarry, thanks for reaching out.

Have you tried our support for custom endpoints through global configurations?

Example:

wr.config.s3_endpoint_url = YOUR_ENDPOINT

Or you can define it through environment variables:

export WR_S3_ENDPOINT_URL=YOUR_ENDPOINT

All endpoints available are:

[Screenshot: table of the available *_endpoint_url configuration options]

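The two configuration routes above can be thought of as a simple precedence chain. This is a sketch under assumed semantics, not awswrangler's internal API: an explicit config value wins, then the WR_S3_ENDPOINT_URL environment variable, then None (meaning the default AWS endpoint). The resolver name is hypothetical.

```python
import os
from typing import Optional

def resolve_s3_endpoint(explicit: Optional[str] = None) -> Optional[str]:
    # Explicit configuration takes precedence over the environment
    # variable; None falls back to the default AWS endpoint.
    return explicit or os.environ.get("WR_S3_ENDPOINT_URL") or None
```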

SeanBarry commented 3 years ago

Hi Igor, thanks for the reply. I can confirm that neither of the following options works; the same parse_path util is executed in both cases, and it explicitly checks for the s3:// prefix in the URL:

wr.config.s3_endpoint_url = YOUR_ENDPOINT
export WR_S3_ENDPOINT_URL=YOUR_ENDPOINT
igorborgest commented 3 years ago

The idea would be to use a regular s3 path pattern instead of http://localhost:4566/my-bucket/.

My suggestion is to configure the ENDPOINT with your LocalStack URL and then use your mocked bucket the same way as a normal bucket: s3://my-bucket/.
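If existing code or config stores path-style LocalStack URLs, the suggested pattern can be mechanized with a small helper that rewrites them into the s3:// form before calling the library. This helper is hypothetical, not part of awswrangler:

```python
from urllib.parse import urlparse

def to_s3_uri(localstack_url: str) -> str:
    # Rewrite a path-style LocalStack URL such as
    # http://localhost:4566/my-bucket/data.parquet into s3://bucket/key,
    # so the endpoint override plus a normal s3:// path can be used.
    path = urlparse(localstack_url).path.lstrip("/")
    bucket, _, key = path.partition("/")
    return f"s3://{bucket}/{key}"
```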

igorborgest commented 3 years ago

Closing due to lack of interaction.

Ritish-Madan commented 2 years ago

Hi @igorborgest, I am using a path with the s3a:// scheme, and it still fails due to the explicit check for s3://.

samuelefiorini commented 2 years ago

> Hi @SeanBarry, thanks for reaching out.
>
> Have you tried our support for custom endpoints through global configurations?
>
> Example:
>
> wr.config.s3_endpoint_url = YOUR_ENDPOINT
>
> Or you can define it through environment variables:
>
> export WR_S3_ENDPOINT_URL=YOUR_ENDPOINT
>
> [Screenshot: table of the available *_endpoint_url configuration options]

Hi @igorborgest, it looks like timestream endpoint is not currently supported. Any plans to add it in the near future?

Cheers

samuelefiorini commented 2 years ago

Meanwhile, a dedicated issue has been opened: #1414