aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Allow support for paths other than s3:// #558

Closed SeanBarry closed 3 years ago

SeanBarry commented 3 years ago

aws-data-wrangler version: 2.4.0 (with no modifications)

As part of my local development and testing, and also my CI development and testing, I'm using LocalStack to mock AWS S3. This lets me simulate putting, listing, and getting objects from S3, for example.

My codebase is a mix of Node.js and Python. The Node.js code that interacts with LocalStack works fine, as I can specify an endpoint when I create the s3 client. This endpoint is an env var, so locally and in CI it points to LocalStack, but in prod/dev clusters it points to the real s3:// endpoint.

Unfortunately, it seems there's no way to override the s3:// path in AWS data-wrangler.

For example, when I call wr.s3.read_parquet with the path pointing to my Localstack s3 bucket, I get the following error:

raise exceptions.InvalidArgumentValue(f"'{path}' is not a valid path. It MUST start with 's3://'")
awswrangler.exceptions.InvalidArgumentValue: 'http://localhost:4566/<redacted>' is not a valid path. It MUST start with 's3://'

I've had a quick check of the source code of data-wrangler to see if there's an override, but haven't found one. The util that throws this error, parse_path(), strictly checks that the path begins with s3:// and doesn't account for any override.
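For illustration, here is a sketch (not awswrangler's actual source) of the strict check described above, alongside a more permissive variant that accepts a configurable set of URI schemes (which would also cover the s3a:// case raised later in this thread). Both function names are hypothetical.

```python
from typing import Tuple

def parse_path_strict(path: str) -> Tuple[str, str]:
    # Mirrors the behaviour described above: anything not starting
    # with s3:// is rejected outright.
    if not path.startswith("s3://"):
        raise ValueError(f"'{path}' is not a valid path. It MUST start with 's3://'")
    bucket, _, key = path[len("s3://"):].partition("/")
    return bucket, key

def parse_path_permissive(path: str, schemes=("s3://", "s3a://")) -> Tuple[str, str]:
    # A possible relaxation: accept any scheme from a configurable allow-list.
    for scheme in schemes:
        if path.startswith(scheme):
            bucket, _, key = path[len(scheme):].partition("/")
            return bucket, key
    raise ValueError(f"'{path}' does not start with any of: {', '.join(schemes)}")
```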

Describe the solution you'd like

It would be incredibly useful if this check either didn't exist or if there were a way to pass an override when creating the data wrangler client. That way I could continue to reliably mock AWS infrastructure locally.

Reproduce

df = wr.s3.read_parquet(
    path="http://localhost:4566/my-bucket/",
    path_suffix="data.parquet",
)

> raise exceptions.InvalidArgumentValue(f"'{path}' is not a valid path. It MUST start with 's3://'")
awswrangler.exceptions.InvalidArgumentValue: 'http://localhost:4566/my-bucket/' is not a valid path. It MUST start with 's3://'
igorborgest commented 3 years ago

Hi @SeanBarry, thanks for reaching out.

Have you tried our support for custom endpoints through global configurations?

Example:

wr.config.s3_endpoint_url = YOUR_ENDPOINT

Or you can define it through environment variables:

export WR_S3_ENDPOINT_URL=YOUR_ENDPOINT

All endpoints available are:

[Screenshot: table of the available *_endpoint_url configuration options]

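The two configuration routes above can be thought of as a simple precedence chain. This is a sketch under assumed semantics, not awswrangler's internal API: an explicit config value wins, then the WR_S3_ENDPOINT_URL environment variable, then None (meaning the default AWS endpoint). The resolver name is hypothetical.

```python
import os
from typing import Optional

def resolve_s3_endpoint(explicit: Optional[str] = None) -> Optional[str]:
    # Explicit configuration takes precedence over the environment
    # variable; None falls back to the default AWS endpoint.
    return explicit or os.environ.get("WR_S3_ENDPOINT_URL") or None
```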

SeanBarry commented 3 years ago

Hi Igor, thanks for the reply. I can confirm that neither of the following options works; the same parse_path util is executed in both cases, and it explicitly checks for the s3:// prefix in the URL:

wr.config.s3_endpoint_url = YOUR_ENDPOINT
export WR_S3_ENDPOINT_URL=YOUR_ENDPOINT
igorborgest commented 3 years ago

The idea would be to use a regular s3 path pattern instead of http://localhost:4566/my-bucket/.

My suggestion is to configure the ENDPOINT with your LocalStack URL and then use your mocked bucket the same way as a normal bucket: s3://my-bucket/.
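If existing code or config stores path-style LocalStack URLs, the suggested pattern can be mechanized with a small helper that rewrites them into the s3:// form before calling the library. This helper is hypothetical, not part of awswrangler:

```python
from urllib.parse import urlparse

def to_s3_uri(localstack_url: str) -> str:
    # Rewrite a path-style LocalStack URL such as
    # http://localhost:4566/my-bucket/data.parquet into s3://bucket/key,
    # so the endpoint override plus a normal s3:// path can be used.
    path = urlparse(localstack_url).path.lstrip("/")
    bucket, _, key = path.partition("/")
    return f"s3://{bucket}/{key}"
```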

igorborgest commented 3 years ago

Closing due to lack of interaction.

Ritish-Madan commented 2 years ago

Hi @igorborgest, I am using a path with the s3a:// scheme, and it still fails due to the explicit check for s3://.

samuelefiorini commented 2 years ago

> Hi @SeanBarry, thanks for reaching out.
>
> Have you tried our support for custom endpoints through global configurations?
>
> Example:
>
> wr.config.s3_endpoint_url = YOUR_ENDPOINT
>
> Or you can define it through environment variables:
>
> export WR_S3_ENDPOINT_URL=YOUR_ENDPOINT
>
> [Screenshot: table of the available *_endpoint_url configuration options]

Hi @igorborgest, it looks like timestream endpoint is not currently supported. Any plans to add it in the near future?

Cheers

samuelefiorini commented 2 years ago

Meanwhile, a dedicated issue has been opened: #1414