apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] unexpected URL encoded path (white spaces) when uploading to S3 #34905

Open svenatarms opened 1 year ago

svenatarms commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

Environment

OS: Windows/Linux
Python: 3.10.10
s3fs: 2022.7.1 to 2023.3.0 (doesn't matter)
S3 Backend: MinIO / Ceph (doesn't matter)

Description

Version 11.0.0 of pyarrow introduced an unexpected behavior when uploading Parquet files to an S3 bucket (using s3fs.S3FileSystem) if the path to the Parquet file contains white space: white space gets replaced by the URL-encoded sequence %20. For example, a directory name like

product=My Fancy Product

becomes

product=My%20Fancy%20Product

on the S3 filesystem. NOTICE: the equal sign = is URL encoded in the request, but does not become %3D on the S3 filesystem. That means the URL-encoded equal sign = seems to be interpreted correctly.
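
As far as I can tell the encoding is not specific to s3fs; here is a minimal sketch that should show the same partition-directory naming on a local filesystem (the path and column names are made up for illustration):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"product": ["My Fancy Product"], "value": [1]})
arrow_table = pa.Table.from_pandas(df)

# With pyarrow 10.0.1 this creates a directory named 'product=My Fancy Product';
# with pyarrow >= 11.0.0 the directory shows up as 'product=My%20Fancy%20Product'.
pq.write_to_dataset(arrow_table, "/tmp/partition_encoding_demo", partition_cols=["product"])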

Example Code

# required imports (InvalidDataFrame, S3ConnectionError and delete_if_exists
# are project-specific helpers and are not shown here)
import aiohttp
import s3fs
import pyarrow.parquet as pq
from pyarrow import ArrowTypeError, Table


# s3fs FileSystem Object
def return_s3filesystem(url, user, pw):
    fs = s3fs.S3FileSystem(
        anon=False,
        use_ssl=True,
        client_kwargs={
            "endpoint_url": url,
            "aws_access_key_id": user,
            "aws_secret_access_key": pw,
            "verify": False,
        }
    )
    return fs


def write_df_to_s3(df, partition_cols, path_to_s3_object, url, user, pw, more_than_one_date_per_file,
                   delete_parquet_files):
    '''
    write Parquet File from Pandas DataFrame to S3 Bucket
    '''

    # instantiate s3fs.S3FileSystem object
    fs = return_s3filesystem(url, user, pw)
    # if the parquet file already exists, delete it if requested, to prevent duplicated data
    delete_if_exists(fs, path_to_s3_object, df, more_than_one_date_per_file, delete_existing_files=delete_parquet_files)
    try:
        # create Arrow Table from DataFrame
        arrow_table = Table.from_pandas(df)
    except ArrowTypeError as e:
        # this is Error No. 1626701451158
        raise InvalidDataFrame(errorno=1626701451158, dataframe=df, arrowexception=e)
    except TypeError as e:
        raise InvalidDataFrame(errorno=1627657641211, dataframe=df, arrowexception=e)
    try:
        # write Parquet File to S3 Bucket, using S3FileSystem object 'fs' from above. Create directories by partition_cols
        pq.write_to_dataset(arrow_table,
                            path_to_s3_object,
                            partition_cols=partition_cols,
                            filesystem=fs,
                            use_dictionary=False,
                            data_page_size=100000,
                            compression="snappy",
                            version="2.0")
    except ArrowTypeError as e:
        raise InvalidDataFrame(errorno=1627575189, dataframe=df, arrowexception=e)
    except aiohttp.client_exceptions.ClientConnectionError as e:
        raise S3ConnectionError(errorno=1627575130, exmsg=e)

Example Result

Expected Result (using pyarrow 10.0.1)

(screenshot)

Debug output
botocore.endpoint - DEBUG - Sending http request: <AWSPreparedRequest stream_output=False, method=POST, url=http://localhost:9000/my-products/product%3DMy%20Fancy%20Product/date%3D2023-01-05/0d5d1f2c5032472dbad1d17c845d5432-0.parquet?uploadId=NDBhYjllZDEtNWIxOC00ZTBlLWI4ODYtOGRhZjBhNTg3NzQ5LjYxNTFhMDBlLTQxMmQtNDQ5Ni05YjBjLTBiMGM3ODI3MzhkMg, headers={'User-Agent': b'Botocore/1.27.59 Python/3.10.10 Windows/10', 'X-Amz-Date': b'20230405T073129Z', 'X-Amz-Content-SHA256': b'41dccb632a0540f4f83eaf7138f97c5dd63c09410cbc3aa3412963b2f7006f18', 'Authorization': b'AWS4-HMAC-SHA256 Credential=******/*******/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=7d89128993d6a226d3ac4fa3e6adbb60f638f28c37265446284ca6d629c837f8', 'amz-sdk-invocation-id': b'832bc91b-c285-4413-ad3d-546a3bcefb59', 'amz-sdk-request': b'attempt=1', 'Content-Length': '357'}>
botocore.parsers - DEBUG - Response headers: HTTPHeaderDict({'accept-ranges': 'bytes', 'cache-control': 'no-cache', 'content-length': '471', 'content-security-policy': 'block-all-mixed-content', 'content-type': 'application/xml', 'etag': '"caca775951f07ca64f530aae539fe5cd-3"', 'server': 'MinIO', 'strict-transport-security': 'max-age=31536000; includeSubDomains', 'vary': 'Accept-Encoding', 'x-accel-buffering': 'no', 'x-amz-id-2': 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'x-amz-request-id': '1752F9747004F3E5', 'x-content-type-options': 'nosniff', 'x-xss-protection': '1; mode=block', 'date': 'Wed, 05 Apr 2023 07:31:29 GMT'})
botocore.parsers - DEBUG - Response body:
b'<?xml version="1.0" encoding="UTF-8"?>\n<CompleteMultipartUploadResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Location>http://localhost:9000/my-products/product=My%20Fancy%20Product/date=2023-01-05/0d5d1f2c5032472dbad1d17c845d5432-0.parquet</Location><Bucket>my-products</Bucket><Key>product=My Fancy Product/date=2023-01-05/0d5d1f2c5032472dbad1d17c845d5432-0.parquet</Key><ETag>&#34;caca775951f07ca64f530aae539fe5cd-3&#34;</ETag></CompleteMultipartUploadResult>'
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <function check_for_200_error at 0x0000023C49ACA3B0>
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <aiobotocore.retryhandler.AioRetryHandler object at 0x0000023C4D8B64D0>
botocore.retryhandler - DEBUG - No retry needed.
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <bound method AioS3RegionRedirector.redirect_from_error of <aiobotocore.utils.AioS3RegionRedirector object at 0x0000023C4D8B6590>>

Actual result (using pyarrow 11.0.0)

(screenshot)

Debug output
botocore.endpoint - DEBUG - Sending http request: <AWSPreparedRequest stream_output=False, method=POST, url=http://localhost:9000/my-products/product%3DMy%2520Fancy%2520Product/date%3D2023-01-10/a724b93c251a486b897eb7b151c622bd-0.parquet?uploadId=NDBhYjllZDEtNWIxOC00ZTBlLWI4ODYtOGRhZjBhNTg3NzQ5LjNlOGIyZmI4LWM4ZDEtNDU0ZS1iNjA0LWMxZjczNTI1NjhmZQ, headers={'User-Agent': b'Botocore/1.27.59 Python/3.10.10 Windows/10', 'X-Amz-Date': b'20230405T073854Z', 'X-Amz-Content-SHA256': b'316db9078636bc3acba7fc81ff32a5704c08a104bfaea7b5e15bf35db799e260', 'Authorization': b'AWS4-HMAC-SHA256 Credential=*****/*****/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=0721f578ded50c01c1a64c05d62c628fb35f0e9385ffd3ecfa45423940995a63', 'amz-sdk-invocation-id': b'5b5bc340-7f6b-48cc-bf2a-0860f8fa859b', 'amz-sdk-request': b'attempt=1', 'Content-Length': '357'}>
botocore.parsers - DEBUG - Response headers: HTTPHeaderDict({'accept-ranges': 'bytes', 'cache-control': 'no-cache', 'content-length': '479', 'content-security-policy': 'block-all-mixed-content', 'content-type': 'application/xml', 'etag': '"f44ab58edcc877c4d00075b9db28e4e5-3"', 'server': 'MinIO', 'strict-transport-security': 'max-age=31536000; includeSubDomains', 'vary': 'Accept-Encoding', 'x-accel-buffering': 'no', 'x-amz-id-2': 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'x-amz-request-id': '1752F9DC0E0CE8AD', 'x-content-type-options': 'nosniff', 'x-xss-protection': '1; mode=block', 'date': 'Wed, 05 Apr 2023 07:38:54 GMT'})
botocore.parsers - DEBUG - Response body:
b'<?xml version="1.0" encoding="UTF-8"?>\n<CompleteMultipartUploadResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Location>http://localhost:9000/my-products/product=My%2520Fancy%2520Product/date=2023-01-10/a724b93c251a486b897eb7b151c622bd-0.parquet</Location><Bucket>my-products</Bucket><Key>product=My%20Fancy%20Product/date=2023-01-10/a724b93c251a486b897eb7b151c622bd-0.parquet</Key><ETag>&#34;f44ab58edcc877c4d00075b9db28e4e5-3&#34;</ETag></CompleteMultipartUploadResult>'
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <function check_for_200_error at 0x00000207A8422710>
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <aiobotocore.retryhandler.AioRetryHandler object at 0x00000207ADA9EE30>
botocore.retryhandler - DEBUG - No retry needed.
botocore.hooks - DEBUG - Event needs-retry.s3.CompleteMultipartUpload: calling handler <bound method AioS3RegionRedirector.redirect_from_error of <aiobotocore.utils.AioS3RegionRedirector object at 0x00000207ADA9EEF0>>

The difference in the debug output is the line starting with botocore.parsers - DEBUG - Response body:. In the XML part, the node <Key></Key> contains a URL-encoded string (pyarrow 11.0.0) vs. a "human readable" string (pyarrow 10.0.1). But the string is not consistently URL encoded: as mentioned before, the equal sign = is interpreted as expected.

It seems that the URL encoding/decoding isn't applied consistently?

Wild guess of mine: this behavior might have been introduced by #33598 and/or #33468.

Thanks, Sven

Component(s)

Python

westonpace commented 1 year ago

This was introduced by the solution for https://github.com/apache/arrow/issues/33448. It looks like we made a backwards-incompatible change here, which is unfortunate.

NOTICE: the Equal Sign = is URL encoded for the request, but won't become %3D on S3 filesystem. That means, the URL encoded equal sign = seems to be interpreted correctly

I'm not sure it's relevant to my greater point but I don't think the Equal Sign is encoded in the request:

product=My%20Fancy%20Product/date=2023-01-10/a724b93c251a486b897eb7b151c622bd-0.parquet

Unfortunately, it is a tricky problem. The encoding here is not there to support HTTP requests (in S3 all these paths go into the HTTP body and are not part of the URI) but rather to solve two different problems.

First, we need to support the concept of hive partitioning. In hive partitioning there is a special meaning behind the = and / characters because {x:3, y:7} gets encoded as x=3/y=7. This caused issues if the hive keys or hive values had / or = and so the solution was to encode the value (in retrospect I suppose we should be encoding the keys as well).

Second, most filesystems only support a restricted set of characters. Note that even S3 doesn't fully support spaces:

Space – Significant sequences of spaces might be lost in some uses (especially multiple spaces)

To solve this problem we are now using uriparser's RFC 3986 encode function. This is an imprecise approach: it converts more characters than strictly needed in all cases. However, there is some precedent for this (Spark) and I fear that anything narrower would be too complex and/or unintuitive.
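
To illustrate: a value containing / (or a character like a space) would otherwise be ambiguous or unreliable in a path, while the escaped segment still parses back to the original value. A rough Python-side sketch (output paraphrased from memory):

import pyarrow as pa
import pyarrow.dataset as ds

part = ds.HivePartitioning(pa.schema([("x", pa.string())]))

# The escaped segment round-trips back to a value containing '/';
# an unescaped 'x=3/4' would instead look like two path segments.
print(part.parse("/x=3%2F4"))  # roughly: (x == "3/4")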

I'd support a PR to turn encoding on and off entirely (either as an argument to a partitioning object or part of the write_dataset options). The default could be on and then users could choose to disable this feature. Users are then responsible for ensuring their partitioning values consist of legal characters for their filesystem.

svenatarms commented 1 year ago

Thanks for looking into the issue and tracking down the cause. I like the idea of being able to turn the encoding off for backwards compatibility. On our side, we'll change our application to ensure that partitioning values no longer contain characters like whitespace.
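
Roughly along these lines, before building the Arrow table (column name made up for illustration):

import pandas as pd

df = pd.DataFrame({"product": ["My Fancy Product"]})
# replace characters that would otherwise be escaped in the partition path
df["product"] = df["product"].str.replace(" ", "_", regex=False)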

westonpace commented 1 year ago

I've labeled this good-first-issue in case anyone wants to take a look at it. I'm happy to provide more context. The steps we need would be:

DarthData410 commented 1 year ago

I took a look at this a bit, and this is not a good fit for me to dive into right now. Maybe some other issue in the future.

jainamshah102 commented 1 year ago

I am interested in working on this issue. Can you provide some guidance and assistance in resolving it?

westonpace commented 1 year ago

@jainamshah102 that's great. You will first want to get a C++ development environment set up for Arrow and make sure you can build and run the tests (this is a complex task). The C++ development guide should help. In addition, you might want to look at the first PR guide if you have not made a PR for Arrow before.

Once everything is building correctly you will want to create a unit test that reproduces this issue. This would probably be in cpp/src/arrow/dataset/partition_test.cc. Some general context:

The class arrow::dataset::Partitioning is a pure virtual class (i.e. an interface) that turns paths into expressions and back. For example, a directory partitioning could turn the path /7/12 into the expression x == 7 && y == 12. A hive partitioning would turn that same expression into the path /x=7/y=12 (hive partitioning is key=value style, while directory partitioning omits the keys).
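
The Python bindings expose both flavors through pyarrow.dataset, which may help while getting oriented; a small sketch (outputs paraphrased in the comments):

import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("x", pa.int16()), ("y", pa.int16())])

# directory partitioning: keys are omitted from the path
print(ds.DirectoryPartitioning(schema).parse("/7/12"))   # ((x == 7) and (y == 12))

# hive partitioning: key=value segments
print(ds.HivePartitioning(schema).parse("/x=7/y=12"))    # ((x == 7) and (y == 12))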

This is done with two methods, Format and Parse. The problem here is with the HivePartitioning class. Currently, in Format, we URL-encode the path. Then, in Parse, we URL-decode the path. The ask is to add a new option to HivePartitioning (perhaps named escape_paths) which, if set to true, will use the current behavior and, if set to false, will skip the URL encoding/decoding.

Let me know if you run into more problems.

AlenkaF commented 1 year ago

Hi @jainamshah102, are you still interested in tackling this issue?

sahitya-pavurala commented 8 months ago

take

mitchelladam commented 7 months ago

This is the case for GCS as well as S3. We just encountered this when updating from pyarrow 10.0.1 to 14.0.2, but it is present in all versions from 11.0.0 onwards. It occurs with both the gcsfs library and pyarrow.fs.GcsFileSystem. Example code:

# Importing necessary libraries
import gcsfs
import pyarrow as pa
import pyarrow.fs as pafs
import pyarrow.dataset as ds
import datetime

# Creating a GCSFileSystem instance
fs = gcsfs.GCSFileSystem()

# Defining data and schema
data = {
    "some_timestamp": [datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=1),
                       datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=2),
                       datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=3)],
    "value1": ["hello", "world", "foo"],
    "value2": [123, 456, 789]
}
schema = pa.schema([
    pa.field("some_timestamp", pa.timestamp("ms")),
    pa.field("value1", pa.string()),
    pa.field("value2", pa.int64())
])

# Creating a PyArrow Table from the data
result_pya_table = pa.Table.from_pydict(data, schema=schema)

# Writing the dataset to a parquet file
ds.write_dataset(
    data=result_pya_table,
    base_dir="adam_ryota_data/pyarrowfstest/2023.12.2.post1-10.0.1/",
    format='parquet',
    partitioning=["some_timestamp"],
    partitioning_flavor='hive',
    existing_data_behavior='overwrite_or_ignore',
    basename_template="data-{i}.parquet",
    filesystem=fs
)

10.0.1 results in: (screenshot)

11.0.0 or higher results in: (screenshot)

Note that the overall URI is not the part being encoded; only the partition directories written within the dataset are affected. When the hive partition is written manually as part of the path:

# Importing necessary libraries
import gcsfs
import pyarrow as pa
import pyarrow.fs as pafs
import pyarrow.dataset as ds
import datetime

# Creating a GCSFileSystem instance
fs = gcsfs.GCSFileSystem()

# Defining data and schema
data = {
    # "some_timestamp": [datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=1),
    #                    datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=2),
    #                    datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=3)],
    "value1": ["hello", "world", "foo"],
    "value2": [123, 456, 789]
}
schema = pa.schema([
    # pa.field("some_timestamp", pa.timestamp("ms")),
    pa.field("value1", pa.string()),
    pa.field("value2", pa.int64())
])

# Creating a PyArrow Table from the data
result_pya_table = pa.Table.from_pydict(data, schema=schema)

# Writing the dataset to a parquet file
ds.write_dataset(
    data=result_pya_table,
    base_dir="adam_ryota_data/manualhive/2023.12.2.post1-10.0.1/some_timestamp=2024-04-07 11:13:27.169/",
    format='parquet',
    # partitioning=["some_timestamp"],
    # partitioning_flavor='hive',
    existing_data_behavior='overwrite_or_ignore',
    basename_template="data-{i}.parquet",
    filesystem=fs
)

Even in 11.0.0+ the data is written as expected. (screenshot)