apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Dataset] Support URL encoding of partition field values for the file path #29439

Open asfimport opened 2 years ago

asfimport commented 2 years ago

In ARROW-12644, we added support for decoding the file paths when reading datasets. So a valid follow-up question: should we also support encoding when writing datasets?

(see also https://github.com/apache/arrow/issues/11027)

Rereading ARROW-12644, there wasn't yet much discussion on that aspect.

cc @westonpace @lidavidm

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-13813. Please see the migration documentation for further details.

asfimport commented 2 years ago

David Li / @lidavidm: I would think we should be able to.

asfimport commented 2 years ago

Weston Pace / @westonpace: I think my only concern is that this is something the user should be able to easily do themselves using the compute stuff. They could use a scanner to read in their data, project the offending column through an encoding kernel, and then partition on the projected column.

However, since we already have segment encoding in partition objects it seems straightforward enough to provide. It might be a good project to pair with ARROW-11378 if someone is looking for some good beginner C++ tasks.
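
A minimal sketch of the workaround described above, written against the Python bindings since that is what the snippet later in this thread uses. Arrow does not (yet) ship a URL-encode compute kernel (that is exactly what this issue and the follow-up JIRA ask for), so urllib.parse.quote stands in for that hypothetical kernel; the table, column, and dataset names are illustrative.

from urllib.parse import quote

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({'a': ['A', 'B', 'A/B', ''], 'b': range(4)})

# Calculate the encoded column up front (urllib.parse.quote stands in for the
# URL-encode kernel requested here) and partition on it instead of 'a'.
encoded = pa.array([quote(v, safe='') for v in table['a'].to_pylist()])
table = table.append_column('a_encoded', encoded)

ds.write_dataset(
    table.drop(['a']), "test_dataset_encoded", format="parquet",
    partitioning=ds.partitioning(pa.schema([('a_encoded', pa.string())])))

# Note: quote() leaves the empty string untouched, so it still produces an
# empty path segment; see the discussion further down in this thread.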

asfimport commented 2 years ago

David Li / @lidavidm: Then should we make this JIRA about implementing a URL-encode (and probably a corresponding URL-decode) kernel?

asfimport commented 2 years ago

Weston Pace / @westonpace: That can be a new JIRA. I think we can keep this open as it is to discuss and track using those kernels to provide utility in a partitioning object.

asfimport commented 2 years ago

Weston Pace / @westonpace: I'll create the new JIRA

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: Currently we don't support "transforms" for the partitioning column (something we maybe should, so that you could also say "year(date_column)" to partition on), which means that you need to calculate such a URL-encoded column up front. That is not necessarily ideal, both performance/memory-wise (although in a (lazy/batched) query execution context this might not matter) and from a usability perspective.

(especially for that last point (usability), I think it would be nice to support this within the "write to partitioned data" step)
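
As an aside, a rough sketch of what "calculating the column up front" looks like today, using the year(date_column) example mentioned above; the dataset and column names are illustrative, and pyarrow.compute.year is used only because it already exists, unlike a URL-encode kernel.

import datetime

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

table = pa.table({
    'date_column': [datetime.date(2020, 1, 1), datetime.date(2021, 6, 15)],
    'value': [1, 2],
})

# The transform has to be materialized as a real column before writing,
# since the partitioning cannot apply year(date_column) on the fly.
table = table.append_column('year', pc.year(table['date_column']))

ds.write_dataset(
    table, "test_dataset_by_year", format="parquet",
    partitioning=ds.partitioning(pa.schema([('year', pa.int64())])))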

Another thought: currently, when you partition on a column with string values, you can easily and silently get invalid directories. Small code snippet:


import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({'a': ['A', 'B', 'A/B', ''], 'b': range(4)})
# 'A/B' silently becomes a nested directory and '' an empty path segment
ds.write_dataset(
    table, "test_dataset_invalid_strings.parquet", format="parquet",
    partitioning=ds.partitioning(table.select(['a']).schema))
ds.dataset("test_dataset_invalid_strings.parquet/", partitioning=['a']).to_table().to_pandas()

Not necessarily directly this issue, but a related question: should we write this silently? Or should we actually check, when creating the file path, that the value inserted for the partition field is a "valid" string for file paths (so e.g. no /, not an empty string, ...), and raise an error instead of creating a wrong dataset? Or should we automatically encode those values? (I don't know what the overhead of validating those strings would be, but it would also only be needed for certain data types, e.g. not if your partition field was originally an integer.)
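
For what it's worth, here is one possible shape of such a check, sketched with existing compute kernels; this is not an existing dataset option, just an illustration of the kind of validation being asked about.

import pyarrow as pa
import pyarrow.compute as pc

def check_partition_values(column):
    # Hypothetical pre-write validation: reject string values that would
    # produce invalid or surprising path segments.
    has_separator = pc.any(pc.match_substring(column, "/")).as_py()
    has_empty = pc.any(pc.equal(column, "")).as_py()
    if has_separator or has_empty:
        raise ValueError(
            "partition values contain '/' or empty strings; "
            "encode or clean them before writing")

table = pa.table({'a': ['A', 'B', 'A/B', ''], 'b': range(4)})
check_partition_values(table['a'])  # raises because of 'A/B' and ''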

asfimport commented 2 years ago

David Li / @lidavidm:

> Currently we don't support "transforms" for the partitioning column (something we maybe should, so that you could also say "year(date_column)" to partition on), which means that you need to calculate such a URL-encoded column up front, which is not necessarily ideal, both performance/memory-wise (although in a (lazy/batched) query execution context this might not matter) and from a usability perspective.

I agree, this would be useful to support. I assume once we have a node for writing, this will be easier.

> should we write this silently? Or should we actually check when creating the file path that the value inserted for the partition field is a "valid" string for file paths (so e.g. no /, not an empty string, ...), and raise an error instead of creating a wrong dataset? Or should we automatically encode those?

On Windows there are more invalid path characters to consider, but I agree, we should raise an error instead of writing unreadable data. (I think for auto-encoding values, that can be done by bindings and doesn't necessarily have to be done in C++? e.g. the Python/R bindings could offer an option to wrap string partition fields in a URL-encode pass.)
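
To make the binding-level idea concrete, a hedged sketch of what such an option could look like on the Python side; write_dataset_encoded and its arguments are hypothetical, and urllib.parse.quote again stands in for a proper URL-encode pass.

from urllib.parse import quote

import pyarrow as pa
import pyarrow.dataset as ds

def write_dataset_encoded(table, base_dir, partition_fields, **kwargs):
    # Hypothetical wrapper: URL-encode string partition columns before
    # delegating to ds.write_dataset, so the resulting file paths stay valid.
    for name in partition_fields:
        if pa.types.is_string(table.schema.field(name).type):
            values = table[name].to_pylist()
            encoded = pa.array(
                [None if v is None else quote(v, safe='') for v in values],
                type=pa.string())
            table = table.set_column(
                table.schema.get_field_index(name), name, encoded)
    partitioning = ds.partitioning(
        pa.schema([table.schema.field(name) for name in partition_fields]))
    ds.write_dataset(table, base_dir, partitioning=partitioning, **kwargs)

table = pa.table({'a': ['A', 'B', 'A/B', ''], 'b': range(4)})
write_dataset_encoded(table, "test_dataset_wrapped", ['a'], format="parquet")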