apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Dedicated flavor value for `DirectoryPartitioning` #43863

Open soxofaan opened 2 months ago

soxofaan commented 2 months ago

Describe the enhancement requested

pyarrow.dataset.partitioning(..., flavor=...) supports three flavor values:

The default (None) is DirectoryPartitioning. Specify flavor="hive" for a HivePartitioning, and flavor="filename" for a FilenamePartitioning.
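To make the difference between the three flavors concrete, here is a minimal pure-Python sketch of how each one lays out the same partition values in a path. The helper `partition_path` is hypothetical and not part of pyarrow; it only mimics the path conventions of the three partitioning classes.

```python
def partition_path(values, flavor=None):
    """Hypothetical helper (not a pyarrow API): render partition values
    the way each partitioning flavor encodes them in a path."""
    if flavor is None:
        # DirectoryPartitioning: one path segment per value,
        # field names come from the schema, not the path.
        return "/".join(str(v) for v in values.values())
    if flavor == "hive":
        # HivePartitioning: "key=value" path segments.
        return "/".join(f"{k}={v}" for k, v in values.items())
    if flavor == "filename":
        # FilenamePartitioning: values form a "_"-separated file name prefix.
        return "".join(f"{v}_" for v in values.values())
    raise ValueError(f"Unsupported flavor: {flavor!r}")


values = {"year": 2009, "month": 11}
print(partition_path(values))              # 2009/11
print(partition_path(values, "hive"))      # year=2009/month=11
print(partition_path(values, "filename"))  # 2009_11_
```

Note that only the default case is selected by the absence of a name, which is the asymmetry this request is about.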

So to choose DirectoryPartitioning, one has to specify None, which does not feel very future-proof (see also #30888 and #30889) and lacks the explicitness and self-documenting quality of the other options ("filename" and "hive").

Wouldn't it be better to support "directory" as a flavor option and make it the default?

This also applies to related functionality such as pyarrow.dataset.write_dataset(..., partitioning_flavor=...) and pyarrow.dataset.dataset(..., partitioning=...).

Component(s)

Python

raulcd commented 2 months ago

Are you suggesting something like this?

diff --git a/python/pyarrow/dataset.py b/python/pyarrow/dataset.py
index 1efbfe1..9afb3fe 100644
--- a/python/pyarrow/dataset.py
+++ b/python/pyarrow/dataset.py
@@ -118,7 +118,7 @@ def __getattr__(name):
     )

-def partitioning(schema=None, field_names=None, flavor=None,
+def partitioning(schema=None, field_names=None, flavor="directory",
                  dictionaries=None):
     """
     Specify a partitioning scheme.
@@ -220,7 +220,7 @@ def partitioning(schema=None, field_names=None, flavor=None,

     >>> part = ds.partitioning(flavor="hive")
     """
-    if flavor is None:
+    if flavor is None or flavor == "directory":
         # default flavor
         if schema is not None:
             if field_names is not None:

Plus the corresponding changes to the other related functionality. Being explicit sounds sensible to me. CC @jorisvandenbossche
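For illustration only, the check in the diff above could be factored into a small normalization step that the related entry points might share. This is a sketch under assumptions: `normalize_flavor` and `_KNOWN_FLAVORS` are hypothetical names and do not exist in pyarrow; None is kept as a backward-compatible alias for "directory".

```python
# Hypothetical names, not part of pyarrow.
_KNOWN_FLAVORS = ("directory", "hive", "filename")


def normalize_flavor(flavor="directory"):
    """Map the legacy None default onto the explicit "directory" flavor
    and reject anything that is not a known flavor."""
    if flavor is None:
        # Backward compatibility: None used to mean DirectoryPartitioning.
        return "directory"
    if flavor not in _KNOWN_FLAVORS:
        raise ValueError(f"Unsupported flavor: {flavor!r}")
    return flavor


print(normalize_flavor())        # directory
print(normalize_flavor(None))    # directory
print(normalize_flavor("hive"))  # hive
```

Existing callers that pass flavor=None would keep working, while new code could spell out "directory" explicitly.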

soxofaan commented 2 months ago

Indeed, that would be the core of my feature request.