apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][R] Support a "modified" hive style directory naming scheme #29738

Open asfimport opened 3 years ago

asfimport commented 3 years ago

I am working on a project where I need to create and analyze Parquet files using Apache Arrow, but the environment I'm working in does not allow "=" in file paths, which the hive naming convention requires (e.g. "year=2007"). While I can specify the partitioning so that it does not use the hive convention, I then lose the variable names. This is a problem when I share the datasets with others: they have to specify the partitioning variables when opening the dataset, but they don't know what those variables are.

Would it be possible to allow a modified hive-style directory naming convention that still preserves the variable name in the directory name? For example, allowing a delimiter other than "="?
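To make the request concrete, here is a minimal sketch of what parsing a "modified" hive-style path could look like. It is plain Python and not an Arrow API: the function name `parse_partition_path`, the `delimiter` parameter, and the choice of "~" as an alternative delimiter are all assumptions made for illustration.

```python
from pathlib import PurePosixPath


def parse_partition_path(path, delimiter="="):
    """Extract {name: value} pairs from directory segments such as 'year=2007'.

    `delimiter` is a hypothetical option for illustration; standard
    hive-style naming always uses '='.
    """
    keys = {}
    # Only directory components carry partition keys, so skip the file name.
    for segment in PurePosixPath(path).parent.parts:
        name, sep, value = segment.partition(delimiter)
        # Segments without the delimiter (e.g. the dataset root) are ignored.
        if sep and name:
            keys[name] = value
    return keys


# Standard hive-style layout.
print(parse_partition_path("data/year=2007/month=1/part-0.parquet"))
# Same layout with an assumed alternative delimiter that avoids '='.
print(parse_partition_path("data/year~2007/month~1/part-0.parquet", delimiter="~"))
```

Both calls recover the same variable names from the directory names alone, which is the property lost when the hive convention is dropped entirely. Values come back as strings; type inference is left out of this sketch.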

Reporter: Ryan Hafen

Note: This issue was originally created as ARROW-14149. Please see the migration documentation for further details.

asfimport commented 3 years ago

Ryan Hafen: Would one option here be that, when you don't use hive-style names, the partitioning variables are automatically stored as metadata when you write the file, and when you read it, that metadata is looked up and used to specify the partitioning variables for you? I'm thinking of something like this being implemented in the high-level functions, such as write_dataset() and open_dataset() in R.
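A rough sketch of that metadata idea, again in plain Python rather than any existing Arrow feature: the partition variable names are written to a sidecar file next to the data and read back when the dataset is opened. The file name `_partitioning.json`, the JSON layout, and both helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sidecar file name; not an Arrow convention.
SIDECAR_NAME = "_partitioning.json"


def write_partition_metadata(root, field_names):
    """Record the partition variable names at the dataset root."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    payload = {"partition_fields": list(field_names)}
    (root / SIDECAR_NAME).write_text(json.dumps(payload))


def read_partition_keys(root, file_path):
    """Pair the stored names with the directory values of one data file."""
    root = Path(root)
    fields = json.loads((root / SIDECAR_NAME).read_text())["partition_fields"]
    # Directories between the root and the file hold the values, in order.
    values = Path(file_path).relative_to(root).parent.parts
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} partition levels, got {len(values)}")
    return dict(zip(fields, values))


# Usage: directory names carry only values ("2007/1"), names come from the sidecar.
root = Path(tempfile.mkdtemp())
write_partition_metadata(root, ["year", "month"])
print(read_partition_keys(root, root / "2007" / "1" / "part-0.parquet"))
```

With this layout no "=" appears in any path, and a recipient of the dataset does not need to know the partitioning variables in advance.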