Open r2evans opened 4 months ago
Hi @r2evans, this is something we're aware of, see https://github.com/apache/arrow/issues/34905#issuecomment-1502040774. It's unfortunately not as simple as one approach being clearly better than the other. I don't think anyone's actively working on it, so if you wanted to take on the work as described there, that'd be very welcome.
Huh, I swear I searched issues for "url" and "encode", don't know why I didn't see that. At least it's good to know I'm not the only one who finds it not obvious. I understand the issues with something like (e.g.) S3 not allowing spaces, which is why I suggested at least documenting it. The necessary steps/hints in https://github.com/apache/arrow/issues/34905#issuecomment-1523744152 are really useful, though it seems less likely that somebody is going to be able and willing to alter the underlying C++ as well as R and Python.
An interesting (to me) note: despite requiring URL-encoding when writing the partitioning values, arrow does not require it when reading them. This means for my datamart, I can rename the directories immediately post-write (it's part of the datamart process anyway, for various reasons) and nobody is the wiser.
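A minimal sketch of that post-write rename step, assuming the encoded directory names are ordinary percent-encoding that Python's `urllib.parse.unquote` can reverse. The function name `decode_partition_dirs` and the directory layout are hypothetical, not part of arrow, and the exact set of characters arrow escapes has not been checked here.

```python
import os
from urllib.parse import unquote


def decode_partition_dirs(root):
    """Rename percent-encoded directory names under `root` to their decoded form.

    Returns a list of (old_path, new_path) pairs for the renames performed.
    """
    renamed = []
    # Walk bottom-up so child directories are renamed before their parents.
    for parent, dirnames, _files in os.walk(root, topdown=False):
        for name in dirnames:
            decoded = unquote(name)
            if decoded == name:
                continue  # nothing in this name was encoded
            if "/" in decoded or os.sep in decoded or decoded in (".", ".."):
                continue  # decoded form is not a valid single directory name
            old_path = os.path.join(parent, name)
            new_path = os.path.join(parent, decoded)
            if os.path.exists(new_path):
                continue  # never overwrite an existing directory
            os.rename(old_path, new_path)
            renamed.append((old_path, new_path))
    return renamed
```

Running it on the dataset root immediately after the write would leave only the decoded names on disk. It skips any name whose decoded form would collide with an existing directory, so those cases still need manual attention.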
Thanks.
If that's the case, there may be some latent bugs in the implementation since the original PR that changed things was made.
PS: GitHub's search has gotten worse for people recently, so it may currently be harder than normal to find things for a while.
Describe the bug, including details regarding any error messages, version, and platform.
I have a local datamart of various table schemas using hive partitioning. There are non-arrow (and non-R) tools accessing the directories, so it would be nice to not have to search for names both with and without URL encoding. I cannot find an option or an argument that allows me to disable it. I recognize that perhaps S3 buckets might require it, but it seems like a bug (or mis-design?) that we cannot disable this otherwise disruptive and undocumented feature. Is this really silently hard-coded and required in all instances?
The datamart is on a local filesystem, and spaces are (obviously) fully permissible in directory names.
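To illustrate the kind of workaround this forces on other tools, here is a rough sketch of a lookup that tries both spellings of a hive partition directory. The helper name `find_partition_dir` is hypothetical, and it assumes the encoded form matches what `urllib.parse.quote` produces with no "safe" characters, which may differ from what arrow actually escapes.

```python
import os
from urllib.parse import quote


def find_partition_dir(root, key, value):
    """Return the existing directory for key=value, or None if neither form exists."""
    text = str(value)
    candidates = [
        f"{key}={text}",                  # plain name, e.g. written by another tool
        f"{key}={quote(text, safe='')}",  # percent-encoded name (assumed form)
    ]
    for name in candidates:
        path = os.path.join(root, name)
        if os.path.isdir(path):
            return path
    return None
```

Every non-arrow consumer of the datamart would need something equivalent, which is the maintenance cost described above.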
At a minimum, I feel documentation in `write_dataset` would be appropriate, though it would be really useful to not have to change all other utilities to work around this seemingly unnecessary behavior.

R-4.3.2 and `arrow_15.0.1`.

There is nothing in the return value that suggests the partitioning keys were url-encoded.
Component(s)
R