apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R] Unable to disable url-encoding #41618

Open r2evans opened 4 months ago

r2evans commented 4 months ago

Describe the bug, including details regarding any error messages, version, and platform.

I have a local datamart of various table schemas using hive partitioning. Non-arrow (and non-R) tools also access the directories, and it would be nice not to have to search for names both with and without URL encoding. I cannot find an option or an argument that allows me to disable it. I recognize that perhaps S3 buckets might require it, but it seems like a bug (or mis-design?) that we cannot disable this otherwise disruptive and undocumented feature. Is this really silently hard-coded and required in all instances?

The datamart is on a local filesystem, and spaces are (obviously) fully permissible in directory names.

At a minimum, I feel documentation in write_dataset would be appropriate, though it would be really useful to not have to change all other utilities to work around this seemingly unnecessary behavior.

R-4.3.2 and arrow_15.0.1.

mt <- mtcars
mt$key <- paste(mt$cyl, mt$gear)
(td <- tempfile(fileext=".d"))
# [1] "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d"
dir.create(td)
res <- arrow::write_dataset(mt, path = td, partitioning = "key")
res
# NULL
Sys.glob(paste0(td, "/*/*"))
# [1] "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=4%203/part-0.parquet" "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=4%204/part-0.parquet"
# [3] "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=4%205/part-0.parquet" "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=6%203/part-0.parquet"
# [5] "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=6%204/part-0.parquet" "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=6%205/part-0.parquet"
# [7] "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=8%203/part-0.parquet" "/home/r2/tmp/RtmpAsPlcj/file185e1d942690.d/key=8%205/part-0.parquet"

There is nothing in the return value that suggests the partitioning keys were url-encoded.
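Since the return value gives no hint, one way to spot the encoding is to compare the on-disk directory names with their decoded forms. A minimal sketch using only base R, with a few directory names copied from the output above; it assumes `utils::URLdecode()` reverses the escaping for these names, which holds for `%20` but has not been checked against every character arrow may escape.

```r
# Partition directory names as they appear on disk (copied from the output above).
on_disk <- c("key=4%203", "key=4%204", "key=6%205")

# Decode one element at a time; this is safe whether or not URLdecode()
# is vectorized in the installed R version.
decoded <- vapply(on_disk, utils::URLdecode, character(1), USE.NAMES = FALSE)

# A name that changes under decoding was percent-encoded when written.
was_encoded <- decoded != on_disk

print(data.frame(on_disk, decoded, was_encoded))
```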

Component(s)

R

amoeba commented 4 months ago

Hi @r2evans, this is something we're aware of, see https://github.com/apache/arrow/issues/34905#issuecomment-1502040774. It's unfortunately not as simple as one approach being clearly better than the other. I don't think anyone's actively working on it, so if you wanted to take on the work as described there, that'd be very welcome.

r2evans commented 4 months ago

Huh, I swear I searched issues for "url" and "encode", don't know why I didn't see that. At least it's good to know I'm not the only one who finds it not obvious. I understand the issues with something like (e.g.) S3 not allowing spaces, which is why I suggested at least documenting it. The necessary steps/hints in https://github.com/apache/arrow/issues/34905#issuecomment-1523744152 are really useful, though it seems less likely that somebody is going to be able and willing to alter the underlying C++ as well as R and Python.

An interesting (to me) note: despite requiring the url-encoding when writing the partitioning values, arrow does not require it when reading them back. This means that for my datamart, I can rename the directories immediately post-write (it's part of the datamart process anyway, for various reasons) and nobody is the wiser.
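The post-write rename could look roughly like the sketch below. It uses only base R; `decode_partition_dirs()` is a hypothetical helper, not part of arrow, and it assumes `utils::URLdecode()` undoes whatever escaping was applied to the directory names. Names containing a stray `%` that is not part of an escape sequence may decode oddly, so treat this as a starting point rather than a tested tool.

```r
# Hypothetical helper (not part of arrow): rename percent-encoded partition
# directories under `root` to their decoded names.
decode_partition_dirs <- function(root) {
  dirs <- list.dirs(root, full.names = TRUE, recursive = TRUE)
  dirs <- setdiff(dirs, root)  # leave the dataset root itself alone

  # A descendant's path is always longer than its ancestor's, so sorting by
  # length (longest first) renames children before their parents and keeps
  # every remaining path valid.
  dirs <- dirs[order(nchar(dirs), decreasing = TRUE)]

  for (old in dirs) {
    decoded <- utils::URLdecode(basename(old))
    if (identical(decoded, basename(old))) next  # nothing to decode

    new <- file.path(dirname(old), decoded)
    if (file.exists(new)) {
      warning("Skipping ", old, ": target already exists")
      next
    }
    file.rename(old, new)
  }
  invisible(root)
}

# Demonstration on a throwaway directory tree that mimics the layout above.
root <- tempfile(fileext = ".d")
dir.create(file.path(root, "key=4%203"), recursive = TRUE)
file.create(file.path(root, "key=4%203", "part-0.parquet"))

decode_partition_dirs(root)
list.files(root, recursive = TRUE)
```

Whether the renamed tree still reads back cleanly is based only on the observation above, so it is worth checking against a real dataset before relying on it.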

Thanks.

amoeba commented 4 months ago

If that's the case, there may be some latent bugs in the implementation since the original PR that changed things was made.

PS: GitHub's search has gotten worse for people recently, so it may be harder than normal to find things for a while.