Closed asfimport closed 1 year ago
Weston Pace / @westonpace: Today we URI unescape fields as we read them. For this to work I think we need to URI escape fields as we write them which it does not appear we are doing.
Vadym Dytyniak: @westonpace Can I wait for a fix?
Weston Pace / @westonpace: I'm not personally working on this at the moment but it seems pretty straightforward if someone wants to pick it up.
Vibhatha Lakmal Abeykoon / @vibhatha:
[~dytyniak]
Imagine the following situation
import pandas as pd
import pyarrow as pa
from pyarrow import dataset as ds
path = tmpdir / "slash-writer-x"
df = pd.DataFrame({
'exp_id': [1, 2, 1, 3, 6],
'exp_meta': ["experiment/A/f.csv", "experiment/B/f.csv",
"experiment/A/d.csv", "experiment/C/k.csv",
"experiment/M/i.csv"],
})
dt_table = pa.Table.from_pandas(df)
ds.write_dataset(
data=dt_table,
base_dir=path,
format='parquet',
partitioning=['exp_meta'],
partitioning_flavor='hive',
)
table = ds.dataset(
source=path,
format='parquet',
partitioning='hive',
schema = pa.schema([pa.field("exp_id", pa.int32()), pa.field("exp_meta", pa.utf8())])
).to_table()
print(table)
df = pa.concat_tables([table]).to_pandas()
print(df.head())
If we go for option2, the users won't be able to handle such situation. But we could suggest them to do it in a different way. But it would require them to encode and decode the URIs. If this is billions of raws, and that would be really expensive.
WDYT?
cc @westonpace
Weston Pace / @westonpace: I think we should encode and decode the URIs for the user. In fact, we already decode URIs. I don't expect the encoding to be too expensive.
Vibhatha Lakmal Abeykoon / @vibhatha: @westonpace
So here the context is that, the partition column data is being used to formulate the save directory path. When there is a '/' in data, this value get implicitly considered as a separator when we form the directory path. Thus A/Z
makes a A
folder and Z
inside it. Not sure we can remove that part or ask the code to ignore it.
But, in the reading part, when we recreate the fragments, we could decide whether to consider it as a path or just as a single value. If we consider it as a path (which is being done at the moment), we would get the erroneous output, but if we say don't consider it as a path, but as a non-path, we could retrieve the value accurately.
This is one viable option. If we do that, we can provide a lamda or flag to determine this behavior.
I think a function to determine the key decoding from the file path would be better.
Is this overly complicated or a non-generic solution?
Although I am inclined towards option 1 and not option 2. Option 2 is pretty straightforward to do, but a case as mentioned above could be very common.
How is the URL encoding/decoding part relevant here? Am I missing something?
Could you please clarify a bit?
Weston Pace / @westonpace:
On writing, we encode each component. If there is a /
in the data, it should turn into %2F
and so A/Z
makes A%2FZ
and the file would be stored at A%2FZ/chunk-0.parquet
. On reading, we decode each component. So if we read A%2FZ
then that turns into the string A/Z
.
There is no configuration from the user required and so I think it is simpler.
Vibhatha Lakmal Abeykoon / @vibhatha:
Nice idea, although when we are reading, how do we know if this is encoded field or not? Just imagine the original key was A%2FZ
and even if we detected encoded data, if it was the original data in the dataset, how do we know whether to decode or not? Or we just do encode decode no matter what. Is it a wise thing to do considering the performance?
Antoine Pitrou / @pitrou: Issue resolved by pull request 14646 https://github.com/apache/arrow/pull/14646
Provided example shows that pyarrow does not handle partition value that contains '/' correctly:
Result:
Expected behaviour: Option 1: Result should be:
Option 2: Error should be raised to avoid '/' in partition value.
Reporter: Vadym Dytyniak Assignee: Vibhatha Lakmal Abeykoon / @vibhatha
PRs and other links:
Note: This issue was originally created as ARROW-18269. Please see the migration documentation for further details.