apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.31k stars 3.48k forks source link

[C++] Slash character in partition value handling #33448

Closed asfimport closed 1 year ago

asfimport commented 1 year ago

 

Provided example shows that pyarrow does not handle partition value that contains '/' correctly:


import pandas as pd
import pyarrow as pa

from pyarrow import dataset as ds

df = pd.DataFrame({
    'value': [1, 2],
    'instrument_id': ['A/Z', 'B'],
})

ds.write_dataset(
    data=pa.Table.from_pandas(df),
    base_dir='data',
    format='parquet',
    partitioning=['instrument_id'],
    partitioning_flavor='hive',
)

table = ds.dataset(
    source='data',
    format='parquet',
    partitioning='hive',
).to_table()

tables = [table]

df = pa.concat_tables(tables).to_pandas()  tables = [table]

df = pa.concat_tables(tables).to_pandas() 

print(df.head())

Result:


   value instrument_id
0      1             A
1      2             B 

Expected behaviour: Option 1: Result should be:


   value instrument_id
0      1             A/Z
1      2             B 

Option 2: Error should be raised to avoid '/' in partition value.

 

 

 

Reporter: Vadym Dytyniak Assignee: Vibhatha Lakmal Abeykoon / @vibhatha

PRs and other links:

Note: This issue was originally created as ARROW-18269. Please see the migration documentation for further details.

asfimport commented 1 year ago

Weston Pace / @westonpace: Today we URI unescape fields as we read them. For this to work I think we need to URI escape fields as we write them which it does not appear we are doing.

asfimport commented 1 year ago

Vadym Dytyniak: @westonpace Can I wait for a fix?

asfimport commented 1 year ago

Weston Pace / @westonpace: I'm not personally working on this at the moment but it seems pretty straightforward if someone wants to pick it up.

asfimport commented 1 year ago

Vibhatha Lakmal Abeykoon / @vibhatha: [~dytyniak] Imagine the following situation


    import pandas as pd
    import pyarrow as pa

    from pyarrow import dataset as ds

    path = tmpdir / "slash-writer-x"

    df = pd.DataFrame({
        'exp_id': [1, 2, 1, 3, 6],
        'exp_meta': ["experiment/A/f.csv", "experiment/B/f.csv", 
                     "experiment/A/d.csv", "experiment/C/k.csv",
                     "experiment/M/i.csv"],
    })

    dt_table = pa.Table.from_pandas(df)

    ds.write_dataset(
        data=dt_table,
        base_dir=path,
        format='parquet',
        partitioning=['exp_meta'],
        partitioning_flavor='hive',
    )

    table = ds.dataset(
        source=path,
        format='parquet',
        partitioning='hive',
        schema = pa.schema([pa.field("exp_id", pa.int32()), pa.field("exp_meta", pa.utf8())])
    ).to_table()

    print(table)

    df = pa.concat_tables([table]).to_pandas()  

    print(df.head())

If we go for option2, the users won't be able to handle such situation. But we could suggest them to do it in a different way. But it would require them to encode and decode the URIs. If this is billions of raws, and that would be really expensive.

WDYT?

cc @westonpace

asfimport commented 1 year ago

Weston Pace / @westonpace: I think we should encode and decode the URIs for the user. In fact, we already decode URIs. I don't expect the encoding to be too expensive.

asfimport commented 1 year ago

Vibhatha Lakmal Abeykoon / @vibhatha: @westonpace 

So here the context is that, the partition column data is being used to formulate the save directory path. When there is a '/' in data, this value get implicitly considered as a separator when we form the directory path. Thus A/Z makes a A folder and Z inside it. Not sure we can remove that part or ask the code to ignore it. 

But, in the reading part, when we recreate the fragments, we could decide whether to consider it as a path or just as a single value. If we consider it as a path (which is being done at the moment), we would get the erroneous output, but if we say don't consider it as a path, but as a non-path, we could retrieve the value accurately. 

This is one viable option. If we do that, we can provide a lamda or flag to determine this behavior. 

I think a function to determine the key decoding from the file path would be better. 

Is this overly complicated or a non-generic solution?

Although I am inclined towards option 1 and not option 2. Option 2 is pretty straightforward to do, but a case as mentioned above could be very common.

How is the URL encoding/decoding part relevant here? Am I missing something?

Could you please clarify a bit? 

asfimport commented 1 year ago

Weston Pace / @westonpace: On writing, we encode each component. If there is a / in the data, it should turn into %2F and so A/Z makes A%2FZ and the file would be stored at A%2FZ/chunk-0.parquet. On reading, we decode each component. So if we read A%2FZ then that turns into the string A/Z.

There is no configuration from the user required and so I think it is simpler.

asfimport commented 1 year ago

Vibhatha Lakmal Abeykoon / @vibhatha: Nice idea, although when we are reading, how do we know if this is encoded field or not? Just imagine the original key was A%2FZ and even if we detected encoded data, if it was the original data in the dataset, how do we know whether to decode or not? Or we just do encode decode no matter what. Is it a wise thing to do considering the performance?

asfimport commented 1 year ago

Antoine Pitrou / @pitrou: Issue resolved by pull request 14646 https://github.com/apache/arrow/pull/14646