apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python][Parquet] Attempt to encrypt column of type 'list' produces OSError #41246

Open tritzman opened 6 months ago

tritzman commented 6 months ago

Describe the bug, including details regarding any error messages, version, and platform.

pyarrow 15.0.2

Changing the table definition in the example at python/examples/parquet_encryption/sample_vault_kms_client.py to this:

    table = pa.Table.from_pydict({
        'a': pa.array([1, 2, 3]),
        'b': pa.array([['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]),
        'c': pa.array(['x', 'y', 'z'])
    })

produces an exception: OSError: Encrypted column b not in file schema

I expected the encryption to work on a list type. I didn't test other non-fundamental types (like structs) or nested types. I did look for clarification on the capability in Parquet and Arrow, without much luck. Apologies if I missed something.

Thanks for Arrow, it's quite nice!

Component(s)

Python Parquet

tritzman commented 6 months ago

In my application code, when I call write_dataset, I have a file_visitor that collects metadata as Parquet files are created. Looking at the metadata of each pyarrow.dataset.WrittenFile, I find path_in_schema, which shows that lists are stored in Parquet under the name <column_name>.list.element. Adding that suffix to the column name listed under col_b_key_name (see column_keys below) results in proper operation, including the assert comparison between the input table and output table. (ATM I'm not sure how to confirm that all data is actually encrypted.)

    column_keys={
        col_a_key_name: ["a"],
        col_b_key_name: ["b.list.element"],
    }

Similarly, my application data includes structs. There I found a path_in_schema entry for each field of the struct. I believe this would require a key declaration for each struct field (e.g. <column_name>.field_1, <column_name>.field_2, <column_name>.field_3, etc.).

I have not looked into nested structs-of-lists or lists-of-structs to see how those are represented in Parquet.

It seems reasonable to have the developer list the column names to encrypt. But for non-primitive types, I'm not sure how they would know the modified column path used in the file.

In my application code, when writing encrypted Parquet, Python silently crashes in the previously mentioned file visitor. The application just exits with no messages or exceptions. This happens when calling .metadata.to_dict() on the pyarrow.dataset.WrittenFile. By setting a breakpoint and experimenting in the debugger, I found the same symptom when calling to_dict() on metadata.row_group(0). I won't be collecting and writing the _metadata or _common_metadata files when encrypting the data, so this code is normally disabled. But I figured it was worth noting the crash.