Related: #3291
Thank you for the report, especially thank you for the reproducer. However, I think the issue here is slightly different from what you have diagnosed.
Running your reproducer writes a parquet file with the following schema:
$ cargo run --bin parquet-schema --features cli -- --file-path ~/test.parquet
Metadata for file: /home/raphael/test.parquet
version: 1
num of rows: 1
created by: fastparquet-python version 2022.12.0 (build 0)
metadata:
pandas: {"column_indexes": [{"field_name": null, "metadata": null, "name": null, "numpy_type": "object", "pandas_type": "mixed-integer"}], "columns": [{"field_name": "a", "metadata": null, "name": "a", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "b", "metadata": null, "name": "b", "numpy_type": "object", "pandas_type": "unicode"}, {"field_name": "c", "metadata": null, "name": "c", "numpy_type": "object", "pandas_type": "mixed"}, {"field_name": "e", "metadata": null, "name": "e", "numpy_type": "object", "pandas_type": "mixed"}], "creator": {"library": "fastparquet", "version": "2022.12.0"}, "index_columns": [{"kind": "range", "name": null, "start": 0, "step": 1, "stop": 1}], "pandas_version": "1.5.2", "partition_columns": []}
message schema {
  OPTIONAL INT64 a;
  OPTIONAL BYTE_ARRAY b (UTF8);
  OPTIONAL BYTE_ARRAY c (JSON);
  OPTIONAL BYTE_ARRAY e (JSON);
}
Arrow doesn't have a JSON type and so doesn't infer the JSON columns as StringArray; consequently you end up with BinaryArray, which cannot be written to JSON, as JSON doesn't support arbitrary binary data. This is why you then get an error.
If, however, you write the data with pyarrow, it produces the "correct" schema for the data:
df.to_parquet("test.parquet")
$ cargo run --bin parquet-schema --features cli -- --file-path ~/test.parquet
Metadata for file: /home/raphael/test.parquet
version: 2
num of rows: 1
created by: parquet-cpp-arrow version 10.0.1
metadata:
pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 1, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "b", "field_name": "b", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "c", "field_name": "c", "pandas_type": "object", "numpy_type": "object", "metadata": null}, {"name": "e", "field_name": "e", "pandas_type": "list[int64]", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "10.0.1"}, "pandas_version": "1.5.2"}
ARROW:schema: <REDACTED>
message schema {
  OPTIONAL INT64 a;
  OPTIONAL BYTE_ARRAY b (STRING);
  OPTIONAL group c {
    OPTIONAL BYTE_ARRAY d (STRING);
  }
  OPTIONAL group e (LIST) {
    REPEATED group list {
      OPTIONAL INT64 item;
    }
  }
}
Note how this has correctly preserved the structure, instead of flattening everything to JSON, which is catastrophic from a performance, compression and portability perspective. It also has an embedded Arrow schema (which I've redacted as it is massive) to ensure things like timezones are correctly preserved. I would strongly counsel writing the data using pyarrow instead of fastparquet, especially if the intention is to interface with other components in the Arrow ecosystem.
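For example, assuming df is the DataFrame from the reproducer, the pyarrow writer can be selected explicitly (a minimal sketch; pandas already defaults to pyarrow when it is installed):

import pandas as pd

# Explicitly select the pyarrow engine, even if fastparquet is installed.
df.to_parquet("test.parquet", engine="pyarrow")

# Round-trip check: nested columns come back as nested Python objects rather than JSON strings.
print(pd.read_parquet("test.parquet", engine="pyarrow"))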
Separately, I will tweak the schema inference so that it infers JSON columns as UTF-8 data, i.e. StringArray.
Thanks for the quick feedback. I'll take a look at using pyarrow instead of fastparquet
I was facing the same problem a few days back and came here to create an issue, then found this.
To confirm, I loaded the original file into pandas and then saved to another one using pyarrow as the engine this time and the problem was gone.
The issue is that our dataset is quite large (hundreds of GBs of parquet), and it'll be a daunting task to reload everything. What should I do to handle this issue?
> What should I do to handle this issue
If you aren't able to rewrite the data, another option might be to read the parquet data and then feed the JSON columns into RawDecoder. It isn't the nicest solution, but we have very limited support for JSON-encoded data within arrays at the moment.
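If rewriting does turn out to be feasible, a minimal sketch of the per-file rewrite described above (directory names are illustrative, and this assumes fastparquet decodes its JSON-encoded object columns back into Python objects on read):

import glob
import os
import pandas as pd

SRC = "dataset"      # hypothetical input directory
DST = "rewritten"    # hypothetical output directory

for src_path in glob.glob(os.path.join(SRC, "**", "*.parquet"), recursive=True):
    # fastparquet decodes its JSON-encoded object columns back into Python objects.
    df = pd.read_parquet(src_path, engine="fastparquet")

    dst_path = os.path.join(DST, os.path.relpath(src_path, SRC))
    os.makedirs(os.path.dirname(dst_path), exist_ok=True)

    # pyarrow writes proper nested types and embeds an Arrow schema in the file metadata.
    df.to_parquet(dst_path, engine="pyarrow")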
Oddly enough, what we're doing is not writing. It's happening when I call record_batches_to_json_rows after collecting the result as a Vec<RecordBatch>. And the specific field in question isn't JSON data, but actual binary data.
> specific field in question isn't JSON data, but actual binary data
Binary data cannot be represented in JSON, only UTF-8 encoded data. I'm not sure why writing the file with pyarrow would change this, unless it is marking the field as UTF-8.
Perhaps you could share the arrow schema of the fastparquet vs pyarrow files?
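For reference, one way to dump both schemas for comparison is via pyarrow (the file names below are placeholders):

import pyarrow.parquet as pq

# Placeholder names for the fastparquet- and pyarrow-written variants of the same data.
for name in ("fastparquet_file.parquet", "pyarrow_file.parquet"):
    print(name)
    print(pq.read_schema(name))           # Arrow schema, as pyarrow infers it
    print(pq.ParquetFile(name).schema)    # underlying Parquet schema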
I believe this is closed by #3376, feel free to reopen if I am mistaken
Goal
arrow-json should be able to load parquet files output from Python pandas with no explicit dtypes.
Use case
Given the following python code:
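A minimal sketch of such a script, assuming a DataFrame shaped like the fastparquet schema shown above (the column names match that schema; the values are illustrative):

import pandas as pd

# Illustrative values; the column shapes match the fastparquet schema shown above.
df = pd.DataFrame({
    "a": [1],                # int64
    "b": ["some text"],      # string
    "c": [{"d": "nested"}],  # dict, written by fastparquet as a JSON-encoded byte array
    "e": [[1, 2, 3]],        # list of ints, also JSON-encoded by fastparquet
})
df.to_parquet("test.parquet", engine="fastparquet")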
This outputs:
The types aren't great, but it can write and the file is loaded. ✅
Using the VS Code parquet-viewer plugin (TypeScript), we can see the loaded data:
The TypeScript/JavaScript implementation is able to load the file. ✅
However, when I try to load this using arrow-json, I see the following error:
The schema as arrow-rs knows it:
I don't know what the parquet spec says here, but basic files are loadable from other implementations, and being able to read files output from pandas must surely be a significant use case.
Related tickets / PRs:
Related ticket: https://github.com/apache/arrow-rs/issues/154. BinaryArray doesn't exist (anymore?) as I only see Binary as a DataType and BYTE_ARRAY in the schema output, so I wasn't sure if this was the same issue.
There was a previous PR for the above ticket: https://github.com/apache/arrow/pull/8971, which was closed. It looks like this also would have failed to do 'the right thing'.