apache / arrow-rs


Handle BYTE_ARRAY physical type in arrow-json (be able to load files output from pandas with no dtypes) #3373

Closed: ehiggs closed this 1 year ago

ehiggs commented 1 year ago

Goal

arrow-json should be able to load parquet files output from Python pandas with no dtypes.

Use case

Given the following python code:

import pandas as pd

data = '[{"a": 1, "b": "Hello", "c": {"d": "something"}, "e": [1,2,3]}]'
# read the JSON without coercing dtypes, then write it with fastparquet,
# JSON-encoding the nested object/list columns
df = pd.read_json(data, dtype=False, orient='record')
df.to_parquet("test.parquet", engine="fastparquet", object_encoding="json", stats=False)
# read it back to check the round trip
df2 = pd.read_parquet("test.parquet", engine="fastparquet")
print(df2)
print(df2.dtypes)

This outputs:

   a      b                   c          e
0  1  Hello  {'d': 'something'}  [1, 2, 3]
a     int64
b    object
c    object
e    object
dtype: object

The types aren't great, but the file can be written and loaded. ✅

Using the VS Code parquet-viewer plugin (TypeScript) we can see the loaded data:

[screenshot: parquet-viewer showing the row loaded from test.parquet]

The TypeScript/JavaScript implementation is able to load the file. ✅

However, when I try to load this using arrow-json, I see the following error:

use futures::TryStreamExt;
use parquet::arrow::async_reader::{AsyncFileReader, ParquetRecordBatchStreamBuilder};
use arrow::json::LineDelimitedWriter;

async fn parquet_to_json<T>(data: T) where T: AsyncFileReader + Send + Unpin + 'static {

    let builder = ParquetRecordBatchStreamBuilder::new(data)
        .await
        .unwrap()
        .with_batch_size(3);
    let file_metadata = builder.metadata().file_metadata();
    println!("schema: {:?}", file_metadata.schema_descr());

    // collect all record batches from the async parquet reader
    let stream = builder.build().unwrap();
    let results = stream.try_collect::<Vec<_>>().await.unwrap();

    // re-serialize the batches as line-delimited JSON
    let mut out_buf = Vec::new();
    let mut writer = LineDelimitedWriter::new(&mut out_buf);
    writer
        .write_batches(&results)
        .expect("could not write batches");
    let json_out = String::from_utf8_lossy(&out_buf);
    println!("result: {}", json_out);
}
thread 'main' panicked at 'could not write batches: JsonError("data type Binary not supported in nested map for json writer")'

The schema as arrow-rs knows it:

schema: SchemaDescriptor { schema: GroupType { basic_info: BasicTypeInfo { name: "schema", repetition: None, converted_type: NONE, logical_type: None, id: None }, fields: [PrimitiveType { basic_info: BasicTypeInfo { name: "a", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: 64, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "b", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "c", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "e", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }] } }

I don't know what the parquet spec says here, but basic files like this are loadable by other implementations, and being able to read files output from pandas is surely a significant use case.

Related tickets / PRs:

Related ticket: https://github.com/apache/arrow-rs/issues/154. BinaryArray doesn't seem to exist (anymore?), as I only see Binary as a DataType and BYTE_ARRAY in the schema output, so I wasn't sure if this was the same issue.

There was a previous PR for the above ticket: https://github.com/apache/arrow/pull/8971, which was closed. It looks like it also would have failed to do 'the right thing'.

ehiggs commented 1 year ago

Related: #3291

tustvold commented 1 year ago

Thank you for the report, especially thank you for the reproducer. However, I think the issue here is slightly different from what you have diagnosed.

When run, your reproducer writes a parquet file with the following schema:

$ cargo run --bin parquet-schema --features cli -- --file-path ~/test.parquet
Metadata for file: /home/raphael/test.parquet

version: 1
num of rows: 1
created by: fastparquet-python version 2022.12.0 (build 0)
metadata:
  pandas: {"column_indexes": [{"field_name": null, "metadata": null, "name": null, "numpy_type": "object", "pandas_type": "mixed-integer"}], "columns": [{"field_name": "a", "metadata": null, "name": "a", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "b", "metadata": null, "name": "b", "numpy_type": "object", "pandas_type": "unicode"}, {"field_name": "c", "metadata": null, "name": "c", "numpy_type": "object", "pandas_type": "mixed"}, {"field_name": "e", "metadata": null, "name": "e", "numpy_type": "object", "pandas_type": "mixed"}], "creator": {"library": "fastparquet", "version": "2022.12.0"}, "index_columns": [{"kind": "range", "name": null, "start": 0, "step": 1, "stop": 1}], "pandas_version": "1.5.2", "partition_columns": []}
message schema {
  OPTIONAL INT64 a;
  OPTIONAL BYTE_ARRAY b (UTF8);
  OPTIONAL BYTE_ARRAY c (JSON);
  OPTIONAL BYTE_ARRAY e (JSON);
}

Arrow doesn't have a JSON type and so doesn't infer the JSON columns as StringArray; consequently you end up with BinaryArray, which cannot be written to JSON, as JSON doesn't support arbitrary binary data. This is why you get the error.
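
As a workaround on the read side, if you know the Binary columns really contain UTF-8 JSON text, you could cast them to Utf8 before handing the batches to the JSON writer. A rough, untested sketch (the helper name is mine; it assumes the bytes are valid UTF-8):

use std::sync::Arc;

use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result as ArrowResult;
use arrow::record_batch::RecordBatch;

// Cast every Binary column to Utf8 so the JSON writer will accept it.
// Only valid if the binary payloads are UTF-8, as the JSON-encoded
// columns written by fastparquet are.
fn binary_columns_to_utf8(batch: &RecordBatch) -> ArrowResult<RecordBatch> {
    let mut fields = Vec::with_capacity(batch.num_columns());
    let mut columns = Vec::with_capacity(batch.num_columns());

    for (field, column) in batch.schema().fields().iter().zip(batch.columns()) {
        if field.data_type() == &DataType::Binary {
            columns.push(cast(column, &DataType::Utf8)?);
            fields.push(Field::new(field.name(), DataType::Utf8, field.is_nullable()));
        } else {
            columns.push(column.clone());
            fields.push(field.clone());
        }
    }

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}

You would map this over the collected batches before calling write_batches.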

If, however, you write the data with pyarrow, it produces the "correct" schema for the data:

df.to_parquet("test.parquet")
$ cargo run --bin parquet-schema --features cli -- --file-path ~/test.parquet
Metadata for file: /home/raphael/test.parquet

version: 2
num of rows: 1
created by: parquet-cpp-arrow version 10.0.1
metadata:
  pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 1, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "b", "field_name": "b", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "c", "field_name": "c", "pandas_type": "object", "numpy_type": "object", "metadata": null}, {"name": "e", "field_name": "e", "pandas_type": "list[int64]", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "10.0.1"}, "pandas_version": "1.5.2"}
  ARROW:schema:  <REDACTED>
message schema {
  OPTIONAL INT64 a;
  OPTIONAL BYTE_ARRAY b (STRING);
  OPTIONAL group c {
    OPTIONAL BYTE_ARRAY d (STRING);
  }
  OPTIONAL group e (LIST) {
    REPEATED group list {
      OPTIONAL INT64 item;
    }
  }
}

Note how this has correctly preserved the structure, instead of flattening everything to JSON, which is catastrophic from a performance, compression and portability perspective. It also has an embedded arrow schema (which I've redacted as it is massive) to ensure things like timezones, etc. are correctly preserved. I would strongly counsel writing the data using pyarrow instead of fastparquet, especially if the intention is to interface with other components in the arrow ecosystem.

Separately, I will tweak the schema inference so that it infers JSON columns as UTF-8 data, i.e. StringArray.

ehiggs commented 1 year ago

Thanks for the quick feedback. I'll take a look at using pyarrow instead of fastparquet.

ByteBaker commented 1 year ago

I was facing the same problem a few days back and came here to create an issue, then found this.

To confirm, I loaded the original file into pandas and saved it to another file, this time using pyarrow as the engine, and the problem was gone.

The issue is that our dataset is quite large (hundreds of GBs of parquet), and it would be a daunting task to rewrite everything. What should I do to handle this issue?

tustvold commented 1 year ago

What should I do to handle this issue

If you aren't able to rewrite the data, another option might be to read the parquet data and then feed the JSON columns into RawDecoder. It isn't the nicest solution, but we have very limited support for JSON-encoded data within arrays at the moment.
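
Something along these lines might work for the JSON-typed columns (an untested sketch; the import paths and method names are the ones from the release that introduced RawDecoder, so they may need adjusting for your version, and the helper is hypothetical):

use std::sync::Arc;

use arrow_array::{RecordBatch, StringArray};
use arrow_json::RawReaderBuilder;
use arrow_schema::{ArrowError, Schema};

// Decode a column of JSON documents (one document per row) into a
// RecordBatch with the given target schema.
fn decode_json_column(
    json: &StringArray,
    target_schema: Arc<Schema>,
) -> Result<Option<RecordBatch>, ArrowError> {
    let mut decoder = RawReaderBuilder::new(target_schema).build_decoder()?;
    for document in json.iter().flatten() {
        // each row holds one JSON object, e.g. {"d": "something"}
        decoder.decode(document.as_bytes())?;
        decoder.decode(b"\n")?;
    }
    decoder.flush()
}

You would downcast each JSON column to StringArray (casting Binary to Utf8 first if needed) and pass a target schema describing the structure you expect back.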

ByteBaker commented 1 year ago

Oddly enough, in our case the problem isn't on the write side. It happens when I call record_batches_to_json_rows after collecting the result as a Vec<RecordBatch>. And the specific field in question isn't JSON data, but actual binary data.

tustvold commented 1 year ago

specific field in question isn't JSON data, but actual binary data

Binary data cannot be represented in JSON, only UTF-8 encoded data. I'm not sure why writing the file with pyarrow would change this, unless it is marking the field as UTF-8.
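
If the column really does contain arbitrary bytes, you would need to encode it into a string form yourself (hex, base64, ...) before the JSON writer can emit it. A rough sketch using hex (untested; the helper name is mine):

use std::sync::Arc;

use arrow::array::{ArrayRef, BinaryArray, StringArray};

// Hex-encode a Binary column into a Utf8 column so the JSON writer can emit it.
// Lossless, but consumers have to know to decode it again.
fn hex_encode(binary: &BinaryArray) -> ArrayRef {
    let strings: StringArray = binary
        .iter()
        .map(|value| {
            value.map(|bytes| bytes.iter().map(|b| format!("{:02x}", b)).collect::<String>())
        })
        .collect();
    Arc::new(strings)
}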

Perhaps you could share the arrow schema of the fastparquet vs pyarrow files?
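
If it helps, something like this prints the Arrow schema the reader infers for a file (sync reader, untested sketch):

use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn print_arrow_schema(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    // the Arrow schema the reader decodes into, derived from the parquet
    // schema (or from the embedded ARROW:schema metadata when present)
    println!("{:#?}", builder.schema());
    Ok(())
}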

tustvold commented 1 year ago

I believe this is closed by #3376; feel free to reopen if I am mistaken.