apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.91k stars 1.12k forks source link

Arrow schema is missing from the parquet metadata, for files written by ParquetSink. #11770

Open wiedld opened 1 month ago

wiedld commented 1 month ago

Describe the bug

We have been using two parquet writers: ArrowWriter vs ParquetSink (parallelized writes). We discovered a bug where the ArrowWriter includes the arrow schema (by default) in the parquet metadata on write. Whereas datafusion's ParquetSink does not include the arrow schema in the file metadata (a.k.a. it's missing here). This missing arrow schema metadata is important, as it's inclusion aids with later reading.

To Reproduce

  1. Write parquet with ParquetSink.
  2. Write parquet with ArrowWriter (default options).
  3. Attempt to read the arrow schema from the parquet metadata, using the below/linked APIs:

let file_metadata: FileMetadata = <get from file per API>;

let arrow_schema = parquet_to_arrow_schema( file_metadata.schema_descr(), file_metadata.key_value_metadata(), );

  1. An error is returned for parquet written by ParquetSink.

Expected behavior

Parquet written by ParquetSink should have the same default behavior (to include the arrow schema in the parquet metadata) as the ArrowWriter.

Additional context

No response

wiedld commented 1 month ago

take

wiedld commented 1 month ago

Also, a proposed followup work (to handle the general case):

Consider whether we have a testing gap for ParquetSink:

alamb commented 1 month ago

BTW I think you can write a test for the embedded schema in a parquet file using the describe command

DataFusion CLI v40.0.0
> copy (values (1)) to '/tmp/foo.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.051 seconds.

> describe '/tmp/foo.parquet';
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| column1     | Int64     | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.

Perhaps we could extend some of the tests in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/copy.slt with describe as a way to verify your fix