Arrow schema is missing from the parquet metadata for files written by ParquetSink

wiedld commented 1 month ago

Describe the bug

We have been using two parquet writers: ArrowWriter and ParquetSink (parallelized writes). We discovered that ArrowWriter includes the arrow schema in the parquet metadata on write (by default), whereas DataFusion's ParquetSink does not include the arrow schema in the file metadata. The missing arrow schema metadata matters because its inclusion aids later reading.

To Reproduce

1. Write parquet with ParquetSink.
2. Write parquet with ArrowWriter (default options).
3. Attempt to read the arrow schema from the parquet metadata, using the APIs below:

let file_metadata: FileMetaData = <get from file per API>;

let arrow_schema = parquet_to_arrow_schema(
    file_metadata.schema_descr(),
    file_metadata.key_value_metadata(),
);

An error is returned for parquet written by ParquetSink.
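
A narrower check than converting the whole schema is to look for the embedded schema entry directly. The sketch below assumes the arrow-rs convention of storing the serialized schema under the `ARROW:schema` key in the parquet key-value metadata. To stay dependency-free it uses a local `KeyValue` struct as a stand-in for the parquet crate's type of the same name, and `has_embedded_arrow_schema` is a hypothetical helper, not an existing API.

```rust
/// Key under which arrow-rs writers store the serialized Arrow schema.
const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

/// Local stand-in for the parquet crate's `KeyValue` (assumed to have this shape).
#[derive(Debug, Clone)]
struct KeyValue {
    key: String,
    value: Option<String>,
}

/// Hypothetical helper: returns true when the key-value metadata contains
/// a non-empty `ARROW:schema` entry.
fn has_embedded_arrow_schema(kv: Option<&Vec<KeyValue>>) -> bool {
    kv.map_or(false, |entries| {
        entries.iter().any(|entry| {
            entry.key == ARROW_SCHEMA_META_KEY
                && entry.value.as_deref().map_or(false, |v| !v.is_empty())
        })
    })
}
```

With the real crate, the argument would come from `file_metadata.key_value_metadata()`. Per this report, the check should pass for files written by ArrowWriter under defaults and fail for files written by ParquetSink.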

Expected behavior

Parquet written by ParquetSink should have the same default behavior (to include the arrow schema in the parquet metadata) as the ArrowWriter.

Additional context

No response

wiedld commented 1 month ago

take

wiedld commented 1 month ago

Also, proposed follow-up work (to handle the general case):

Consider whether we have a testing gap for ParquetSink:

- Do we need more end-to-end tests which ensure that the parquet encoders all encode uniformly?
- e.g. parquet encoded by either ArrowWriter (under defaults) or ParquetSink (under defaults) should be identical.
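
One piece of such a uniformity test could compare the metadata keys each writer produces. The following is only a sketch: `missing_metadata_keys` is a hypothetical helper, and it assumes the key names have already been extracted from each file's key-value metadata.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: keys present in `reference` (e.g. an ArrowWriter file)
/// but absent from `candidate` (e.g. a ParquetSink file), in sorted order.
fn missing_metadata_keys(reference: &[&str], candidate: &[&str]) -> Vec<String> {
    let want: BTreeSet<&str> = reference.iter().copied().collect();
    let have: BTreeSet<&str> = candidate.iter().copied().collect();
    want.difference(&have).map(|key| key.to_string()).collect()
}
```

A test would assert that the result is empty in both directions. Comparing keys alone does not prove the files are identical, so a full check would also need to compare values and the encoded data.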

alamb commented 1 month ago

BTW I think you can write a test for the embedded schema in a parquet file using the describe command

DataFusion CLI v40.0.0
> copy (values (1)) to '/tmp/foo.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.051 seconds.

> describe '/tmp/foo.parquet';
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| column1     | Int64     | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.

Perhaps we could extend some of the tests in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/copy.slt with describe as a way to verify your fix
