wiedld opened 1 month ago
take
Also, a proposed follow-up work item (to handle the general case):
Consider whether we have a testing gap for ParquetSink:
BTW I think you can write a test for the embedded schema in a parquet file using the describe command:
DataFusion CLI v40.0.0
> copy (values (1)) to '/tmp/foo.parquet';
+-------+
| count |
+-------+
| 1 |
+-------+
1 row(s) fetched.
Elapsed 0.051 seconds.
> describe '/tmp/foo.parquet';
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| column1 | Int64 | YES |
+-------------+-----------+-------------+
1 row(s) fetched.
Perhaps we could extend some of the tests in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/copy.slt with describe as a way to verify your fix.
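A sketch of what such an addition to copy.slt might look like, following the sqllogictest conventions used in that file (the scratch path, file name, and exact expected rows here are assumptions, not taken from the actual test file):

```
# Copy a single value to parquet, then check the schema DataFusion
# reads back from the file via describe.
query I
COPY (VALUES (1)) TO 'test_files/scratch/copy/schema_check.parquet';
----
1

query TTT
DESCRIBE 'test_files/scratch/copy/schema_check.parquet';
----
column1 Int64 YES
```

If the arrow schema metadata is embedded correctly, the data type reported by describe should round-trip exactly rather than being re-inferred from the physical parquet types.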
Describe the bug
We have been using two parquet writers: ArrowWriter and ParquetSink (which parallelizes writes). We discovered a bug: ArrowWriter includes the arrow schema in the parquet metadata on write (by default), whereas DataFusion's ParquetSink does not include the arrow schema in the file metadata (i.e. it's missing). This missing arrow schema metadata is important, as its inclusion aids later reading.
To Reproduce
let file_metadata: FileMetaData = /* get from file per API */;
let arrow_schema = parquet_to_arrow_schema(
    file_metadata.schema_descr(),
    file_metadata.key_value_metadata(),
);
Expected behavior
Parquet files written by ParquetSink should have the same default behavior as ArrowWriter: include the arrow schema in the parquet metadata.
Additional context
No response