apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Error: `missing required field ColumnIndex.null_pages` when loading page indexes #6464

Open alamb opened 4 days ago

alamb commented 4 days ago

Describe the bug If ParquetMetaDataReader is asked to load page indexes from metadata that ParquetMetaDataWriter wrote from a ParquetMetaData read without its page indexes, it fails with an error like "missing required field ColumnIndex.null_pages".

Note: this depends on https://github.com/apache/arrow-rs/pull/6463

To Reproduce The full reproducer is in https://github.com/apache/arrow-rs/pull/6463. Here is the relevant piece:

        let parquet_bytes = create_parquet_file();

        // read the metadata from the file WITHOUT the page index structures
        let original_metadata = ParquetMetaDataReader::new()
            .parse_and_finish(&parquet_bytes)
            .unwrap();

        // read metadata back from the serialized bytes requesting to read the offsets
        let metadata_bytes = metadata_to_bytes(&original_metadata);
        let roundtrip_metadata = ParquetMetaDataReader::new()
            .with_page_indexes(true) // there are no page indexes in the metadata
            .parse_and_finish(&metadata_bytes)
            .unwrap(); // <******* This fails

Expected behavior The reader should not error

I am not sure if the right fix is to

  1. change the ParquetMetaDataWriter to clear the index offset fields before writing them
  2. change the ParquetMetaDataReader to ignore bogus offsets
  3. Something else
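
To make option 2 concrete, here is a minimal sketch of what "ignore bogus offsets" could look like as a bounds check. `IndexLocation` and `usable_range` are hypothetical names made up for illustration, not types or functions from the parquet crate, and the real reader may need a different rule.

```rust
use std::convert::TryFrom;
use std::ops::Range;

/// Hypothetical stand-in for the (offset, length) pair that locates a
/// column index or offset index. NOT a parquet crate type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IndexLocation {
    pub offset: Option<i64>,
    pub length: Option<i32>,
}

/// Returns the byte range of the index only when both fields are set,
/// non-negative, non-empty, and the range fits inside a buffer of
/// `buffer_len` bytes. Anything else is treated as "no index".
pub fn usable_range(loc: IndexLocation, buffer_len: usize) -> Option<Range<usize>> {
    let start = usize::try_from(loc.offset?).ok()?;
    let len = usize::try_from(loc.length?).ok()?;
    let end = start.checked_add(len)?;
    if len == 0 || end > buffer_len {
        return None;
    }
    Some(start..end)
}
```

A range check like this only catches offsets that fall outside the buffer. Stale offsets that happen to land inside it would still be decoded as garbage, so this alone may not be enough.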

Additional context @etseidl has added the APIs in https://github.com/apache/arrow-rs/pull/6431

alamb commented 4 days ago

@etseidl predicted this error in https://github.com/apache/arrow-rs/pull/6081/files#r1774020124

I wonder if the metadata writer needs to modify the page index offsets/lengths in the ColumnMetaData if the indexes are not present in the ParquetMetaData. Then again, I could see wanting to preserve the page index offsets of the original file if you only want to save the footer metadata externally...perhaps an option on the metadata writer to preserve page index offsets if desired?
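
A rough sketch of what such a writer option might look like. Everything here is hypothetical: `ColumnChunkMeta`, `PageIndexOffsets`, and `prepare_for_write` are illustration-only names, not the parquet crate's API, and the four fields are assumed to mirror the optional index offset/length fields of a column chunk.

```rust
/// Hypothetical, simplified stand-in for the index-related fields of a
/// column chunk's metadata. NOT the parquet crate's type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ColumnChunkMeta {
    pub column_index_offset: Option<i64>,
    pub column_index_length: Option<i32>,
    pub offset_index_offset: Option<i64>,
    pub offset_index_length: Option<i32>,
}

/// Hypothetical writer option: what to do with page index offsets when
/// the in-memory metadata carries no page indexes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageIndexOffsets {
    /// Drop the offsets so a later read does not chase them.
    Clear,
    /// Keep the offsets of the original file, e.g. for an external footer.
    Preserve,
}

/// Returns the column metadata as it would be serialized. When indexes
/// are present the input is left untouched here (a real writer would
/// presumably rewrite the offsets to point at the indexes it writes).
pub fn prepare_for_write(
    columns: &[ColumnChunkMeta],
    indexes_present: bool,
    policy: PageIndexOffsets,
) -> Vec<ColumnChunkMeta> {
    columns
        .iter()
        .cloned()
        .map(|mut col| {
            if !indexes_present && policy == PageIndexOffsets::Clear {
                col.column_index_offset = None;
                col.column_index_length = None;
                col.offset_index_offset = None;
                col.offset_index_length = None;
            }
            col
        })
        .collect()
}
```

With `Clear` as the default, the round trip in the reproducer would see no offsets and have nothing to load, while `Preserve` would keep the original file's offsets for callers who store only the footer externally.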