deephaven / deephaven-core

Deephaven Community Core

Unable to read DuckDB files with offset errors #3651

Open devinrsmith opened 1 year ago

devinrsmith commented 1 year ago

We are unable to read in a parquet file that was created via DuckDB; polars, clickhouse, datafusion, pyspark, and pyarrow are all able to read this file. Upon further investigation, it appears to be a DuckDB writing issue. See the linked DuckDB issue.

shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@3c92c520
        at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1114)
        at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1025)
        at org.apache.parquet.format.PageHeader.read(PageHeader.java:902)
        at org.apache.parquet.format.Util.read(Util.java:363)
        at org.apache.parquet.format.Util.readPageHeader(Util.java:133)
        at org.apache.parquet.format.Util.readPageHeader(Util.java:128)
        at io.deephaven.parquet.base.ColumnChunkReaderImpl$ColumnPageReaderIteratorImpl.next(ColumnChunkReaderImpl.java:227)
        at io.deephaven.parquet.base.ColumnChunkReaderImpl$ColumnPageReaderIteratorImpl.next(ColumnChunkReaderImpl.java:195)
        at io.deephaven.parquet.table.pagestore.VariablePageSizeColumnChunkPageStore.extendOnePage(VariablePageSizeColumnChunkPageStore.java:69)
        at io.deephaven.parquet.table.pagestore.VariablePageSizeColumnChunkPageStore.fillToRow(VariablePageSizeColumnChunkPageStore.java:104)
        at io.deephaven.parquet.table.pagestore.VariablePageSizeColumnChunkPageStore.getPageContaining(VariablePageSizeColumnChunkPageStore.java:152)
        at io.deephaven.parquet.table.pagestore.VariablePageSizeColumnChunkPageStore.getPageContaining(VariablePageSizeColumnChunkPageStore.java:20)
        at io.deephaven.engine.page.PageStore.fillChunk(PageStore.java:75)
        at io.deephaven.parquet.table.region.ParquetColumnRegionBase.fillChunk(ParquetColumnRegionBase.java:50)

There may be a small example file later; in the meantime, the files were created via DuckDB:

CREATE TABLE lineitem AS SELECT * FROM 'lineitemsf1.snappy.parquet'
import duckdb
import pathlib

sf = 10

for x in range(0, sf):
    con = duckdb.connect()
    con.sql('PRAGMA disable_progress_bar; SET preserve_insertion_order=false')
    con.sql(f"CALL dbgen(sf={sf}, children={sf}, step={x})")
    for tbl in ['lineitem']:
        pathlib.Path(f'./{tbl}').mkdir(parents=True, exist_ok=True)
        con.sql(f"COPY (SELECT * FROM {tbl}) TO './{tbl}/{x}.parquet'")
    con.close()

https://colab.research.google.com/drive/1pfAPpIG7jpvGB_aHj-PXX66vRaRT0xlj#scrollTo=eVV-nZ_THdVx
https://bwlewis.github.io/duckdb_and_r/tpch/tpch.html
https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet

I'll attach a simpler reproduction setup or small file if I'm able to produce one.

devinrsmith commented 1 year ago

Here's a minimal example - essentially, one of the tables above limited to 2 rows (limiting to 1 row did not reproduce the issue).

0.parquet.zip

Note: the above file is NOT a zip file (GH does not allow .parquet extensions). Remove the .zip extension to use.

devinrsmith commented 1 year ago

In the case of the example file, the code errors out in the exact same place but complains about a different field (which makes sense if the binary isn't what we think it is...):

 Required field 'compressed_page_size' was not found in serialized data!
devinrsmith commented 1 year ago

It seems like our usage of the parquet tooling, or some of the files we've modified in that effort, is at issue. The same Java code via https://github.com/apache/parquet-mr/tree/master/parquet-cli (both with 1.12.3 and newer) is able to read this file successfully.

devinrsmith commented 10 months ago

In similar cases, I'm getting:

Caused by: java.lang.RuntimeException: Error reading page header                                                                                                                              
        at io.deephaven.parquet.base.ColumnChunkReaderImpl$ColumnPageReaderIteratorImpl.next(ColumnChunkReaderImpl.java:262)                                                                  
        at io.deephaven.parquet.base.ColumnChunkReaderImpl$ColumnPageReaderIteratorImpl.next(ColumnChunkReaderImpl.java:195)                                                                  
        at io.deephaven.parquet.table.pagestore.VariablePageSizeColumnChunkPageStore.extendOnePage(VariablePageSizeColumnChunkPageStore.java:69)                                              
        at io.deephaven.parquet.table.pagestore.VariablePageSizeColumnChunkPageStore.fillToRow(VariablePageSizeColumnChunkPageStore.java:104)                                                 
        at io.deephaven.parquet.table.pagestore.VariablePageSizeColumnChunkPageStore.getPageContaining(VariablePageSizeColumnChunkPageStore.java:152)                                         
        at io.deephaven.parquet.table.pagestore.VariablePageSizeColumnChunkPageStore.getPageContaining(VariablePageSizeColumnChunkPageStore.java:20)                                          
        at io.deephaven.engine.page.PageStore.fillChunk(PageStore.java:75)                                                                                                                    
        at io.deephaven.parquet.table.region.ParquetColumnRegionBase.fillChunk(ParquetColumnRegionBase.java:50)                                                                               
        at io.deephaven.engine.page.PageStore.fillChunk(PageStore.java:79)                                                                                                                    
        at io.deephaven.engine.page.PageStore.fillChunk(PageStore.java:79)                                                                                                                    
        at io.deephaven.engine.table.impl.sources.regioned.RegionedColumnSourceBase.fillChunk(RegionedColumnSourceBase.java:51)                                                               
        at io.deephaven.engine.table.impl.sources.regioned.RegionedColumnSourceObject$AsValues.fillChunk(RegionedColumnSourceObject.java:32)                                                  
        at io.deephaven.engine.table.impl.sources.RedirectedColumnSource$FillContext.doOrderedFillAscending(RedirectedColumnSource.java:807)                                                  
        at io.deephaven.engine.table.impl.sources.RedirectedColumnSource.doFillChunk(RedirectedColumnSource.java:520)                                                                         
        at io.deephaven.engine.table.impl.sources.RedirectedColumnSource.fillChunk(RedirectedColumnSource.java:498)                                                                           
        at io.deephaven.engine.table.impl.remote.ConstructSnapshot.getSnapshotDataAsChunkList(ConstructSnapshot.java:1624)                                                                    
        at io.deephaven.engine.table.impl.remote.ConstructSnapshot.serializeAllTable(ConstructSnapshot.java:1514)                                                                             
        at io.deephaven.engine.table.impl.remote.ConstructSnapshot.lambda$constructBackplaneSnapshotInPositionSpace$2(ConstructSnapshot.java:703)                                             
        at io.deephaven.engine.table.impl.remote.ConstructSnapshot.callDataSnapshotFunction(ConstructSnapshot.java:1243)                                                                      
        ... 15 more                                                                                                                                                                           
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 13                                                                             
        at org.apache.parquet.format.Util.read(Util.java:366)                                                                                                                                 
        at org.apache.parquet.format.Util.readPageHeader(Util.java:133)                                                                                                                       
        at org.apache.parquet.format.Util.readPageHeader(Util.java:128)                                                                                                                       
        at io.deephaven.parquet.base.ColumnChunkReaderImpl$ColumnPageReaderIteratorImpl.next(ColumnChunkReaderImpl.java:227)                                                                  
        ... 33 more                                                                                                                                                                           
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 13                                                                                             
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:899)                                                                                     
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readListBegin(TCompactProtocol.java:598)                                                                                
        at org.apache.parquet.format.InterningProtocol.readListBegin(InterningProtocol.java:171)                                                                                              
        at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:136)                                                                                               
        at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)                                                                                                
        at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1106)                                                                                           
        at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1025)                                                                                           
        at org.apache.parquet.format.PageHeader.read(PageHeader.java:902)                                                                                                                     
        at org.apache.parquet.format.Util.read(Util.java:363)                                                                                                                                 
        ... 36 more
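For context on that last cause: Thrift's compact protocol packs a 4-bit type id into the low nibble of field and list headers, and valid type ids only run 0 through 12. A decoded type of 13 therefore doesn't mean the file uses an unknown type; it means the reader is decoding from a misaligned byte position. A minimal stdlib sketch (the function name is hypothetical, not part of any Thrift library):

```python
def split_compact_field_header(byte: int) -> tuple[int, int]:
    """Split a Thrift compact-protocol short-form header byte.

    The high nibble is the field-id delta (or list size); the low nibble
    is the compact type id. Valid type ids are 0..12, so a decoded type
    of 13+ means the reader is not positioned at a real header.
    """
    return byte >> 4, byte & 0x0F

# A byte that happens to sit at a wrong offset can decode to type 13:
delta, ctype = split_compact_field_header(0x1D)
print(delta, ctype)  # 1 13
print(ctype > 12)    # True -> "don't know what type: 13"
```

This matches the symptom: the reader starts at the wrong offset, skips what it thinks are fields, and eventually hits a nibble it cannot interpret as a type.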
devinrsmith commented 4 months ago

This can be demonstrated with the DuckDB query:

COPY (
  SELECT
    CAST(generate_series % 32 AS STRING) AS X
  FROM
    generate_series(1000)
) TO 'bad-offsets.parquet' (FORMAT PARQUET);
malhotrashivam commented 4 months ago

This is what the page headers look like for this file

Start of chunk (rowGroup: 0, columnName: X, dictPageOffset: 4, dataPageOffset: 142, numValues: 1001, totalSize: 344)
Page 0. (offset: 4, headerSize: 15)
{
  "compressed_page_size" : 138,
  "dictionary_page_header" : {
    "encoding" : 0,
    "num_values" : 32
  },
  "type" : 2,
  "uncompressed_page_size" : 182
}
Page 1. (offset: 157, headerSize: 20)
{
  "compressed_page_size" : 171,
  "data_page_header" : {
    "definition_level_encoding" : 3,
    "encoding" : 8,
    "num_values" : 1001,
    "repetition_level_encoding" : 3
  },
  "type" : 0,
  "uncompressed_page_size" : 2010
}
End of chunk (offset: 347)

(The above is generated using https://github.com/apache/parquet-mr/tree/master/parquet-cli)

The dataPageOffset for this chunk is 142. Per the Parquet spec, this should be the "Byte offset from beginning of file to first data page". We have two pages in this chunk: first the dictionary page (note the dictionary_page_header) and second the data page (note the data_page_header). The offset of page 1, the first data page in the file, is 157 and not 142, which I think is a mistake. It looks like the generation code forgot to add the dictionary page's headerSize when calculating the dataPageOffset for this chunk.
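The miscomputation can be checked directly from the numbers in the dump above: the data page actually begins where the dictionary page ends (its offset plus its header plus its compressed bytes), while the recorded dataPageOffset looks like the same sum with the header size left out:

```python
# Values copied from the parquet-cli dump above.
dict_page_offset = 4      # "Page 0. (offset: 4, ...)"
dict_header_size = 15     # "headerSize: 15"
dict_compressed_size = 138  # "compressed_page_size" : 138

# Where the data page actually starts: end of the dictionary page's bytes.
actual_first_data_page = dict_page_offset + dict_header_size + dict_compressed_size
print(actual_first_data_page)               # 157, matches "Page 1. (offset: 157, ...)"

# The bad recorded value is exactly the same sum minus the header size.
print(dict_page_offset + dict_compressed_size)  # 142, the chunk's dataPageOffset
```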

Our code starts reading the data page from position 142 and then doesn't find some required fields in the header, since it's reinterpreting the wrong bytes. My guess is that other tools can read this file properly because they read the pages serially: first the dictionary page, then the data page. But our code goes directly by the offsets and reinterprets the wrong bytes.
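The serial strategy can be sketched in a few lines. This is a simplified illustration, not Deephaven's actual reader: `read_page_header` is a hypothetical stand-in for a real Thrift PageHeader parser, and the simulated headers are taken from the parquet-cli dump above.

```python
def iter_pages(read_page_header, chunk_start, chunk_end):
    # Walk the column chunk serially: each page starts exactly where the
    # previous page's header + compressed bytes end, so the (possibly wrong)
    # dataPageOffset from the chunk metadata is never consulted.
    pos = chunk_start
    while pos < chunk_end:
        header, header_size = read_page_header(pos)
        yield pos, header
        pos += header_size + header["compressed_page_size"]

# Simulated page headers from the dump: dictionary page at 4, data page at 157.
headers = {
    4:   ({"type": "dictionary", "compressed_page_size": 138}, 15),
    157: ({"type": "data", "compressed_page_size": 171}, 20),
}
offsets = [pos for pos, _ in iter_pages(headers.__getitem__,
                                        chunk_start=4, chunk_end=347)]
print(offsets)  # [4, 157] -- the data page is found at 157, not the recorded 142
```

A reader built this way never dereferences the bad offset 142, which would explain why serial readers tolerate these files while offset-driven readers do not.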