aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

Gracefully handle malformed fields with trailing bytes in the data #413

Closed mukunku closed 7 months ago

mukunku commented 9 months ago

Summary

One of my users recently shared a bug where they couldn't read INT64 columns exported from Oracle: https://github.com/mukunku/ParquetViewer/issues/81

The error being:

Destination is too short. (Parameter 'destination')
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data)
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead)
   at Parquet.File.DataColumnReader.ReadColumn(Span`1 src, Encoding encoding, Int64 totalValuesInChunk, Int32 totalValuesInPage, PackedColumn pc)
   at Parquet.File.DataColumnReader.ReadDataPageV1Async(PageHeader ph, PackedColumn pc)
   at Parquet.File.DataColumnReader.ReadAsync(CancellationToken cancellationToken)
   at ParquetViewer.Engine.ParquetEngine.ReadPrimitiveField(DataTable dataTable, ParquetRowGroupReader groupReader, Int32 rowBeginIndex, ParquetSchemaElement field, Int64 skipRecords, Int64 readRecords, Boolean isFirstColumn, Dictionary`2 rowLookupCache, CancellationToken cancellationToken, IProgress`1 progress)
   at ParquetViewer.Engine.ParquetEngine.ProcessRowGroup(DataTable dataTable, ParquetRowGroupReader groupReader, Int64 skipRecords, Int64 readRecords, CancellationTok

When I first investigated this issue, I concluded that the file must be malformed. Since then I've been monitoring for this exception, and I've noticed a few more users hitting the same error.

I also noticed that other libraries don't seem to have any trouble opening this file. So even though I think the file is malformed, would it still be worth processing such fields gracefully so parquet-dotnet doesn't fall behind the competition? If other libraries can read a malformed file and parquet-dotnet can't, people may end up preferring those libraries over this one.
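To make the idea concrete, here is a minimal, language-neutral sketch (in Python for brevity) of the difference between strict and lenient decoding of PLAIN-encoded INT64 values, which are stored as 8-byte little-endian integers. It assumes the extra bytes sit after the declared values, as the issue title suggests; I haven't confirmed that this is the exact layout of the Oracle file. The function names `decode_int64_strict` and `decode_int64_lenient` are hypothetical and are not part of parquet-dotnet's API.

```python
import struct

INT64_SIZE = 8  # a PLAIN-encoded INT64 value is 8 bytes, little-endian


def decode_int64_strict(source: bytes, count: int) -> list:
    """Hypothetical strict decoder: the buffer must hold exactly `count` values."""
    expected = count * INT64_SIZE
    if len(source) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(source)}")
    return list(struct.unpack(f"<{count}q", source))


def decode_int64_lenient(source: bytes, count: int) -> list:
    """Hypothetical lenient decoder: read `count` values, ignore trailing bytes."""
    needed = count * INT64_SIZE
    if len(source) < needed:
        # Too little data is still a real error; only surplus bytes are tolerated.
        raise ValueError(f"need {needed} bytes, got {len(source)}")
    return list(struct.unpack_from(f"<{count}q", source, 0))


# Three declared values followed by four stray bytes.
page = struct.pack("<3q", 1, -2, 3) + b"\x00" * 4
values = decode_int64_lenient(page, 3)
```

With this input the lenient variant returns the three declared values, while the strict variant raises because the buffer is four bytes longer than the declared count implies. Truncated data is rejected in both variants, so the leniency only covers surplus bytes.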

However, if this PR doesn't make sense, I'm happy to close it out. I just wanted to get your opinion.