aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

[BUG]: Destination is too short. (Parameter 'destination') Error from 4.0.0 onwards for Specific Parquet Files #414

Closed notruilin closed 10 months ago

notruilin commented 11 months ago

Library Version

4.0.0

OS

Windows

OS Architecture

64 bit

How to reproduce?

Greetings!

I've come across a problem with reading a specific Parquet file using versions from 4.0.0 and onwards.

Destination is too short. (Parameter 'destination')
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data) in C:\dev\parquet-dotnet - Copy\src\Parquet\Encodings\ParquetPlainEncoder.cs:line 757
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead) in C:\dev\parquet-dotnet - Copy\src\Parquet\Encodings\ParquetPlainEncoder.cs:line 180
   at Parquet.File.DataColumnReader.ReadColumn(Span`1 src, Encoding encoding, Int64 totalValuesInChunk, Int32 totalValuesInPage, PackedColumn pc) in C:\dev\parquet-dotnet - Copy\src\Parquet\File\DataColumnReader.cs:line 271
   at Parquet.File.DataColumnReader.ReadDataPageV1Async(PageHeader ph, PackedColumn pc) in C:\dev\parquet-dotnet - Copy\src\Parquet\File\DataColumnReader.cs:line 196
   at Parquet.File.DataColumnReader.ReadAsync(CancellationToken cancellationToken) in C:\dev\parquet-dotnet - Copy\src\Parquet\File\DataColumnReader.cs:line 79
   at Parquet.ParquetReader.ReadEntireRowGroupAsync(Int32 rowGroupIndex) in C:\dev\parquet-dotnet - Copy\src\Parquet\ParquetReader.cs:line 151

Interestingly, this file was readable without issues in versions up to 3.10.0. Additionally, the file can be read successfully by other libraries, such as pyarrow in Python and org.apache.parquet, which suggests that the file itself is valid.

Due to internal constraints, I'm unable to share the sample file. But I've analyzed the problem and here's what I've found:

Source Data Reading: For PageHeader.Type equalling PageType.DATA_PAGE, the method Parquet.File.DataColumnReader.ReadDataPageV1Async calculates byte size based on ph.CompressedPageSize and ph.UncompressedPageSize.

Data Transfer from Source to Target: In Parquet.Encodings.ParquetPlainEncoder.Decode, the byte length is decided by _definedDataCount in PackedColumn, which refers to the row count in the metadata.

In my case, uncompressed_page_size is larger than the number of bytes implied by the actual row count, so the source span is longer than the destination.

In versions >= 4.0.0, as the screenshot below shows: image

In version 3.10.0, a protective measure exists: image
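
To make the mismatch described above concrete, here is a minimal, language-neutral sketch in Python. It is not Parquet.Net code: the function names, the sample page, and the clamping guard are all hypothetical, and the guard is only an assumption about what a protective measure could look like, since the screenshots are not preserved here. It shows a destination sized from the defined value count, a source sized from the page header, and the failure that follows when the source carries more bytes than the destination can hold.

```python
import struct

INT64_SIZE = 8  # bytes per PLAIN-encoded int64 value


def decode_plain_int64(source: bytes, dest: list) -> int:
    """Copy little-endian int64 values from source into dest.

    Fails when the source holds more values than dest has room for,
    mirroring the "Destination is too short" error in the report.
    """
    count = len(source) // INT64_SIZE
    if count > len(dest):
        raise ValueError("Destination is too short.")
    dest[:count] = struct.unpack(f"<{count}q", source[:count * INT64_SIZE])
    return count


def decode_plain_int64_guarded(source: bytes, dest: list) -> int:
    """Same copy, but first clamp the source to what dest can hold.

    This guard is an assumption for illustration, not the library's fix.
    """
    usable = min(len(source), len(dest) * INT64_SIZE)
    return decode_plain_int64(source[:usable], dest)


# Hypothetical page: 3 defined values, but the page buffer (sized from the
# page header) carries 8 extra trailing bytes.
page = struct.pack("<3q", 10, 20, 30) + b"\x00" * INT64_SIZE
defined_count = 3
```

With this sample page, `decode_plain_int64(page, [0] * defined_count)` raises, while the guarded variant fills the destination with the three defined values and ignores the trailing bytes. Whether silently ignoring surplus bytes is the right behaviour for a real reader is a separate design question.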

Questions:

Appreciate your assistance. Kindly let me know if more information is needed. Thanks!

Failing test

No response

murermader commented 11 months ago

I just wanted to add that I am experiencing the same issue with files created by pandas by using the function df.to_parquet(path, engine="fastparquet"). I am looking at these files using the program ParquetViewer, which crashes due to this error somewhere in Parquet.NET. The files can be read without problems in pandas or R, so I think the parquet files are not the problem.

This is the file (zipped) I am having issues with: df_fam_places.zip

Some info about the file:

andreamari1993 commented 11 months ago

I ran into the same issue with this package. I'm using version 4.16.4 on both Windows and Linux. The stack trace is below.

An error occurred: System.ArgumentException: Destination is too short. (Parameter 'destination')
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data)
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead)
   at Parquet.File.DataColumnReader.ReadColumn(Span`1 src, Encoding encoding, Int64 totalValuesInChunk, Int32 totalValuesInPage, PackedColumn pc)
   at Parquet.File.DataColumnReader.ReadDataPageV1Async(PageHeader ph, PackedColumn pc)
   at Parquet.File.DataColumnReader.ReadAsync(CancellationToken cancellationToken)
   at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync[T](ParquetReader reader, Int32 rgi, Assembler`1 asm, ICollection`1 result, CancellationToken cancellationToken)
   at Parquet.Serialization.ParquetSerializer.DeserializeAsync[T](Stream source, ParquetOptions options, CancellationToken cancellationToken)

Is there a workaround? Let me know if you need other details regarding the issue. Thanks, Andrea

aloneguid commented 10 months ago

This should be fixed in the latest release.

akaloshych84 commented 5 months ago

Doesn't seem to be fixed. I have a large file with a complex schema, and deserialization fails for int64 and double columns. When I comment out those class properties, deserialization works.

aloneguid commented 5 months ago

Then it's most probably a miscommunication. I'd recommend raising a new issue and including a failing unit test so we are 100% clear on what's not working.

akaloshych84 commented 5 months ago

Thank you so much for the quick reply!

I have a few issues with the deserializer. I submitted another one that is easier to reproduce (maybe related, maybe not). For this one I need to find a good reproducible test case. https://github.com/aloneguid/parquet-dotnet/issues/502