[BUG]: Destination is too short. (Parameter 'destination') Error from 4.0.0 onwards for Specific Parquet Files

notruilin commented 11 months ago

Library Version

4.0.0

OS

Windows

OS Architecture

64 bit

How to reproduce?

Greetings!

I've come across a problem reading a specific Parquet file with version 4.0.0 and later.

Destination is too short. (Parameter 'destination')
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data) in C:\dev\parquet-dotnet - Copy\src\Parquet\Encodings\ParquetPlainEncoder.cs:line 757
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead) in C:\dev\parquet-dotnet - Copy\src\Parquet\Encodings\ParquetPlainEncoder.cs:line 180
   at Parquet.File.DataColumnReader.ReadColumn(Span`1 src, Encoding encoding, Int64 totalValuesInChunk, Int32 totalValuesInPage, PackedColumn pc) in C:\dev\parquet-dotnet - Copy\src\Parquet\File\DataColumnReader.cs:line 271
   at Parquet.File.DataColumnReader.ReadDataPageV1Async(PageHeader ph, PackedColumn pc) in C:\dev\parquet-dotnet - Copy\src\Parquet\File\DataColumnReader.cs:line 196
   at Parquet.File.DataColumnReader.ReadAsync(CancellationToken cancellationToken) in C:\dev\parquet-dotnet - Copy\src\Parquet\File\DataColumnReader.cs:line 79
   at Parquet.ParquetReader.ReadEntireRowGroupAsync(Int32 rowGroupIndex) in C:\dev\parquet-dotnet - Copy\src\Parquet\ParquetReader.cs:line 151

Interestingly, this file could be read without issues in version 3.10.0. The file can also be read successfully by other libraries, such as pyarrow in Python and org.apache.parquet, which suggests the file itself is valid.

Due to internal constraints, I'm unable to share the sample file. But I've analyzed the problem and here's what I've found:

Source Data Reading: For PageHeader.Type equalling PageType.DATA_PAGE, the method Parquet.File.DataColumnReader.ReadDataPageV1Async calculates byte size based on ph.CompressedPageSize and ph.UncompressedPageSize.

Data Transfer from Source to Target: In Parquet.Encodings.ParquetPlainEncoder.Decode, the byte length is decided by _definedDataCount in PackedColumn, which refers to the row count in the metadata.

In my case, the uncompressed_page_size implies more values than the actual row count.

In versions >= 4.0.0, as seen in the screenshot:

When reading a column, the total byte length is identified as 15695.
Accounting for the 7 bytes used in "ReadLevels", we are left with a byte length of 15695 - 7 = 15688.
This would imply there are 1961 rows (15688/8). However, the file should only have 1960 rows.
The code uses 1960 as the row count to allocate the destination buffer, which is 1960 * 8 = 15680 bytes.
An error surfaces when trying to copy 15688 bytes into a space designed for just 15680 bytes.
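The arithmetic above can be checked with a small Python sketch. This is not Parquet.NET code: the `strict_copy` helper is hypothetical and only imitates a bounds-checked copy that refuses to truncate, and the 8-byte value width assumes an int64 or double column.

```python
VALUE_SIZE = 8          # bytes per value; assumes an int64 or double column
PAGE_BYTES = 15695      # total byte length of the page, as reported above
LEVEL_BYTES = 7         # bytes consumed by ReadLevels, as reported above
EXPECTED_ROWS = 1960    # row count taken from the file metadata


def strict_copy(source: bytes, destination: bytearray) -> None:
    """Hypothetical stand-in for a bounds-checked copy: refuse to truncate."""
    if len(source) > len(destination):
        raise ValueError("Destination is too short.")
    destination[:len(source)] = source


payload_bytes = PAGE_BYTES - LEVEL_BYTES              # bytes left after the levels
implied_rows = payload_bytes // VALUE_SIZE            # rows implied by the page size
destination = bytearray(EXPECTED_ROWS * VALUE_SIZE)   # buffer sized from metadata

try:
    strict_copy(bytes(payload_bytes), destination)
    copy_failed = False
except ValueError:
    copy_failed = True

print(payload_bytes, implied_rows, len(destination), copy_failed)
```

The page carries exactly one value (8 bytes) more than the buffer sized from the metadata can hold, so any copy that insists on moving the whole payload fails.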

In version 3.10.0, a protective measure exists:

The total byte length remains 15695 in 3.10.0.
The difference lies in the checks in place:
The stream position s.Position is less than totalLength (15687 < 15695), which suggests the end of the stream has not been reached.
However, an additional condition, idx < dest.Length, ensures that only the data for 1960 rows is processed, so the surplus bytes are ignored.
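A minimal Python sketch of that double guard, for illustration only. It is not the 3.10.0 source: the function `read_values_guarded` is hypothetical, the names `total_length`, `idx` and `dest` merely mirror the ones mentioned above, and the page is synthetic data built to match the reported sizes (7 level bytes followed by 1961 little-endian 8-byte values).

```python
import io
import struct


def read_values_guarded(stream, total_length, expected_count):
    """Read 8-byte integers until either the page bytes or the
    destination slots run out, whichever happens first."""
    dest = [0] * expected_count
    idx = 0
    while stream.tell() < total_length and idx < len(dest):
        (dest[idx],) = struct.unpack("<q", stream.read(8))
        idx += 1
    return dest, idx


# Synthetic page: 7 level bytes + 1961 values, one more than the metadata says.
page = bytes(7) + b"".join(struct.pack("<q", i) for i in range(1961))
stream = io.BytesIO(page)
stream.seek(7)  # skip the level bytes

values, read_count = read_values_guarded(stream, len(page), 1960)
print(len(page), read_count, stream.tell())
```

With both conditions in place the loop stops once the destination is full, leaving the last 8 bytes of the page unread instead of raising an error.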

Questions:

Could this be a regression introduced in 4.0.0?
Or, should the file we're reading be considered "unclean"? I'm new to the parquet format, so I'm curious if there's a standard related to this concern.
Given the safety checks in older versions, could similar checks be reintroduced in the latest version?

Appreciate your assistance. Kindly let me know if more information is needed. Thanks!

Failing test

No response

murermader commented 11 months ago

I just wanted to add that I am experiencing the same issue with files created by pandas by using the function df.to_parquet(path, engine="fastparquet"). I am looking at these files using the program ParquetViewer, which crashes due to this error somewhere in Parquet.NET. The files can be read without problems in pandas or R, so I think the parquet files are not the problem.

This is the file (zipped) I am having issues with: df_fam_places.zip

Some info about the file:

>>> df.dtypes
ID            object
PLAC.TYPE     object
PLAC          object
LATI         float64
LONG         float64
dtype: object

Created using pandas to_parquet() from dictionaries in Python
last two columns contain only NaN values
can be read with R and Pandas, but not Parquet.NET 4.16.4

andreamari1993 commented 11 months ago

I ran into the same issue with this package. I'm using version 4.16.4 on both Windows and Linux. The stack trace is below.

An error occurred: System.ArgumentException: Destination is too short. (Parameter 'destination')
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data)
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead)
   at Parquet.File.DataColumnReader.ReadColumn(Span`1 src, Encoding encoding, Int64 totalValuesInChunk, Int32 totalValuesInPage, PackedColumn pc)
   at Parquet.File.DataColumnReader.ReadDataPageV1Async(PageHeader ph, PackedColumn pc)
   at Parquet.File.DataColumnReader.ReadAsync(CancellationToken cancellationToken)
   at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync[T](ParquetReader reader, Int32 rgi, Assembler`1 asm, ICollection`1 result, CancellationToken cancellationToken)
   at Parquet.Serialization.ParquetSerializer.DeserializeAsync[T](Stream source, ParquetOptions options, CancellationToken cancellationToken)

Is there a workaround? Let me know if you need other details regarding the issue. Thanks, Andrea

aloneguid commented 10 months ago

This should be fixed in the latest release.

akaloshych84 commented 5 months ago

Doesn't seem to be fixed. I have a large file with a complex schema, and deserialization fails for int64 and double columns. When I comment out those class properties, deserialization works.

aloneguid commented 5 months ago

> Doesn't seem to be fixed. I have a large file with a complex schema, and deserialization fails for int64 and double columns. When I comment out those class properties, deserialization works.

Then it's most probably a miscommunication. I'd recommend raising a new issue and including a failing unit test so we are 100% clear on what's not working.

akaloshych84 commented 5 months ago

Thank you so much for the quick reply!

I have a few issues with the deserializer. I submitted another one that is easier to reproduce (maybe related, maybe not). For this one, I still need to find a good reproducible test case. https://github.com/aloneguid/parquet-dotnet/issues/502
