Closed notruilin closed 10 months ago
I just wanted to add that I am experiencing the same issue with files created by pandas using df.to_parquet(path, engine="fastparquet"). I am looking at these files with the program ParquetViewer, which crashes due to this error somewhere in Parquet.NET. The files can be read without problems in pandas or R, so I don't think the Parquet files themselves are the problem.
This is the file (zipped) I am having issues with: df_fam_places.zip
Some info about the file:
>>> df.dtypes
ID object
PLAC.TYPE object
PLAC object
LATI float64
LONG float64
dtype: object
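For context, a minimal sketch of how such a file gets produced (the row values here are made up, since the real data isn't shown; only the column names and dtypes above are from the issue):

```python
import pandas as pd

# Hypothetical sample rows matching the dtypes listed above
df = pd.DataFrame({
    "ID": ["@F1@", "@F2@"],
    "PLAC.TYPE": ["MARR", "BIRT"],
    "PLAC": ["Town A", "Town B"],
    "LATI": [48.13, 52.52],
    "LONG": [11.58, 13.40],
})

try:
    # The write path described in the comment; requires the optional
    # fastparquet dependency
    df.to_parquet("df_fam_places.parquet", engine="fastparquet")
except ImportError:
    pass  # fastparquet not installed in this environment
```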
I ran into the same issue using the package. I'm using version 4.16.4 on both operating systems (Windows and Linux). Below you can find the stack trace.
An error occurred: System.ArgumentException: Destination is too short. (Parameter 'destination')
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data)
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead)
   at Parquet.File.DataColumnReader.ReadColumn(Span`1 src, Encoding encoding, Int64 totalValuesInChunk, Int32 totalValuesInPage, PackedColumn pc)
   at Parquet.File.DataColumnReader.ReadDataPageV1Async(PageHeader ph, PackedColumn pc)
   at Parquet.File.DataColumnReader.ReadAsync(CancellationToken cancellationToken)
   at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync[T](ParquetReader reader, Int32 rgi, Assembler`1 asm, ICollection`1 result, CancellationToken cancellationToken)
   at Parquet.Serialization.ParquetSerializer.DeserializeAsync[T](Stream source, ParquetOptions options, CancellationToken cancellationToken)
Is there a workaround? Let me know if you need any other details regarding the issue. Thanks, Andrea
This should be fixed in the latest release.
Doesn't seem to be fixed. I have a large file with a complex schema, and deserialization fails for int64 and double columns. When I comment out those class properties, deserialization works.
Then it's most probably a miscommunication. I'd recommend raising a new issue and including the failing unit test so we are 100% clear on what's not working.
Thank you so much for the quick reply!
I have a few issues with the deserializer. I submitted another one that is easier to reproduce (maybe related, maybe not): https://github.com/aloneguid/parquet-dotnet/issues/502. For this one I need to find a good reproducible test case.
Library Version
4.0.0
OS
Windows
OS Architecture
64 bit
How to reproduce?
Greetings!
I've come across a problem with reading a specific Parquet file using versions from 4.0.0 onwards. Interestingly, this file was readable without issues before version 3.10.0. Additionally, the file can be successfully read by other libraries such as pyarrow in Python and org.apache.parquet, pointing towards its validity.
Due to internal constraints, I'm unable to share the sample file, but I've analyzed the problem and here's what I've found:
Source Data Reading: For PageHeader.Type equalling PageType.DATA_PAGE, the method Parquet.File.DataColumnReader.ReadDataPageV1Async calculates the byte size based on ph.CompressedPageSize and ph.UncompressedPageSize.
Data Transfer from Source to Target: In Parquet.Encodings.ParquetPlainEncoder.Decode, the byte length is decided by _definedDataCount in PackedColumn, which refers to the row count in the metadata.
In my case, the uncompressed_page_size is longer than what the actual row count requires. In versions >= 4.0.0, the read fails, as the attached screenshot shows:
In version 3.10.0, a protective measure exists: s.Position being less than totalLength (15687 < 15695) suggests it hasn't reached the end of the stream, and idx < dest.Length ensures only data corresponding to 1960 rows is processed, effectively bypassing any surplus data.
Questions:
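To make the difference between the two versions concrete, here is a pure-Python sketch of the two decode behaviours described above. The element size, row count, and surplus length are illustrative stand-ins, not the real file's values:

```python
ELEMENT_SIZE = 8  # e.g. an INT64 or DOUBLE column

def decode_strict(source: bytes, row_count: int) -> list:
    # >= 4.0.0 style: the source length is assumed to match the destination
    # exactly, so surplus page bytes produce a "Destination is too short"
    # style of failure
    if len(source) != row_count * ELEMENT_SIZE:
        raise ValueError("Destination is too short.")
    return [int.from_bytes(source[i * 8:(i + 1) * 8], "little")
            for i in range(row_count)]

def decode_bounded(source: bytes, row_count: int) -> list:
    # 3.10.0 style: stop as soon as the destination is full, silently
    # skipping any surplus bytes at the end of the page
    out, pos = [], 0
    while pos + ELEMENT_SIZE <= len(source) and len(out) < row_count:
        out.append(int.from_bytes(source[pos:pos + ELEMENT_SIZE], "little"))
        pos += ELEMENT_SIZE
    return out

# A "page" holding 1960 values plus 8 surplus trailing bytes
page = b"".join(i.to_bytes(8, "little") for i in range(1960)) + b"\x00" * 8
values = decode_bounded(page, 1960)  # succeeds despite the surplus
```

With the same input, decode_strict raises, which mirrors the ArgumentException reported above, while decode_bounded returns exactly 1960 values.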
Appreciate your assistance. Kindly let me know if more information is needed. Thanks!
Failing test
No response