Closed shamimashik closed 3 months ago
Thanks for the bug report @shamimashik. I can reproduce the issue with the current master branch, and the file can be read by pyarrow 17, so it looks like there's a bug in how we handle nested values.
As a workaround, it looks like this file can be read if you enable wrapping nested data in the ParquetSharp.Nested type:
Console.WriteLine(" - repetition levels: {0}", ToString(repetitionLevels));
}
- using (var columnReader = rowGroupReader.Column(c).LogicalReader())
+ using (var columnReader = rowGroupReader.Column(c).LogicalReader(useNesting: true))
{
var logicalValues = columnReader.Apply(new LogicalValueGetter(numRows));
This means that rather than metaData.id being read as an array of strings, for example, you'll get an array of ParquetSharp.Nested<string>, where a null value means the enclosing metaData object is null.
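For anyone consuming the wrapped values directly, here is a rough sketch of how the result could be unwrapped. The file path and column index are hypothetical placeholders. It also assumes that metaData.id is a string leaf nested inside a single optional group, so the element type is Nested<string>?; if the schema differs, the typed LogicalReader call will throw an InvalidCastException.

```csharp
using System;
using System.Linq;
using ParquetSharp;

// Hypothetical path and column index; adjust both for your file.
const string path = "checkpoint.parquet";
const int idColumn = 0; // assumed index of the metaData.id leaf column

using var fileReader = new ParquetFileReader(path);
using var rowGroupReader = fileReader.RowGroup(0);
int numRows = checked((int) rowGroupReader.MetaData.NumRows);

// Requesting a Nested element type asks for nested data to be wrapped,
// much like passing useNesting: true to the untyped LogicalReader overload.
using var columnReader = rowGroupReader.Column(idColumn).LogicalReader<Nested<string>?>();
Nested<string>?[] wrapped = columnReader.ReadAll(numRows);

// A null entry means the enclosing metaData struct is null for that row.
// Otherwise Value holds the id, which may itself be null if the field is optional.
string?[] ids = wrapped.Select(v => v?.Value).ToArray();

for (var row = 0; row < ids.Length; row++)
{
    var description = wrapped[row].HasValue ? (ids[row] ?? "(null id)") : "(metaData is null)";
    Console.WriteLine("row {0}: {1}", row, description);
}
```

Flattening to string?[] loses the distinction between a null struct and a null id, so keep the Nested<string>?[] array around if that difference matters to you.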
@adamreeve thanks for the fix. When can we expect a new release with this fix?
This will go out with the 17.0.0 release, which might be about a month away. I could release a beta version earlier if that would help you though?
I see. Yes, a beta release will help as well. @adamreeve
Hi @adamreeve, is it possible to give an ETA on the beta release? TIA.
I've just released version 17.0.0-beta1 now @shamimashik
Issue Description
While reading the attached checkpoint parquet file, I'm getting a "Values and indices out of sync" exception. The issue seems to occur while reading the MetaData struct. The checkpoint parquet file looks okay according to the Delta protocol's change metadata action definition: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#change-metadata
Checkpoint parquet file: 00000000000000000310.checkpoint.zip
Stack trace:
ERROR: System.Exception: Values and indices out of sync.
   at ParquetSharp.BufferedReader`2.FillBuffer() in /workspaces/ParquetSharp/csharp/BufferedReader.cs:line 116
   at ParquetSharp.BufferedReader`2.ReadValue() in /workspaces/ParquetSharp/csharp/BufferedReader.cs:line 33
   at ParquetSharp.LogicalBatchReader.LeafReader`2.ReadBatch(Span`1 destination) in /workspaces/ParquetSharp/csharp/LogicalBatchReader/LeafReader.cs:line 25
   at ParquetSharp.LogicalColumnReader`1.ReadBatch(Span`1 destination) in /workspaces/ParquetSharp/csharp/LogicalColumnReader.cs:line 201
   at ParquetSharp.LogicalColumnReader`1.ReadBatch(TElement[] destination, Int32 start, Int32 length) in /workspaces/ParquetSharp/csharp/LogicalColumnReader.cs:line 194
   at ParquetSharp.Test.LogicalValueGetter.OnLogicalColumnReader[TValue](LogicalColumnReader`1 columnReader) in /workspaces/ParquetSharp/csharp.test/LogicalValueGetter.cs:line 22
   at ParquetSharp.LogicalColumnReader`1.Apply[TReturn](ILogicalColumnReaderVisitor`1 visitor) in /workspaces/ParquetSharp/csharp/LogicalColumnReader.cs:line 155
   at ParquetSharp.Test.TestParquetFileReader.TestReadFileCreateByPython() in /workspaces/ParquetSharp/csharp.test/TestParquetFileReader.cs:line 169
   at ParquetSharp.Test.Program.Main() in /workspaces/ParquetSharp/csharp.test/Program.cs:line 21
Environment Information
Steps To Reproduce
Expected Behavior
Should be able to read the checkpoint parquet file without any exception.
Additional Context (Optional)
No response