G-Research / ParquetSharp

ParquetSharp is a .NET library for reading and writing Apache Parquet files.

[BUG]: Getting error while reading checkpoint parquet file #480

Closed: shamimashik closed this issue 3 months ago

shamimashik commented 3 months ago

Issue Description

While reading the attached checkpoint parquet file, I get a "Values and indices out of sync" exception. The issue seems to occur while reading the MetaData struct. The checkpoint parquet file looks correct according to the Delta protocol's change metadata action definition: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#change-metadata

Checkpoint parquet file: 00000000000000000310.checkpoint.zip

Stack trace:

    ERROR: System.Exception: Values and indices out of sync.
       at ParquetSharp.BufferedReader`2.FillBuffer() in /workspaces/ParquetSharp/csharp/BufferedReader.cs:line 116
       at ParquetSharp.BufferedReader`2.ReadValue() in /workspaces/ParquetSharp/csharp/BufferedReader.cs:line 33
       at ParquetSharp.LogicalBatchReader.LeafReader`2.ReadBatch(Span`1 destination) in /workspaces/ParquetSharp/csharp/LogicalBatchReader/LeafReader.cs:line 25
       at ParquetSharp.LogicalColumnReader`1.ReadBatch(Span`1 destination) in /workspaces/ParquetSharp/csharp/LogicalColumnReader.cs:line 201
       at ParquetSharp.LogicalColumnReader`1.ReadBatch(TElement[] destination, Int32 start, Int32 length) in /workspaces/ParquetSharp/csharp/LogicalColumnReader.cs:line 194
       at ParquetSharp.Test.LogicalValueGetter.OnLogicalColumnReader[TValue](LogicalColumnReader`1 columnReader) in /workspaces/ParquetSharp/csharp.test/LogicalValueGetter.cs:line 22
       at ParquetSharp.LogicalColumnReader`1.Apply[TReturn](ILogicalColumnReaderVisitor`1 visitor) in /workspaces/ParquetSharp/csharp/LogicalColumnReader.cs:line 155
       at ParquetSharp.Test.TestParquetFileReader.TestReadFileCreateByPython() in /workspaces/ParquetSharp/csharp.test/TestParquetFileReader.cs:line 169
       at ParquetSharp.Test.Program.Main() in /workspaces/ParquetSharp/csharp.test/Program.cs:line 21

Environment Information

Steps To Reproduce

  1. Create a codespace project.
  2. Add the attached parquet file to your project: 00000000000000000310.checkpoint.zip
  3. Use an existing test to read the parquet file. I used TestParquetFileReader.TestReadFileCreateByPython.
  4. The test throws the exception shown in the stack trace.

Expected Behavior

ParquetSharp should read the checkpoint parquet file without throwing an exception.


adamreeve commented 3 months ago

Thanks for the bug report @shamimashik. I can reproduce the issue with the current master branch, and the file can be read by pyarrow 17, so it looks like there's a bug in how we handle nested values.

adamreeve commented 3 months ago

As a workaround, it looks like this file can be read if you enable wrapping nested data in the ParquetSharp.Nested type:

                         Console.WriteLine("  - repetition levels: {0}", ToString(repetitionLevels));
                     }

-                    using (var columnReader = rowGroupReader.Column(c).LogicalReader())
+                    using (var columnReader = rowGroupReader.Column(c).LogicalReader(useNesting: true))
                     {
                         var logicalValues = columnReader.Apply(new LogicalValueGetter(numRows));

This means that, for example, metaData.id is no longer read as an array of strings. Instead you'll get an array of ParquetSharp.Nested<string>, where a null value means the enclosing metaData object is null.
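To make the distinction concrete, here is a small language-neutral model in Python of how Parquet definition levels could map onto the two result shapes. It is only a sketch and not ParquetSharp's implementation: it assumes metaData is an optional group and id is an optional string inside it (so the maximum definition level is 2), and the `Nested` class, `decode_flat`, and `decode_nested` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed schema: optional group metaData { optional string id }.
# Definition level 0: metaData is null.
# Definition level 1: metaData is present, id is null.
# Definition level 2: id has a value.
MAX_DEF_LEVEL = 2


@dataclass(frozen=True)
class Nested:
    """Hypothetical stand-in for a wrapper type such as ParquetSharp.Nested<T>."""
    value: Optional[str]


def decode_flat(def_levels: List[int], values: List[str]) -> List[Optional[str]]:
    """Flat view: a null struct and a null field both collapse to None."""
    it = iter(values)
    return [next(it) if d == MAX_DEF_LEVEL else None for d in def_levels]


def decode_nested(def_levels: List[int], values: List[str]) -> List[Optional[Nested]]:
    """Nested view: None means the enclosing struct itself is null."""
    it = iter(values)
    out: List[Optional[Nested]] = []
    for d in def_levels:
        if d == 0:
            out.append(None)              # metaData is null
        elif d == 1:
            out.append(Nested(None))      # metaData present, id null
        else:
            out.append(Nested(next(it)))  # id has a value
    return out


# Three rows: id set, metaData null, metaData present with a null id.
def_levels = [2, 0, 1]
values = ["abc"]  # only non-null values are stored

flat = decode_flat(def_levels, values)
nested = decode_nested(def_levels, values)
print(flat)
print(nested)
```

In the flat view the second and third rows are indistinguishable, whereas the nested view keeps "metaData is null" separate from "metaData.id is null". In a Delta checkpoint most rows carry some other action, so metaData is null for most rows, which is why this distinction matters for this file.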

shamimashik commented 3 months ago

@adamreeve Thanks for the fix. When can we expect a new release with this fix?

adamreeve commented 3 months ago

This will go out with the 17.0.0 release, which might be about a month away. I could release a beta version earlier if that would help you though?

shamimashik commented 3 months ago

I see. Yes, a beta release will help as well. @adamreeve

shamimashik commented 3 months ago

Hi @adamreeve, is it possible to give an ETA on the beta release? TIA.

adamreeve commented 3 months ago

I've just released version 17.0.0-beta1 now @shamimashik