aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

[BUG]: Can't read Parquet file with fixed_len_byte_array_column data type generated by Python #463

Closed: AndrewDavidLees closed this issue 5 months ago

AndrewDavidLees commented 5 months ago

Library Version

4.23.1 (and earlier)

OS

Windows

OS Architecture

64 bit

How to reproduce?

  1. Create a Parquet file in Python (this could already be the source of the issue). See CreateParquetFile2.py. One column is declared as ('fixed_len_byte_array_column', pa.binary(3)) and populated with 'fixed_len_byte_array_column': [b'abc', b'def', b'ghi', b'jkl', b'mno', b'qrs']. See dataTypesExample.parquet.
  2. Read the file in Parquet.NET from C# (a minimal reader sketch is shown after this list). The current exception is "Specified argument was out of the range of valid values.":

    at System.ThrowHelper.ThrowArgumentOutOfRangeException()
    at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data)
    at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead)
    at Parquet.File.DataColumnReader.d__12.MoveNext()
    at Parquet.File.DataColumnReader.d__9.MoveNext()
    at Emb.PricingSuite.ProcessingEngine.Implementation.ParquetFileDataReader.d__53.MoveNext() in C:\Work\Radar4\Emb.PricingSuite.PE.Components.Data\src\ConnectionHandling\Parquet\ParquetFileDataReader.cs:line 325

  3. The previous exception in 4.10 complained about a startValue being out of range; I upgraded to see whether it made any difference. See BugReport.zip.
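For reference, a minimal Parquet.Net read loop that reproduces the failure against the attached file could look like the sketch below. This is an illustrative reconstruction using the library's public API, not the reporter's actual ParquetFileDataReader code; the file name is taken from the attachment above.

    using System;
    using System.IO;
    using Parquet;
    using Parquet.Data;
    using Parquet.Schema;

    // Open the attached file and read every column through the public API.
    // Decoding the fixed_len_byte_array_column page is where the
    // ArgumentOutOfRangeException surfaces.
    using Stream fs = File.OpenRead("dataTypesExample.parquet");
    using ParquetReader reader = await ParquetReader.CreateAsync(fs);

    for (int g = 0; g < reader.RowGroupCount; g++)
    {
        using ParquetRowGroupReader rowGroup = reader.OpenRowGroupReader(g);
        foreach (DataField field in reader.Schema.GetDataFields())
        {
            DataColumn column = await rowGroup.ReadColumnAsync(field); // throws here
            Console.WriteLine($"{field.Name}: {column.Data.Length} values");
        }
    }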

Failing test

No response

aloneguid commented 5 months ago

Thanks for this. It looks like it's failing while decoding the byte array length, which comes out as a really large value:

[screenshot: debugger showing the decoded byte array length as a very large value]
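For context, and as my own explanation from the Parquet format spec rather than from this repository's code: PLAIN-encoded BYTE_ARRAY values each carry a 4-byte little-endian length prefix, whereas PLAIN-encoded FIXED_LEN_BYTE_ARRAY values are packed back to back with no prefix at all; their size comes from the schema element's type_length. A decoder that applies the length-prefixed path to fixed-length data would reinterpret payload bytes as a length, which would produce exactly this kind of huge value. A hypothetical sketch of the two layouts:

    using System;
    using System.Buffers.Binary;

    static class PlainDecodingSketch
    {
        // PLAIN BYTE_ARRAY: each value is a 4-byte LE length followed by the bytes.
        public static byte[][] DecodeByteArray(ReadOnlySpan<byte> source, int count)
        {
            var values = new byte[count][];
            for (int i = 0; i < count; i++)
            {
                int length = BinaryPrimitives.ReadInt32LittleEndian(source);
                source = source.Slice(4);
                values[i] = source.Slice(0, length).ToArray();
                source = source.Slice(length);
            }
            return values;
        }

        // PLAIN FIXED_LEN_BYTE_ARRAY: raw bytes packed back to back;
        // typeLength comes from the schema, not from the data stream.
        public static byte[][] DecodeFixedLenByteArray(ReadOnlySpan<byte> source, int count, int typeLength)
        {
            var values = new byte[count][];
            for (int i = 0; i < count; i++)
                values[i] = source.Slice(i * typeLength, typeLength).ToArray();
            return values;
        }
    }

As a plausibility check: feeding the packed three-byte values from the attached file ("abcdefghi...") through the length-prefixed path would read the bytes 'a','b','c','d' as a little-endian Int32 of 0x64636261, roughly 1.7 billion, the kind of out-of-range length shown above.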

What's worse, the file also fails to read in Apache Spark, although Spark trips over a different column (the INT64 TIMESTAMP(NANOS,false) one):

Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false))
    at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1317)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:191)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertPrimitiveField$2(ParquetSchemaConverter.scala:269)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:209)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:173)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3(ParquetSchemaConverter.scala:138)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3$adapted(ParquetSchemaConverter.scala:108)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.immutable.Range.foreach(Range.scala:158)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertInternal(ParquetSchemaConverter.scala:108)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:78)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:577)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:577)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:557)
    at scala.collection.immutable.Stream.map(Stream.scala:418)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:557)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:549)
    at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    ... 1 more

aloneguid commented 5 months ago

I think I've fixed it:

[screenshot: the attached file now being read successfully]

aloneguid commented 5 months ago

Fix released, please give it a go ;)

AndrewDavidLees commented 5 months ago

Wow, that's amazing, thanks Ivan! Seems to be working perfectly fine now :)

aloneguid commented 5 months ago

I'm glad to hear that everything is working perfectly now! If you found my assistance helpful, a star would be appreciated. Thank you! 😊