aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
615 stars, 152 forks

[BUG]: Table.WriteAsync does not handle empty lists correctly #400

Closed WoMayr closed 5 months ago

WoMayr commented 1 year ago

Library Version

4.16.4

OS

Windows 10

OS Architecture

64 bit

How to reproduce?

  1. Check out the solution from this repo: https://github.com/WoMayr/ParquetTableWriteIssue
  2. Change the basePath variable in Program.cs to a path that exists on your machine
  3. Run it
  4. When opening "outputTable.parquet" in Parquet Viewer, the employee arrays spill into rows that should have empty arrays (image). For comparison, this does not happen when using class serialization (image).

I also tried opening the file with DBeaver using the DuckDB driver and get the following error: "SQL Error: Invalid Input Error: Mismatch in parquet read for column 6, expected 1000 rows, got 665"

Failing test

https://github.com/WoMayr/ParquetTableWriteIssue

aloneguid commented 1 year ago

Table API will be deprecated soon. Are you able to use class serialisation?

WoMayr commented 1 year ago

Oh.. good to know. Unfortunately class serialization is not an option for my use case.

aloneguid commented 1 year ago

Fair enough. I'm trying to rewrite the row API to use the same engine as class serialization, since that engine follows the Parquet specification 100%, but it's a long shot. It is a priority for me though.

tim-white-waters commented 1 year ago

Is there a workaround? Files generated using this method also fail to read using S3 object select (this is crucial to our ETL pipelines). Class-based serialization is not an option: we have dynamic schemas that are determined at run time.

tim-white-waters commented 1 year ago

We're now looking at serializing this to JSON, which is not my favorite approach.

aloneguid commented 1 year ago

Sorry, as of now row API development is frozen unless you contribute a PR. The best option would be to resort to the low-level API, which is as flexible as it gets but needs special handling for lists. There are plenty of examples though.
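For readers wondering what "special handling for lists" involves, here is a minimal, library-agnostic sketch of how a Parquet list column is flattened into values plus repetition and definition levels. It is not parquet-dotnet code. The function names `encode_list_column` and `count_rows` are hypothetical, and the schema is assumed to be an optional list of required elements (maximum definition level 2). The key point is that an empty list still emits one level entry with no value. If those entries were dropped, a reader would count fewer rows than expected, which is one plausible reading of the DuckDB row-count mismatch reported above, though the thread does not confirm the cause.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed schema: optional list of required elements.
# Definition level 0 = list is null, 1 = list is empty, 2 = element present.
MAX_DEFINITION_LEVEL = 2


def encode_list_column(
    rows: Sequence[Optional[Sequence[str]]],
) -> Tuple[List[str], List[int], List[int]]:
    """Flatten rows of lists into (values, repetition levels, definition levels)."""
    values: List[str] = []
    repetition: List[int] = []
    definition: List[int] = []
    for row in rows:
        if row is None:
            # Null list: one level entry, no value.
            repetition.append(0)
            definition.append(0)
        elif len(row) == 0:
            # Empty list: still one level entry, no value.
            repetition.append(0)
            definition.append(1)
        else:
            for index, value in enumerate(row):
                # Repetition level 0 starts a new row; 1 continues the current list.
                repetition.append(0 if index == 0 else 1)
                definition.append(MAX_DEFINITION_LEVEL)
                values.append(value)
    return values, repetition, definition


def count_rows(repetition: Sequence[int]) -> int:
    """A reader recovers the row count from the number of repetition-level zeros."""
    return sum(1 for level in repetition if level == 0)


# Hypothetical sample data: four rows, one of them an empty list and one null.
sample_rows = [["a", "b"], [], None, ["c"]]
sample_values, sample_repetition, sample_definition = encode_list_column(sample_rows)
```

With a low-level writer, arrays like these would be supplied per column. The exact constructor and parameters depend on the library version, so check its documentation rather than relying on this sketch.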

aloneguid commented 5 months ago

Closing due to inactivity and the upcoming deprecation of the row API.