aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
542 stars 140 forks

Feature/add row group serialization #506

Closed piiertho closed 1 month ago

piiertho commented 2 months ago

At my company, we need finer-grained control over the serialization and deserialization of row groups.
We use object pools to avoid allocations, and we multi-thread both the modification of the pooled objects and the writing of Parquet row groups using a DoubleBuffer (which uses the aforementioned pools).
This gives us fast, memory-efficient Parquet jobs.

We were using version 3 of this NuGet package and relied on reflection to access private methods of the ClrBridge class to get fast, memory-efficient serialization.
Because the implementation of the serialization API changed a lot, we cannot achieve the same with version 4.
So here is our contribution.

This PR adds:
- a method to serialize a collection into a single row group;
- methods to deserialize a single row group into an existing collection;
- methods to deserialize a file one row group at a time using IAsyncEnumerable.

piiertho commented 1 month ago

The checks are failing on macOS because GitHub switched its macOS runners to Apple silicon. Pinning the runner to a fixed version should fix this.