aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
542 stars 140 forks source link

ParquetReader.RowGroups property #509

Closed danielearwicker closed 1 month ago

danielearwicker commented 2 months ago

Row group readers are created up-front and held in a list, so the OpenRowGroupReader method is just a wrapper around list[n].

There is a note in the reader's Dispose method that says that although dispose is not required, it may be in the future so clients so clients have to take on extra ceremony. But the current design is super fast and efficient even with very large numbers of row groups, so it shouldn't be necessary.

By clarifying that Dispose isn't required, this enables a simple ParquetReader.RowGroups property that exposes a read-only list of row group readers. This then enables Linq chained expressions to operate on row groups, use Reverse to read from the end, Where to filter on column stats, and ToAsyncEnumerable().SelectManyAwait to deserialise into a stream of objects. Can do this with multiple sorted parquet sources and then merge them with SortedMergeBy, before batching them back into row groups and directing them to a destination stream. With the right extension methods a whole operation like this could be written declaratively.

Rather than causing a breaking change to existing clients I've added an interface that omits Dispose but retained it on the class, so the RowGroups property (which uses the non-disposable interface) makes it clear that disposal is unnecessary.

danielearwicker commented 2 months ago

Noticed that the macOS tests are failing right now, no Apple M1 binary in a certain .jar. I did a super-hacky fix for it:

jar uf ../parquet-tools/parquet-tools-1.9.0.jar org/xerial/snappy/native/Mac/aarch64/libsnappyjava.dylib

i.e. add the missing binary to the existing jar. It now appears to work, except on 3.1 where there's a separate issue possible because that only has X64 which isn't being installed?

aloneguid commented 1 month ago

the build should be OK now, just merging here from master...

aloneguid commented 1 month ago

The build succeeds now. I have no issues accepting this, however you might want to add tests with use cases to guarantee this interface is not forgotten and removed during next refactoring session? Custom .jar can be removed as well.

aloneguid commented 1 month ago

I like the .jar hack though ;) But will build latest from official sources later.

danielearwicker commented 1 month ago

Cool, I've added a simple test for the property and reverted the .jar change.

aloneguid commented 1 month ago

pusing to pre-release -pre.6 to be included in 4.24