aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

Support Dictionary encoding for more types. #531

Open EamonHetherton opened 2 months ago

EamonHetherton commented 2 months ago

Issue description

Currently, only string columns are considered for dictionary encoding. A lot of the data I work with has very high repetition in columns of other data types (mostly int, decimal and datetime). I did a small spike to investigate the benefit of dictionary encoding these, and the results were very encouraging: typically around a 50x reduction in size when not using any compression. Compression narrows the gap somewhat, but even so the Snappy-compressed version ended up 5x smaller with the additional types dictionary encoded.

Of particular interest to me is the decimal data type, which takes 16 bytes per value in PLAIN encoding. In a lot of my cases there are fewer than 30,000 distinct values, so a dictionary index fits in 15 bits. Even in the degenerate case where every run has length 1, that is only 3 bytes per value (a 1-byte run header plus the index rounded up to 2 bytes).
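The 3-bytes-per-value figure can be checked with a short calculation. The sketch below is illustrative only and is not parquet-dotnet code: the function names are hypothetical, and it assumes the RLE run layout from the Parquet format specification (a varint header holding `run_length << 1`, followed by the repeated value padded to whole bytes).

```python
def index_bit_width(distinct_values: int) -> int:
    """Bits needed to address every entry of a dictionary of this size."""
    return max(distinct_values - 1, 0).bit_length()


def varint_length(n: int) -> int:
    """Number of bytes an unsigned LEB128 varint needs to store n."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def rle_run_bytes(bit_width: int, run_length: int = 1) -> int:
    """Size of one RLE run: varint header plus the byte-padded index value."""
    header = varint_length(run_length << 1)  # low bit 0 marks an RLE run
    value = (bit_width + 7) // 8             # index rounded up to whole bytes
    return header + value


PLAIN_DECIMAL_BYTES = 16  # per-value size quoted in the issue text

bit_width = index_bit_width(30_000)
worst_case = rle_run_bytes(bit_width, run_length=1)
print(f"index width: {bit_width} bits")
print(f"worst-case RLE cost: {worst_case} bytes/value vs {PLAIN_DECIMAL_BYTES} in PLAIN")
```

This counts only the index stream. The dictionary page itself still stores each distinct value once (up to 30,000 × 16 bytes here), and a writer can also choose bit-packed runs, which at a 15-bit width cost slightly under 2 bytes per value.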

I'm happy to do the work and open a pull request (I believe it's a fairly small change overall); I just wanted to understand whether there is any reason this has not been implemented to date.