Cinchoo / ChoETL

ETL framework for .NET (Parser / Writer for CSV, Flat, Xml, JSON, Key-Value, Parquet, Yaml, Avro formatted files)
MIT License

ParquetWriter&Compression #298

Open S345T opened 8 months ago

S345T commented 8 months ago

Hi,

We're trying to convert JSON to Parquet with compression for one of our requirements, and we have found ChoETL very useful. We have a question about CompressionMethod. We took the Sample52.json message from the repo as an example to see whether it meets our requirements. The compression method we're looking at is Gzip.

What we found is that when we removed the CompressionMethod setting from the ParquetWriter entirely, the output was around 5.7 MB, but with CompressionMethod set to Gzip it was around 6.9 MB.

We tried adding a compression level too.

With a compression level of 8, the output was around 6.6 MB. We understand that is 0.3 MB smaller, but we were expecting a far larger reduction once the message was compressed.

We're wondering whether we're using the component the way it is meant to be used, or whether this is the best it can offer as it stands.

Another thing we didn't understand: the output was smaller without CompressionMethod.

    using (var r = ChoJSONReader.LoadText(requestBody)
        .UseJsonSerialization()
        .JsonSerializationSettings(s => s.DateParseHandling = DateParseHandling.None)
        .JsonSerializationSettings(s => s.NullValueHandling = NullValueHandling.Include)
        )
    {
        using var parquetStream = new MemoryStream();
        using (var w = new ChoParquetWriter(parquetStream)
            .Configure(c => c.CompressionMethod = Parquet.CompressionMethod.Gzip)
            .ThrowAndStopOnMissingField(false)
            )
        {
            w.Write(r);
        }
    }

Thanks in advance.

Cinchoo commented 8 months ago

The latest release, https://www.nuget.org/packages/ChoETL.Parquet/1.0.1.30, offers more compression algorithms.

Please try it and let me know.