Cinchoo / ChoETL

ETL framework for .NET (Parser / Writer for CSV, Flat, Xml, JSON, Key-Value, Parquet, Yaml, Avro formatted files)
MIT License

New Parquet.Net default encoding DELTA_BINARY_PACKED is causing issues with Spark #299

Closed jsn-m closed 8 months ago

jsn-m commented 8 months ago

As reported in the Parquet.Net repo, the new DELTA_BINARY_PACKED encoding in Parquet.Net does not play well with Spark 3.3. Please expose Parquet.Net's UseDeltaBinaryPackedEncoding flag so that it can be set through ParquetOptions.

https://aloneguid.github.io/parquet-dotnet/encodings.html#numbers

Cinchoo commented 8 months ago

All Parquet options can be set as shown below:

using (var r = new ChoParquetReader(@"test1.parquet")
    .ParquetOptions(o => o.UseDeltaBinaryPackedEncoding = true))
{
}

jsn-m commented 8 months ago

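I can't pass a lambda expression like that, because my ChoParquetWriter instance is declared as dynamic. I needed dynamic in order to call the WithField method on a writer whose generic type is only known at runtime.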

// Build the closed generic type ChoParquetWriter<dtoType> at runtime.
var genericParquetWriter = typeof(ChoParquetWriter<>).MakeGenericType(dtoType);
// Hold the instance as dynamic so its members can be called without a compile-time type.
dynamic writerInstance = ChoActivator.CreateInstance(genericParquetWriter, new object[] { localPath, });
var dtoProperties = dtoType.GetAllProperties();

writerInstance.WithField(name: "SomeName", fieldType: typeof(string));

jsn-m commented 8 months ago

Figured it out:

// Give the lambda an explicit delegate type so it can be passed to a dynamic call.
var setParquetOptions = new Action<ParquetOptions>(s => s.UseDeltaBinaryPackedEncoding = false);
writerInstance.ParquetOptions(setParquetOptions);
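
For anyone hitting the same problem: the workaround above is probably needed because C# does not allow an untyped lambda as an argument to a dynamically dispatched call (compiler error CS1977). The following is a minimal, self-contained sketch of that behavior. FakeOptions, FakeWriter and Configure are hypothetical stand-ins invented for illustration; they are not the ChoETL or Parquet.Net types.

```csharp
using System;

// Hypothetical stand-in for an options class (not Parquet.Net's ParquetOptions).
public class FakeOptions
{
    public bool UseDeltaBinaryPackedEncoding { get; set; } = true;
}

// Hypothetical stand-in for a writer with a fluent configuration method.
public class FakeWriter
{
    public FakeOptions Options { get; } = new FakeOptions();

    public FakeWriter Configure(Action<FakeOptions> configure)
    {
        configure(Options);
        return this;
    }
}

public static class Program
{
    public static void Main()
    {
        dynamic writer = new FakeWriter();

        // Does not compile (CS1977): a lambda has no type of its own, so it
        // cannot be an argument to a dynamically dispatched operation.
        // writer.Configure(o => o.UseDeltaBinaryPackedEncoding = false);

        // Casting the lambda to a delegate type gives it a concrete type,
        // which makes the dynamic call legal.
        writer.Configure((Action<FakeOptions>)(o => o.UseDeltaBinaryPackedEncoding = false));

        Console.WriteLine(writer.Options.UseDeltaBinaryPackedEncoding); // prints "False"
    }
}
```

Assigning the lambda to a typed variable first, as in the comment above, and casting it inline are equivalent; both hand the runtime binder a real delegate instance instead of an untyped lambda.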