aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

Allow registering custom types for class serialization #466

Open pkese opened 5 months ago

pkese commented 5 months ago

Issue description

Hi,

I'm using F#, which has a few extra built-in container types (immutable records, tuples, linked lists, options, etc.) for which Parquet.Net's class serialization doesn't work at all.
It would be nice if Parquet.Net were able to support those extra container types.
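
To make the request concrete, a hypothetical record combining some of these types (the names are mine, purely for illustration):

    // An F# record mixing the container types in question:
    type Reading = {
        Sensor: string
        Value: float option     // FSharpOption<float>
        Location: float * float // a tuple
    }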

I'm not expecting Parquet.Net to add support for these extra types itself - that's probably beyond the scope of this library - however, it would be nice if Parquet.Net were flexible enough to allow people to implement such support themselves.

It could be done in a similar fashion to how System.Text.Json allows registering additional domain types and lets people provide their own extensions, e.g. FSharp.SystemTextJson.

So this ticket is a humble request to provide functionality for extending Parquet.Net in a similar (but not necessarily identical) manner to how JsonConverterFactory allows extending System.Text.Json with custom container types.
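
For reference, a minimal sketch of that registration model, using FSharp.SystemTextJson's JsonFSharpConverter (which is a JsonConverterFactory); the hope is that Parquet.Net could expose a comparable hook:

    open System.Text.Json
    open System.Text.Json.Serialization // JsonFSharpConverter, from FSharp.SystemTextJson

    type Person = { Name: string; Age: int option }

    // The library plugs in by registering a converter factory on the options:
    let options = JsonSerializerOptions()
    options.Converters.Add(JsonFSharpConverter())

    // Now F#-specific types round-trip through System.Text.Json:
    let json = JsonSerializer.Serialize({ Name = "Ada"; Age = Some 36 }, options)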

I'm sure there would be other users of such an API, not just F# folks.

Thanks.

mmport80 commented 4 months ago

I am also trying to use this library with F# (it doesn't write out correct Parquet files with either manual or auto serialisation; no footer, AFAICT).

If you have a moment: how do you go about it, and what would you recommend or avoid?

Or even better, a quick gist, just to show me that something is possible.

mmport80 commented 4 months ago

OK, I finally got it working with a manual Dispose call. I am not a .NET / F# person, so maybe this is only confusing to me.


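    // NB: this fragment sits inside an async { ... } block, hence let! / do!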
    use fileStream = File.Create(filePath)
    let! parquetWriter = ParquetWriter.CreateAsync(schema, fileStream) |> Async.AwaitTask
    parquetWriter.CompressionMethod <- CompressionMethod.Gzip
    parquetWriter.CompressionLevel <- System.IO.Compression.CompressionLevel.Optimal

    try
        use rowGroupWriter = parquetWriter.CreateRowGroup()

        for dataColumn in dataColumns do
            printfn "debug: %s" dataColumn.Field.Name
            do! rowGroupWriter.WriteColumnAsync(dataColumn) |> Async.AwaitTask

        printfn "Successfully wrote to Parquet file: %s" filePath
    finally
        // parquetWriter was bound with let! rather than use!, so it has to be
        // disposed manually; disposing is what flushes and writes the footer
        parquetWriter.Dispose()

pkese commented 4 months ago

Rather than async {...}, use task {...}, and then your code becomes:

    task {
        use fileStream = File.Create(filePath)
        use! parquetWriter = ParquetWriter.CreateAsync(schema, fileStream)
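        // use! ties the writer's lifetime to the task block: it is disposed
        // automatically on exit, which is what writes the Parquet footer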
        parquetWriter.CompressionMethod <- CompressionMethod.Gzip
        parquetWriter.CompressionLevel <- System.IO.Compression.CompressionLevel.Optimal

        use rowGroupWriter = parquetWriter.CreateRowGroup()

        for dataColumn in dataColumns do
            printfn "debug: %s" dataColumn.Field.Name
            do! rowGroupWriter.WriteColumnAsync(dataColumn)

        printfn "Successfully wrote to Parquet file: %s" filePath
    }
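
And to verify the footer was actually written, a quick read-back sketch (assuming Parquet.Net's ParquetReader.CreateAsync, OpenRowGroupReader and ReadColumnAsync; filePath is the file written above):

    open System.IO
    open Parquet

    task {
        use fileStream = File.OpenRead(filePath)
        // CreateAsync parses the footer, so it fails if the file is incomplete
        use! reader = ParquetReader.CreateAsync(fileStream)
        for g in 0 .. reader.RowGroupCount - 1 do
            use groupReader = reader.OpenRowGroupReader(g)
            for field in reader.Schema.GetDataFields() do
                let! column = groupReader.ReadColumnAsync(field)
                printfn "%s: %d values" field.Name column.Data.Length
    }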

I myself ended up sticking with JSON (I wanted to improve my data pipeline with Parquet, but then didn't find the time for it).