aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
542 stars 141 forks

[Question] What is the best way to split a parquet file without loading all in memory? #415

Closed nicolaslattuada closed 6 months ago

nicolaslattuada commented 8 months ago

Issue description

Hi, I have a Parquet file with the following schema:

    ObjTwo {
        Id guid
        Url string
        Title string
        Properties Map "Key values map of annotations"
    }

    ObjOne {
        Id guid
        Name string
        Score double
        Metadata Map "Key values map of annotations"
    }

    Job {
        Name string
        Owner string
        OtherId guid
        Engine string "Optional Default=undefined"
        EndTime datetime
        ObjTwos List "List of <ObjTwo>'s"
        ObjOnes List "List of <ObjOne>'s"
    }

Let's say ObjTwos and ObjOnes each have 20M records, and I cannot fit those lists in memory in my runtime environment.

I want to split it into 3 Parquet files:

I am interested in the best performance and a low memory footprint.

Is it possible to do this using parquet-dotnet, without having to load the data in memory? Thanks :)

aloneguid commented 8 months ago

I think you're looking for "row groups"?
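To illustrate the idea, here is a minimal, library-free sketch of processing one row group at a time. It is not parquet-dotnet code: `split_by_row_group`, the output names `jobs`, `obj_twos`, `obj_ones`, and the assumption that the three target files are one for Job fields, one for ObjTwos and one for ObjOnes are all hypothetical. Row groups are modelled as plain lists of dicts standing in for what a real reader would load lazily from disk.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical stand-in: a row group is a list of Job records (dicts).
# A real Parquet reader would load each row group from disk on demand.
RowGroup = List[dict]

# Assumed nested list fields that should go to their own output files.
NESTED_FIELDS = ("ObjTwos", "ObjOnes")


def split_by_row_group(
    row_groups: Iterable[RowGroup],
) -> Iterator[Dict[str, List[dict]]]:
    """For each input row group, yield the rows destined for each output."""
    for group in row_groups:
        out: Dict[str, List[dict]] = {"jobs": [], "obj_twos": [], "obj_ones": []}
        for job in group:
            # Scalar Job fields stay together; nested lists are routed elsewhere.
            out["jobs"].append(
                {k: v for k, v in job.items() if k not in NESTED_FIELDS}
            )
            out["obj_twos"].extend(job.get("ObjTwos", []))
            out["obj_ones"].extend(job.get("ObjOnes", []))
        # The caller would write each list as one row group of its output
        # file and then drop it, so only one row group is held at a time.
        yield out
```

With this pattern, peak memory is bounded by the largest single row group rather than by the whole file, so it only helps if the source file was written with more than one row group.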