JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader

OOM When Reading "Short But Wide" (i.e., >100k columns, <1000 rows) partitioned data #171

Open CalvinLeather opened 1 year ago

CalvinLeather commented 1 year ago

Largely leaving this here for others who may run into similarly weird use cases and the resulting problems...

In genomics, we often have "short but wide" datasets due to the width of genomic information (e.g., 100 people's data for 1M+ locations in their DNA). This library (like many other parquet libraries) appears to have trouble representing the metadata for data of this shape. This is an odd abuse of parquet files, but it is somewhat common in genomics; see e.g. https://medium.com/23andme-engineering/genetic-datastore-4b213256db31

Anyway, to recreate: take 2 or 3 parquet partitions with >100k float- or int-typed columns and ~100 rows, and try calling read_parquet(path) on the directory. This OOM'd for us on OSX Big Sur (will follow up with some more details). pyarrow loaded these partitions without issue (albeit really slowly). The partitions in our example were each <100 MB, and there were only a few of them, on a machine with >32 GB of RAM. Parquet.File(filename) works just fine on a single partition (as stated below, this is an odd use case, so I'm not going to dig much more yet).
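For reference, a minimal sketch of the kind of data and calls involved. This is illustrative, not our exact data: the column count, row count, partition layout, and temp directory are assumptions for reproduction purposes.

```julia
using DataFrames, Parquet

dir = mktempdir()

# "Short but wide": 100 rows x 100k Float64 columns per partition.
# Columns are auto-named :x1, :x2, ... by DataFrames.
for i in 1:3
    df = DataFrame(rand(Float64, 100, 100_000), :auto)
    write_parquet(joinpath(dir, "part-$i.parquet"), df)
end

# Opening a single partition works fine:
pf = Parquet.File(joinpath(dir, "part-1.parquet"))

# Reading the partitioned dataset as a whole is what OOM'd for us:
tbl = read_parquet(dir)
```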

So, I'm pretty sure something about loading the header metadata for data this wide is causing problems. I'm not expecting this weird edge case of "short but wide" data to be accommodated, so I'm probably not going to investigate the source code yet. I'm largely leaving this issue as a signpost for other genomics folks who were attempting to use, e.g., JWAS with this library and loaded in floating-point-encoded genomic data.

Credit to @RyanGannon-Embark, who found this mostly without me; I'm mostly the messenger here, since I'm working on documenting our path forward.