JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader

support for reading partitioned datasets #138

Closed tanmaykm closed 3 years ago

tanmaykm commented 3 years ago

This adds support for reading datasets. A dataset is a set of partitioned parquet files in a directory tree, where the directory structure encodes the partitioning scheme. E.g.:

dataset/_common_metadata
dataset/year=2020/registered=False/part1.parquet
dataset/year=2020/registered=True/part2.parquet
dataset/year=2021/registered=False/part3.parquet
dataset/year=2021/registered=True/part4.parquet
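
To illustrate how such a layout encodes partition values, here is a minimal sketch that pulls the `key=value` pairs out of a partition file path. The helper name `partition_values` is hypothetical and is not part of Parquet.jl; it only uses Julia's standard library and does not reflect how the package parses paths internally.

```julia
# Hypothetical helper (not part of Parquet.jl): collect the key=value
# directory segments of a partition file path, in path order.
function partition_values(path::AbstractString)
    found = Pair{String,String}[]
    for segment in splitpath(path)
        parts = split(segment, '='; limit=2)
        # Only segments of the form key=value (non-empty key) are partition columns.
        if length(parts) == 2 && !isempty(parts[1])
            push!(found, String(parts[1]) => String(parts[2]))
        end
    end
    return found
end

partition_values("dataset/year=2020/registered=False/part1.parquet")
# yields the pairs "year" => "2020" and "registered" => "False"
```

The values come back as strings; converting them to typed columns (for example `Int` or `Bool`) is left out of this sketch.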

To load a dataset, pass the path to the dataset directory to read_parquet. read_parquet returns a Parquet.Table when reading a single file and a Parquet.Dataset when reading a dataset. Parquet.Dataset is a Tables.jl compliant table. Iterating over the partitions of a dataset yields one Parquet.Table per partition. If a filter function is provided while loading a dataset, it is called with the path to each partition file and used to select only a subset of partitions.

julia> using Parquet, DataFrames

julia> DataFrame(read_parquet("dataset/"; filter=(path)->occursin("bool=false", lowercase(path))))
100×12 DataFrame
 Row │ int32        int64                 float32
     │ Int32?       Int64?                Float32?
─────┼────────────────────────────────────────────
   1 │   753264708    665618051800910486  0.563551
   2 │ -1581958242   -361882957099568249  0.770138
...
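
Since the filter only sees the partition file path, a reusable predicate can be built in plain Julia. The sketch below is one possible way to do that; `partition_filter` is a hypothetical helper, not something provided by Parquet.jl. It compares whole path segments rather than substrings, so a filter for `year=2020` does not accidentally match a directory such as `year=20201`.

```julia
# Hypothetical helper (not part of Parquet.jl): build a path predicate that
# keeps partitions whose directory tree contains the segment key=value.
# The comparison is case-insensitive, like the occursin example above.
function partition_filter(key::AbstractString, value::AbstractString)
    needle = lowercase(string(key, "=", value))
    return path -> needle in lowercase.(splitpath(path))
end

only_2020 = partition_filter("year", "2020")
only_2020("dataset/year=2020/registered=True/part2.parquet")   # true
only_2020("dataset/year=2021/registered=True/part4.parquet")   # false
```

Assuming the `filter` keyword behaves as in the example above, the predicate would be passed as `read_parquet("dataset/"; filter=only_2020)`.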

The schema for a dataset is read from a metadata file (_common_metadata or _metadata) if one is available in the dataset directory. Otherwise, the schema of any one of the parquet files in the dataset is used as the dataset schema. All partitions in a dataset must have the same schema.
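
The lookup described above can be sketched roughly as follows. This is not the code from this PR: `schema_source` is a hypothetical name, checking `_common_metadata` before `_metadata` is an assumed order, and so is identifying data files by a `.parquet` extension.

```julia
# Hypothetical sketch (not the package's implementation): choose the file
# from which a dataset schema would be read.
function schema_source(dataset_dir::AbstractString)
    # Prefer a metadata file at the dataset root (order is an assumption).
    for name in ("_common_metadata", "_metadata")
        candidate = joinpath(dataset_dir, name)
        isfile(candidate) && return candidate
    end
    # Otherwise fall back to any one parquet file in the directory tree.
    for (root, _, files) in walkdir(dataset_dir)
        for file in sort(files)
            endswith(file, ".parquet") && return joinpath(root, file)
        end
    end
    return nothing  # no metadata file and no parquet files found
end
```

Because all partitions must share one schema, any single data file is enough for the fallback; the sketch simply returns the first one it encounters.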