This adds support for reading datasets. A dataset is a set of partitioned parquet files in a directory tree, with the directory structure encoding the partitioning scheme.
Passing the path of a dataset directory to `read_parquet` loads the dataset. `read_parquet` therefore returns a `Parquet.Table` when reading a single file and a `Parquet.Dataset` when reading a dataset. `Parquet.Dataset` is a Tables.jl-compliant table. Iterating over the partitions of a dataset yields one `Parquet.Table` per partition. If a filter function is provided while loading a dataset, it is used to select only a subset of partitions, based on the path to each partition file.
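To illustrate path-based partition selection, here is a minimal sketch that uses only the Julia standard library. It is not the code from this PR: the directory layout (`year=2019`, `year=2020`), the file name `part.parquet`, and the helpers `select_partitions` and `keep_2020` are all hypothetical, and the files are empty placeholders rather than real parquet data.

```julia
# Hypothetical layout, invented for illustration:
#   root/year=2019/part.parquet
#   root/year=2020/part.parquet

# Collect every *.parquet file under `root` that the predicate `keep` accepts.
function select_partitions(keep::Function, root::AbstractString)
    selected = String[]
    for (dir, _, files) in walkdir(root)
        for file in files
            path = joinpath(dir, file)
            if endswith(file, ".parquet") && keep(path)
                push!(selected, path)
            end
        end
    end
    return sort(selected)
end

# Path-based predicate: keep only partitions under a `year=2020` directory.
keep_2020(path) = occursin("year=2020", path)

# Build the placeholder tree in a temporary directory and apply the predicate.
root = mktempdir()
for year in (2019, 2020)
    mkpath(joinpath(root, "year=$year"))
    touch(joinpath(root, "year=$year", "part.parquet"))
end
selected = select_partitions(keep_2020, root)

# With Parquet.jl, a predicate like this would be passed when loading the
# dataset directory. The keyword name `filter` is an assumption here:
#   ds = read_parquet(root; filter=keep_2020)
# and, since the dataset is a Tables.jl table, its partitions could presumably
# be visited with:
#   for tbl in Tables.partitions(ds) ... end
```

Because the predicate only sees the path, it can skip partitions without opening any file, which is what makes filtering on the directory structure cheap.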
The schema for a dataset is read from a metadata file (`_common_metadata` or `_metadata`) if one is available in the dataset directory. Otherwise, the schema of any one of the parquet files in the dataset is used as the dataset schema. All partitions in a dataset must have the same schema.
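The lookup described above could be sketched as follows, again with the standard library only. The helper `schema_source` is hypothetical, and checking `_common_metadata` before `_metadata` is an assumption, since the description does not state a precedence. The sketch only locates the file the schema would come from and does not parse it.

```julia
# Return the path the dataset schema would be read from, or `nothing`.
function schema_source(root::AbstractString)
    # Prefer a metadata file in the dataset directory (order is an assumption).
    for name in ("_common_metadata", "_metadata")
        candidate = joinpath(root, name)
        isfile(candidate) && return candidate
    end
    # Otherwise fall back to any one parquet file found in the tree.
    for (dir, _, files) in walkdir(root)
        for file in files
            endswith(file, ".parquet") && return joinpath(dir, file)
        end
    end
    return nothing
end

# Placeholder dataset: one partition file, no metadata file yet.
root = mktempdir()
mkpath(joinpath(root, "year=2020"))
touch(joinpath(root, "year=2020", "part.parquet"))
without_metadata = schema_source(root)

# Once a metadata file exists, it takes priority over the partition files.
touch(joinpath(root, "_metadata"))
with_metadata = schema_source(root)
```

Falling back to an arbitrary partition file is only safe because all partitions are required to share one schema.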