delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.47k stars 1.68k forks

[Feature Request] Split Delta table in N chunks and only read 1 chunk - Deep Learning workloads #1623

Open QuentinAmbard opened 1 year ago

QuentinAmbard commented 1 year ago

Feature request

Overview

With this feature, we'd be able to split a Delta table into N chunks and ask the reader to read only a specific chunk. The reader would do its best to split the dataset evenly. For example, the reader could split the files into N buckets of even size and read only a given bucket. The reader should do its best to avoid reading unnecessary data. Reading the same Delta version with the same number of chunks would always return the same result, guaranteeing that multiple readers can together query all the data without duplicates or gaps.
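To make the bucket idea concrete, here is a minimal sketch of one way a reader could assign files to chunks deterministically. It is not an existing Delta API: `split_files`, `files_for_chunk`, and the `(path, size)` input are hypothetical names, and the sketch assumes the file list for a given table version is already known. It works at file granularity only and ignores deletion vectors.

```python
import heapq
from typing import List, Tuple


def split_files(files: List[Tuple[str, int]], num_chunks: int) -> List[List[str]]:
    """Assign (path, size_in_bytes) pairs to num_chunks buckets of similar total size.

    Hypothetical helper: the input is assumed to be the file list of one
    table version. The result depends only on the set of files and on
    num_chunks, not on the order in which files were listed.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")

    # Largest files first; the path breaks ties so the order is fully deterministic.
    ordered = sorted(files, key=lambda f: (-f[1], f[0]))

    # Min-heap of (total_bytes_so_far, chunk_index): always fill the lightest chunk.
    heap = [(0, i) for i in range(num_chunks)]
    chunks: List[List[str]] = [[] for _ in range(num_chunks)]

    for path, size in ordered:
        total, idx = heapq.heappop(heap)
        chunks[idx].append(path)
        heapq.heappush(heap, (total + size, idx))

    return chunks


def files_for_chunk(files: List[Tuple[str, int]], num_chunks: int, chunk_index: int) -> List[str]:
    """Return only the files that belong to one chunk."""
    return split_files(files, num_chunks)[chunk_index]


# Example with made-up file names and sizes.
example_files = [("a.parquet", 100), ("b.parquet", 60), ("c.parquet", 50), ("d.parquet", 10)]
worker_0_files = files_for_chunk(example_files, num_chunks=2, chunk_index=0)
worker_1_files = files_for_chunk(example_files, num_chunks=2, chunk_index=1)
```

Because every worker computes the same assignment from the same version and the same N, the chunks are disjoint and together cover all files. A real implementation would likely also need to split within large files to balance row counts.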

Motivation

This feature would be used for distributed compute where each worker needs access to a subset of the data. This is especially true for data science workloads and deep learning training. A typical example would be a Delta table storing text or images that we want to use to train a DL model (potentially distributed). With this feature, we'd be able to implement native PyTorch/TensorFlow readers that fetch only a subset of a Delta table in an efficient way.

Further details

Michael is solving this with an implementation that leverages an ID column to split the data across N workers: https://github.com/mshtelma/torchdelta Ideally, we'd like to replicate this behaviour natively within the reader, without relying on an extra column.
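For illustration, splitting on an ID column can be sketched as a simple modulo filter. This is not taken from torchdelta; `rows_for_worker` and the `id` column name are hypothetical, and the sketch assumes each row carries a non-negative integer ID.

```python
from typing import Any, Dict, Iterable, Iterator


def rows_for_worker(
    rows: Iterable[Dict[str, Any]],
    id_column: str,
    num_workers: int,
    worker_index: int,
) -> Iterator[Dict[str, Any]]:
    """Yield only the rows whose ID maps to this worker (ID modulo num_workers)."""
    for row in rows:
        if row[id_column] % num_workers == worker_index:
            yield row


# Example with made-up rows: 10 rows split across 3 workers.
example_rows = [{"id": i, "text": f"sample {i}"} for i in range(10)]
worker_1_ids = [r["id"] for r in rows_for_worker(example_rows, "id", 3, 1)]
```

The drawback this request points at is visible here: the table needs a dedicated column, and unless the filter can be pushed down, each worker still scans rows it will discard.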

One naive solution would be to list the Parquet files and do the split manually. However, this likely won't survive protocol changes or support more advanced features like deletion vectors.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

dennyglee commented 1 year ago

Tagging myself @dennyglee