apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[EPIC] Gravitino Datasets library #4104

Open jiwq opened 2 months ago

jiwq commented 2 months ago

What is a Dataset?

Describe the proposal

Datasets is a library for easily accessing and sharing tabular structured data, as well as datasets for non-tabular audio, computer vision, and natural language processing (NLP) tasks.

For training a deep learning model, the dataset may be split into train and test subsets. In general, the train subset is used in the training stage and the test subset is used in the evaluation stage.

1.1 Dataset Object

There are two types of dataset objects: a regular Dataset and an IterableDataset. A Dataset provides fast random access to the rows, plus memory mapping, so that loading even large datasets uses only a relatively small amount of device memory. But for very large datasets that won't fit on disk or in memory, an IterableDataset lets you access and use the dataset without waiting for it to download completely.
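The two access patterns can be sketched with minimal stand-in classes (hypothetical names, not the actual Gravitino Datasets API; the semantics mirror common dataset libraries):

```python
class Dataset:
    """Map-style dataset: fast random access to rows by index."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        return self._rows[idx]


class IterableDataset:
    """Stream-style dataset: rows are produced lazily, so the full
    dataset never has to be materialized on disk or in memory."""

    def __init__(self, row_factory):
        self._row_factory = row_factory  # callable returning a fresh iterator

    def __iter__(self):
        return self._row_factory()


ds = Dataset([{"text": "a"}, {"text": "b"}, {"text": "c"}])
print(ds[1])  # random access: {'text': 'b'}

stream = IterableDataset(lambda: ({"text": t} for t in "abc"))
print(next(iter(stream)))  # sequential access only: {'text': 'a'}
```

A map-style Dataset supports shuffling and slicing by index, while an IterableDataset only supports forward iteration, which is why the streaming case trades flexibility for unbounded size.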

A split dataset is represented as a dictionary, where the key is the split name and the value is the corresponding Dataset object.

1.2 Split

As described above, datasets are typically split into different sub-datasets used at various stages of model training, such as training, testing, and evaluation.
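The split mapping described above can be sketched as a plain dictionary keyed by split name (a minimal illustration; function and split names here are assumptions, not the proposed API):

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle rows deterministically and split them into the
    {split_name: rows} mapping described in section 1.2."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return {"train": rows[:cut], "test": rows[cut:]}


splits = train_test_split(range(10))
print(sorted(splits))                             # ['test', 'train']
print(len(splits["train"]), len(splits["test"]))  # 8 2
```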

2. Create Dataset

Before supporting these features, Gravitino should support the meta management for model training and access control features. The following feature design is based on the above assumptions.

image

3. Load Dataset

Wherever a dataset is stored, the Gravitino Datasets library should help the user load it from Apache Gravitino. So we propose the architecture for loading datasets in the Gravitino Datasets library as outlined below:

image

3.1 Catalog

Loading a dataset from Gravitino should use the granted token. The Gravitino Datasets library fetches the metadata from Gravitino and generates the sub-dataset for the user.
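The loading flow can be sketched with a stubbed client standing in for the real Gravitino REST API (every class, method, and URI here is hypothetical; the point is only the token-then-metadata-then-location sequence):

```python
class StubGravitinoClient:
    """Stand-in for a Gravitino client; the real API will differ."""

    def __init__(self, uri, token):
        self._uri = uri
        self._token = token  # granted token used for access control

    def load_dataset_metadata(self, name):
        # In reality this would be an authenticated metadata lookup
        # against the Gravitino server at self._uri.
        return {
            "name": name,
            "splits": {
                "train": "gvfs://fileset/demo/train",
                "test": "gvfs://fileset/demo/test",
            },
        }


def load_dataset(name, uri, token, split="train"):
    client = StubGravitinoClient(uri, token)
    meta = client.load_dataset_metadata(name)
    # The library would then open this location and build the
    # Dataset / IterableDataset object for the requested split.
    return meta["splits"][split]


location = load_dataset("demo", "http://localhost:8090", token="t0ken")
print(location)  # gvfs://fileset/demo/train
```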

image

Design Document

  1. https://docs.google.com/document/d/1_gMfkiwc4T56xtE0ZRpla_yD09hqf2MSHKAsWbK-eSc/edit
  2. https://docs.google.com/document/d/1NdHc52U6tW9acHNWOfGiCEr08XO6VlcHf1q-n8mD60w/edit

Task list

zuston commented 2 months ago

This looks like a good attempt to bring AI features into the scope of Gravitino management. I'm not sure whether features used in realtime / offline batch / one-time inference should be involved in this design, as a feature store does?

I lack some background knowledge about this design, so feel free to point out if I'm wrong.

jiwq commented 1 month ago

> This looks like a good attempt to bring AI features into the scope of Gravitino management. I'm not sure whether features used in realtime / offline batch / one-time inference should be involved in this design, as a feature store does?
>
> I lack some background knowledge about this design, so feel free to point out if I'm wrong.

@zuston Sorry for the late reply. We store the datasets, not the features; we discussed this many times in offline meetings. We will start coding and provide a POC ASAP, so that everyone can understand it better.

iodone commented 1 month ago

@jiwq Have you considered using fsspec with https://github.com/fsspec/opendalfs to implement the various storage APIs, i.e., adding an OpenDAL layer:

image
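The layering idea behind that suggestion can be sketched with stdlib-only stand-ins: one unified access layer dispatches on the URI scheme to per-backend implementations, which is roughly what opendalfs would provide for fsspec over OpenDAL's real backends (all classes below are hypothetical illustrations, not opendalfs APIs):

```python
class MemoryBackend:
    """Toy backend holding files in a dict; a real layer would back
    this with s3, oss, hdfs, etc. via OpenDAL."""

    def __init__(self, files):
        self._files = dict(files)

    def read(self, path):
        return self._files[path]


class UnifiedStorageLayer:
    """Single read API in front of several storage backends,
    dispatching on the URI scheme."""

    def __init__(self):
        self._backends = {}

    def register(self, scheme, backend):
        self._backends[scheme] = backend

    def read(self, uri):
        scheme, _, path = uri.partition("://")
        return self._backends[scheme].read(path)


layer = UnifiedStorageLayer()
layer.register("mem", MemoryBackend({"data/train.csv": b"a,b\n1,2\n"}))
print(layer.read("mem://data/train.csv"))  # b'a,b\n1,2\n'
```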