RuiFilipeCampos / git-datasets

Declaratively create, transform, manage and version ML datasets.

v0.1.0-alpha - `git commit` #1

Open RuiFilipeCampos opened 1 year ago

RuiFilipeCampos commented 1 year ago

high level stuff


database


Main use cases for this version:

1. Creating the sqlite from scratch

```python
from datasets import dataset

@dataset(sql_file="dataset.sqlite")
class MyDataset:
    name: str
    age: int
```

2. Support for extending it

```python
@dataset(sql_file="dataset.sqlite")
class MyDataset:
    name: str
    age: int
    emotion: str
```

3. Support for removal

```python
@dataset(sql_file="dataset.sqlite")
class MyDataset:
    age: int
    emotion: str
```
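
A minimal sketch of how the `@dataset` decorator could enforce this, assuming the declared fields are diffed against the SQLite file at decoration time. The decorator body, the `dataset` table name, and the type map are illustrative, not the actual implementation (dropping columns also assumes SQLite 3.35+):

```python
import sqlite3

# Hypothetical sketch only: map the class annotations onto an SQLite table.
TYPE_MAP = {str: "TEXT", int: "INTEGER", float: "REAL"}

def dataset(sql_file: str):
    def decorator(cls):
        declared = {name: TYPE_MAP[tp] for name, tp in cls.__annotations__.items()}
        with sqlite3.connect(sql_file) as conn:
            # 1. Create the table from scratch if it does not exist yet.
            columns_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in declared.items())
            conn.execute(f"CREATE TABLE IF NOT EXISTS dataset ({columns_sql})")
            # 2. Diff the declared fields against the existing columns.
            existing = {row[1] for row in conn.execute("PRAGMA table_info(dataset)")}
            for name, sql_type in declared.items():
                if name not in existing:  # field was added to the class
                    conn.execute(f"ALTER TABLE dataset ADD COLUMN {name} {sql_type}")
            for name in existing - set(declared):  # field was removed from the class
                conn.execute(f"ALTER TABLE dataset DROP COLUMN {name}")
        return cls
    return decorator
```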

points to think about/test when this version is finished


key decisions

RuiFilipeCampos commented 1 year ago

I've iterated on this idea a bit more. I want this whole thing to be largely invisible and seamless, something you set up once and it just kind of works - git lfs is a good example.

The idea is to write an extension to git.

It will allow you to mark a given python file as an index to the dataset.

```bash
git datasets track index.py
```
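
One way the `track` command could wire this up is by installing git hooks that delegate to the tracked file. A sketch under that assumption (the hook names are real git hooks, everything else is illustrative):

```python
import os
import stat

# Hypothetical sketch: map git hooks onto flags handled by the tracked index file.
HOOKS = {
    "pre-commit": "--pre-commit",
    "pre-push": "--push",
    "post-checkout": "--post-checkout",
}

def track(index_file: str, git_dir: str = ".git") -> None:
    """Install hooks so that git events hand control to index.py."""
    for hook_name, flag in HOOKS.items():
        hook_path = os.path.join(git_dir, "hooks", hook_name)
        with open(hook_path, "w") as hook:
            hook.write(f"#!/bin/sh\npython {index_file} {flag}\n")
        # mark the hook as executable
        os.chmod(hook_path, os.stat(hook_path).st_mode | stat.S_IEXEC)
```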

This index.py file would have the schema class:

```python
@dataset(remote="awsbucket")
class YourDataset:
    name: str
    age: int
```

Ignoring large files for now, say you commit the index.py file:

```bash
git commit index.py
```

Before the commit actually happens, this runs the script, and the script enforces the schema onto the SQLite database: some fields may be deleted, some added.

```bash
git push
```

Git push would push index.py to the repo and the SQLite database to AWS.

```bash
git pull
```

Git pull downloads the data from the remote.

Given the features already described in the readme, and extending this concept to also cover files indexed in the SQL database, creating and transforming a dataset becomes a simple code change plus a `git commit` and a `git push`.

Going back to an earlier version of the dataset would also just be a matter of checking out a previous commit, or a different branch. Deduplication would be in place on the remote via a simple checksum mechanism.
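
A sketch of that checksum mechanism, assuming an S3-style remote where every file is stored under its content hash - the bucket name, the hash choice, and the boto3 usage are illustrative:

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "awsbucket"  # placeholder remote, as in the example above

def checksum(path: str) -> str:
    """Content hash of a file; identical files map to the same remote key."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def push_file(path: str) -> str:
    """Upload a file only if the remote does not already hold its content."""
    key = checksum(path)
    try:
        s3.head_object(Bucket=BUCKET, Key=key)  # content already deduplicated remotely
    except ClientError:
        s3.upload_file(path, BUCKET, key)
    return key
```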

The one thing that is missing here is the ability to add and remove data. Brainstorming:

```python
@dataset(remote="bucket")
class YourDataset:
    image: File
    segmentation: File

    @ingest
    def get_dataset_from_web():
        ... # get data from the web
        return list_image, list_segmentation
```

The promise of this whole thing is that for every commit, the index.py file guarantees the state of the dataset it describes. When this is committed, the method is run. Deduplication is in place for every row, so running the method twice causes no harm, though it's advisable to remove it and commit again after the first commit.
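
A sketch of how that per-row deduplication could make `@ingest` idempotent - the table layout, the hash, and the helper name are assumptions for illustration:

```python
import hashlib
import sqlite3

def ingest_rows(sql_file: str, rows: list[tuple[str, str]]) -> None:
    """Insert rows keyed by a content hash, so re-running an ingest is harmless."""
    with sqlite3.connect(sql_file) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS dataset "
            "(row_hash TEXT PRIMARY KEY, image TEXT, segmentation TEXT)"
        )
        for image, segmentation in rows:
            row_hash = hashlib.sha256(f"{image}{segmentation}".encode()).hexdigest()
            # INSERT OR IGNORE: a row that is already present is silently skipped
            conn.execute(
                "INSERT OR IGNORE INTO dataset (row_hash, image, segmentation) VALUES (?, ?, ?)",
                (row_hash, image, segmentation),
            )
```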

RuiFilipeCampos commented 1 year ago
`git commit` --> `python index.py --pre-commit`

  1. Ensure data integrity by going over the checksums; may ask for `git pull` if data is missing
  2. Enforce the schema by:
     1. Creating the db if it's not there
     2. Adding columns
     3. Removing columns
  3. Transform data if any is required
  4. Move all new files to cache

`git push` --> `python index.py --push`

  1. Upload all files in cache

`git checkout` --> `python index.py --post-checkout`

Having the script manipulate the data on each checkout might not be a good idea. So I'm thinking:

`git pull` --> `python index.py --pull`

For actually changing the data (deleting data, downloading data).
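
A minimal sketch of how index.py could dispatch on those flags, using argparse; the handler names below are placeholders for the steps listed above, not actual functions of the project:

```python
import argparse

# Placeholder handlers: names are illustrative, the real work is described above.
def verify_checksums() -> None: ...
def enforce_schema() -> None: ...
def run_transforms() -> None: ...
def move_new_files_to_cache() -> None: ...
def upload_cache() -> None: ...
def download_missing_data() -> None: ...

def main() -> None:
    parser = argparse.ArgumentParser(description="git-datasets hook entry point")
    parser.add_argument("--pre-commit", action="store_true")
    parser.add_argument("--push", action="store_true")
    parser.add_argument("--pull", action="store_true")
    args = parser.parse_args()

    if args.pre_commit:
        verify_checksums()         # 1. integrity check, may ask for `git pull`
        enforce_schema()           # 2. create db / add columns / remove columns
        run_transforms()           # 3. transform data if any is required
        move_new_files_to_cache()  # 4. move all new files to cache
    elif args.push:
        upload_cache()             # upload all files in cache
    elif args.pull:
        download_missing_data()    # actually change local data

if __name__ == "__main__":
    main()
```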

Local duplication can be avoided by keeping files in a cache folder and symlinking them.
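
A sketch of that caching idea, where the cache is keyed by content hash and the path in the working tree is just a symlink - the cache location and helper name are assumptions:

```python
import hashlib
import os
import shutil

CACHE_DIR = ".git/datasets-cache"  # illustrative location for the local cache

def cache_and_link(path: str) -> None:
    """Move a file into the cache under its content hash and symlink it back."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(CACHE_DIR, exist_ok=True)
    cached_path = os.path.join(CACHE_DIR, digest)
    if os.path.exists(cached_path):
        os.remove(path)  # identical content is already cached locally
    else:
        shutil.move(path, cached_path)
    os.symlink(os.path.abspath(cached_path), path)
```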

RuiFilipeCampos commented 1 year ago
1. Create an index.py file and write:

```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    image: File[png | jpg]
    segmentation: File[png]
```


2. Track the dataset

```bash
git datasets track index.py
```

3. Commit the file

```bash
git commit index.py
```

This results in the creation of an SQLite database with the given schema, and since the schema has two file fields, two new folders:

```
| image/
| segmentation/
| dataset.sqlite
| index.py
```

4. Write an ingest method to get data into the dataset:
```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    image: File[png | jpg]
    segmentation: File[png]

    @ingest
    def get_dataset_from_web():
        ... # perform some web requests, save the files
        return list_of_image, list_of_segmentation
```

5. Commit the changes

```bash
git commit index.py
```

This will execute the get_dataset_from_web method and back up the files in a cache.

6. To perform transformations on the dataset:

```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    image: File[png | jpg]
    segmentation: File[png]

    def segmentation_resized(segmentation: File[png]) -> File[png]:
        ...
        return file
```

and then

```bash
git commit index.py
```

will alter the schema and perform the transformation (a sketch of how this could work is at the end of this comment), resulting in

```
| image/
| segmentation/
| segmentation_resized/
| dataset.sqlite
| index.py
```

7. You can also remove fields:
```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    segmentation: File[png]

    def segmentation_resized(segmentation: File[png]) -> File[png]:
        ...
        return file
```

and

```bash
git commit index.py
```

results in

```
| segmentation/
| segmentation_resized/
| dataset.sqlite
| index.py
```

8. But of course, you can go back:

```bash
git reset --hard HEAD~1
```

resulting in

```
| image/
| segmentation/
| segmentation_resized/
| dataset.sqlite
| index.py
```

9. Save the changes as you normally do, by pushing the commits:

```bash
git push
```

The message here is: the only thing you truly need to care about is the index.py file. A committed index.py file is guaranteed to describe the current state of dataset.sqlite and the data folders.
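
As referenced in step 6, here is a minimal sketch of how a transform such as segmentation_resized could be applied at commit time - the column discovery, folder layout, and signature handling are assumptions for illustration, not the actual implementation:

```python
import inspect
import os
import sqlite3

def apply_transforms(sql_file: str, cls) -> None:
    """Run any method whose parameters name existing columns and store the result
    as a new column and folder named after the method. Sketch only."""
    with sqlite3.connect(sql_file) as conn:
        for name, fn in inspect.getmembers(cls, inspect.isfunction):
            params = list(inspect.signature(fn).parameters)
            existing = {row[1] for row in conn.execute("PRAGMA table_info(dataset)")}
            if not params or name in existing or not set(params) <= existing:
                continue  # already applied, or the inputs are not dataset columns
            conn.execute(f"ALTER TABLE dataset ADD COLUMN {name} TEXT")
            os.makedirs(name, exist_ok=True)  # e.g. segmentation_resized/
            columns = ", ".join(params)
            rows = conn.execute(f"SELECT rowid, {columns} FROM dataset").fetchall()
            for rowid, *inputs in rows:
                output_path = fn(*inputs)  # produces a new file, e.g. a resized png
                conn.execute(
                    f"UPDATE dataset SET {name} = ? WHERE rowid = ?",
                    (output_path, rowid),
                )
```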