RuiFilipeCampos / git-datasets

Declaratively create, transform, manage and version ML datasets.

v0.1.0-alpha - `git commit` #1

Open RuiFilipeCampos opened 1 year ago

RuiFilipeCampos commented 1 year ago

high level stuff


database


Main use cases for this version:

1. Creating the sqlite from scratch

```python
from datasets import dataset

@dataset(sql_file="dataset.sqlite")
class MyDataset:
    name: str
    age: int
```

2. Support for extending it

```python
@dataset(sql_file="dataset.sqlite")
class MyDataset:
    name: str
    age: int
    emotion: str
```

3. Support for removal

```python
@dataset(sql_file="dataset.sqlite")
class MyDataset:
    age: int
    emotion: str
```
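
A minimal sketch of how the `@dataset` decorator could enforce this, assuming the declared fields are diffed against the SQLite file at decoration time. The decorator body, the `dataset` table name, and the type map are illustrative, not the actual implementation (dropping columns also assumes SQLite 3.35+):

```python
import sqlite3

# Hypothetical sketch only: map the class annotations onto an SQLite table.
TYPE_MAP = {str: "TEXT", int: "INTEGER", float: "REAL"}

def dataset(sql_file: str):
    def decorator(cls):
        declared = {name: TYPE_MAP[tp] for name, tp in cls.__annotations__.items()}
        with sqlite3.connect(sql_file) as conn:
            # 1. Create the table from scratch if it does not exist yet.
            columns_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in declared.items())
            conn.execute(f"CREATE TABLE IF NOT EXISTS dataset ({columns_sql})")
            # 2. Diff the declared fields against the existing columns.
            existing = {row[1] for row in conn.execute("PRAGMA table_info(dataset)")}
            for name, sql_type in declared.items():
                if name not in existing:  # field was added to the class
                    conn.execute(f"ALTER TABLE dataset ADD COLUMN {name} {sql_type}")
            for name in existing - set(declared):  # field was removed from the class
                conn.execute(f"ALTER TABLE dataset DROP COLUMN {name}")
        return cls
    return decorator
```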

points to think about/test when this version is finished


key decisions

RuiFilipeCampos commented 1 year ago

I've iterated on this idea a bit more. I want this whole thing to be largely invisible and seamless, something you set up once and it just kind of works - git lfs is a good example.

The idea is to write an extension to git.

It will allow you to mark a given python file as an index to the dataset.

```bash
git datasets track index.py
```
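
One way the `track` command could wire this up is by installing git hooks that delegate to the tracked file. A sketch under that assumption (the hook names are real git hooks, everything else is illustrative):

```python
import os
import stat

# Hypothetical sketch: map git hooks onto flags handled by the tracked index file.
HOOKS = {
    "pre-commit": "--pre-commit",
    "pre-push": "--push",
    "post-checkout": "--post-checkout",
}

def track(index_file: str, git_dir: str = ".git") -> None:
    """Install hooks so that git events hand control to index.py."""
    for hook_name, flag in HOOKS.items():
        hook_path = os.path.join(git_dir, "hooks", hook_name)
        with open(hook_path, "w") as hook:
            hook.write(f"#!/bin/sh\npython {index_file} {flag}\n")
        # mark the hook as executable
        os.chmod(hook_path, os.stat(hook_path).st_mode | stat.S_IEXEC)
```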

This index.py file would have the schema class:

```python
@dataset(remote="awsbucket")
class YourDataset:
    name: str
    age: int
```

Ignoring large files for now, say you commit the index.py file:

```bash
git commit index.py
```

Before the commit actually happens, this runs the script, and the script enforces the schema onto the SQLite database: some fields may be deleted, some added.

```bash
git push
```

Git push would push index.py to the repo and the SQLite database to AWS.

```bash
git pull
```

Git pull downloads the data from the remote.

Given the features already described in the readme, and extending this concept to also cover files indexed in the SQL database, creating and transforming a dataset becomes a simple code change plus a `git commit` and a `git push`.

Going back to an earlier version of the dataset would also just be a matter of checking out a previous commit, or a different branch. Deduplication would be in place on the remote via a simple checksum mechanism.
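
A sketch of that checksum mechanism, assuming an S3-style remote where every file is stored under its content hash - the bucket name, the hash choice, and the boto3 usage are illustrative:

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "awsbucket"  # placeholder remote, as in the example above

def checksum(path: str) -> str:
    """Content hash of a file; identical files map to the same remote key."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def push_file(path: str) -> str:
    """Upload a file only if the remote does not already hold its content."""
    key = checksum(path)
    try:
        s3.head_object(Bucket=BUCKET, Key=key)  # content already deduplicated remotely
    except ClientError:
        s3.upload_file(path, BUCKET, key)
    return key
```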

The one thing that is missing here is the ability to add and remove data. Brainstorming:

```python
@dataset(remote="bucket")
class YourDataset:
    image: File
    segmentation: File

    @ingest
    def get_dataset_from_web():
        ... # get data from the web
        return list_image, list_segmentation
```

The promise of this whole thing is that for every commit, the index.py file guarantees the state of the dataset it describes. When this is committed, the method is run. Deduplication is in place for every row, so running the method twice causes no harm, though it's advisable to remove it and commit again after the first commit.
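
A sketch of how that per-row deduplication could make `@ingest` idempotent - the table layout, the hash, and the helper name are assumptions for illustration:

```python
import hashlib
import sqlite3

def ingest_rows(sql_file: str, rows: list[tuple[str, str]]) -> None:
    """Insert rows keyed by a content hash, so re-running an ingest is harmless."""
    with sqlite3.connect(sql_file) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS dataset "
            "(row_hash TEXT PRIMARY KEY, image TEXT, segmentation TEXT)"
        )
        for image, segmentation in rows:
            row_hash = hashlib.sha256(f"{image}{segmentation}".encode()).hexdigest()
            # INSERT OR IGNORE: a row that is already present is silently skipped
            conn.execute(
                "INSERT OR IGNORE INTO dataset (row_hash, image, segmentation) VALUES (?, ?, ?)",
                (row_hash, image, segmentation),
            )
```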

RuiFilipeCampos commented 1 year ago
`git commit` --> `python index.py --pre-commit`

  1. Ensure data integrity by going over the checksums; may ask for `git pull` if data is missing
  2. Enforce the schema by:
     1. Creating the db if it's not there
     2. Adding columns
     3. Removing columns
  3. Transform data if any is required
  4. Move all new files to cache

`git push` --> `python index.py --push`

  1. Upload all files in cache

`git checkout` --> `python index.py --post-checkout`

Having the script manipulate the data on each checkout might not be a good idea. So I'm thinking:

`git pull` --> `python index.py --pull`

For actually changing the data (deleting data, downloading data).
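
A minimal sketch of how index.py could dispatch on those flags, using argparse; the handler names below are placeholders for the steps listed above, not actual functions of the project:

```python
import argparse

# Placeholder handlers: names are illustrative, the real work is described above.
def verify_checksums() -> None: ...
def enforce_schema() -> None: ...
def run_transforms() -> None: ...
def move_new_files_to_cache() -> None: ...
def upload_cache() -> None: ...
def download_missing_data() -> None: ...

def main() -> None:
    parser = argparse.ArgumentParser(description="git-datasets hook entry point")
    parser.add_argument("--pre-commit", action="store_true")
    parser.add_argument("--push", action="store_true")
    parser.add_argument("--pull", action="store_true")
    args = parser.parse_args()

    if args.pre_commit:
        verify_checksums()         # 1. integrity check, may ask for `git pull`
        enforce_schema()           # 2. create db / add columns / remove columns
        run_transforms()           # 3. transform data if any is required
        move_new_files_to_cache()  # 4. move all new files to cache
    elif args.push:
        upload_cache()             # upload all files in cache
    elif args.pull:
        download_missing_data()    # actually change local data

if __name__ == "__main__":
    main()
```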

Local duplication can be avoided by keeping files in a cache folder and symlinking them.
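
A sketch of that caching idea, where the cache is keyed by content hash and the path in the working tree is just a symlink - the cache location and helper name are assumptions:

```python
import hashlib
import os
import shutil

CACHE_DIR = ".git/datasets-cache"  # illustrative location for the local cache

def cache_and_link(path: str) -> None:
    """Move a file into the cache under its content hash and symlink it back."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(CACHE_DIR, exist_ok=True)
    cached_path = os.path.join(CACHE_DIR, digest)
    if os.path.exists(cached_path):
        os.remove(path)  # identical content is already cached locally
    else:
        shutil.move(path, cached_path)
    os.symlink(os.path.abspath(cached_path), path)
```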

RuiFilipeCampos commented 1 year ago
1. Create an index.py file and write:

```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    image: File[png | jpg]
    segmentation: File[png]
```


2. Track the dataset

```bash
git datasets track index.py
```

3. Commit the file

```bash
git commit index.py
```

This results in the creation of an SQLite database with the given schema, and since the schema has two file fields, two new folders:

```
| image/
| segmentation/
| dataset.sqlite
| index.py
```

4. Write an ingest method to get data into the dataset:
```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    image: File[png | jpg]
    segmentation: File[png]

    @ingest
    def get_dataset_from_web():
        ... # perform some web requests, save the files
        return list_of_image, list_of_segmentation
```

5. Commit the changes

```bash
git commit index.py
```

This will execute the get_dataset_from_web method and back up the files in a cache.

6. To perform transformations on the dataset:

```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    image: File[png | jpg]
    segmentation: File[png]

    def segmentation_resized(segmentation: File[png]) -> File[png]:
        ...
        return file
```

and then

```bash
git commit index.py
```

will alter the schema and perform the transformation (a sketch of how this could work is at the end of this comment), resulting in

```
| image/
| segmentation/
| segmentation_resized/
| dataset.sqlite
| index.py
```

7. You can also remove fields:
```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    segmentation: File[png]

    def segmentation_resized(segmentation: File[png]) -> File[png]:
        ...
        return file
```

and

```bash
git commit index.py
```

results in

```
| segmentation/
| segmentation_resized/
| dataset.sqlite
| index.py
```

8. But of course, you can go back:

```bash
git reset --hard HEAD~1
```

resulting in

```
| image/
| segmentation/
| segmentation_resized/
| dataset.sqlite
| index.py
```

9. Save the changes as you normally do, by pushing the commits:

```bash
git push
```

The message here is: the only thing you truly need to care about is the index.py file. A committed index.py file is guaranteed to describe the current state of dataset.sqlite and the data folders.
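
As referenced in step 6, here is a minimal sketch of how a transform such as segmentation_resized could be applied at commit time - the column discovery, folder layout, and signature handling are assumptions for illustration, not the actual implementation:

```python
import inspect
import os
import sqlite3

def apply_transforms(sql_file: str, cls) -> None:
    """Run any method whose parameters name existing columns and store the result
    as a new column and folder named after the method. Sketch only."""
    with sqlite3.connect(sql_file) as conn:
        for name, fn in inspect.getmembers(cls, inspect.isfunction):
            params = list(inspect.signature(fn).parameters)
            existing = {row[1] for row in conn.execute("PRAGMA table_info(dataset)")}
            if not params or name in existing or not set(params) <= existing:
                continue  # already applied, or the inputs are not dataset columns
            conn.execute(f"ALTER TABLE dataset ADD COLUMN {name} TEXT")
            os.makedirs(name, exist_ok=True)  # e.g. segmentation_resized/
            columns = ", ".join(params)
            rows = conn.execute(f"SELECT rowid, {columns} FROM dataset").fetchall()
            for rowid, *inputs in rows:
                output_path = fn(*inputs)  # produces a new file, e.g. a resized png
                conn.execute(
                    f"UPDATE dataset SET {name} = ? WHERE rowid = ?",
                    (output_path, rowid),
                )
```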