RuiFilipeCampos opened this issue 1 year ago
I've iterated this idea a bit more. I want this whole thing to be largely invisible and seamless, something you set up once and it just works - git lfs is a good example.
The idea is to write an extension to git.
It will allow you to mark a given Python file as an index to the dataset:

```bash
git datasets track index.py
```
This index.py file would have the schema class:
```python
@dataset(remote="awsbucket")
class YourDataset:
    name: str
    age: int
```
Ignoring large files for now, say you commit the index.py file
```bash
git commit index.py
```
Before actually committing it, the extension runs the script, and the script enforces the schema onto the SQLite database. Some fields may be deleted, some added.
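A minimal sketch of what that pre-commit step could look like, assuming the `@dataset` decorator simply collects the class annotations and the database is plain `sqlite3` (all names here are illustrative, not a finalized design):

```python
import sqlite3

# Hypothetical mapping from annotated Python types to SQLite column types.
TYPE_MAP = {str: "TEXT", int: "INTEGER", float: "REAL"}

def enforce_schema(cls, db_path: str = "dataset.sqlite") -> None:
    """Create the table if needed and add any columns that are new in the class."""
    columns = {name: TYPE_MAP[tp] for name, tp in cls.__annotations__.items()}
    con = sqlite3.connect(db_path)
    with con:
        con.execute(
            f"CREATE TABLE IF NOT EXISTS {cls.__name__} "
            f"({', '.join(f'{n} {t}' for n, t in columns.items())})"
        )
        existing = {row[1] for row in con.execute(f"PRAGMA table_info({cls.__name__})")}
        for name, sql_type in columns.items():
            if name not in existing:
                con.execute(f"ALTER TABLE {cls.__name__} ADD COLUMN {name} {sql_type}")
        # Dropping fields removed from the class would go here (needs SQLite >= 3.35, see the notes at the end).
    con.close()
```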
```bash
git push
```
Git push would push index.py to the repo and the SQLite database to AWS.
```bash
git pull
```
Git pull downloads the data from the remote.
Given the features that have already been described in the README, and extending this concept to also support files indexed in the SQL database, creating and transforming a dataset becomes a simple code change plus a git commit and git push.
Going back in the dataset's history would also be just a matter of checking out a previous commit or a different branch. Deduplication would be in place on the remote via a simple checksum mechanism.
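One way that checksum mechanism could work on the remote side - sketched here with boto3 and SHA-256; the bucket layout and function name are assumptions, not part of the proposal:

```python
import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def push_file(path: str, bucket: str) -> str:
    """Store a file under its content hash; pushing identical data twice is a no-op."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    key = f"objects/{digest}"
    try:
        s3.head_object(Bucket=bucket, Key=key)  # object already exists remotely
    except ClientError:
        s3.upload_file(path, bucket, key)
    return digest  # recorded in the SQLite row so `git pull` knows what to fetch
```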
The one thing that is missing here, is the ability to add and remove data. Brainstorming:
```python
@dataset(remote="bucket")
class YourDataset:
    image: File
    segmentation: File

    @ingest
    def get_dataset_from_web():
        ...  # get data from the web
        return list_image, list_segmentation
```
The promise of this whole thing is that, for every commit, the index.py file guarantees the state of the dataset it describes. When this is committed, the method is run. Deduplication is in place for every row, so the method can run twice and cause no harm, though it's advisable to remove it and commit again after the first run.
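A sketch of how that per-row deduplication could be implemented - here with a UNIQUE content hash per row and `INSERT OR IGNORE`; the table layout is just an assumption:

```python
import hashlib
import sqlite3

def insert_rows(db_path: str, rows: list[tuple[str, str]]) -> None:
    """Insert (image, segmentation) file pairs, silently skipping already-known rows."""
    con = sqlite3.connect(db_path)
    with con:
        con.execute(
            "CREATE TABLE IF NOT EXISTS YourDataset "
            "(row_hash TEXT UNIQUE, image TEXT, segmentation TEXT)"
        )
        for image, segmentation in rows:
            with open(image, "rb") as fi, open(segmentation, "rb") as fs:
                row_hash = hashlib.sha256(fi.read() + fs.read()).hexdigest()
            con.execute(
                "INSERT OR IGNORE INTO YourDataset VALUES (?, ?, ?)",
                (row_hash, image, segmentation),
            )
    con.close()
```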
- `git commit` --> `python index.py --pre-commit`
- `git pull` --> download the data, if data is missing
- `git push` --> `python index.py --push`
- `git checkout` --> `python index.py --post-checkout`
Having the script manipulate the data on each checkout might not be a good idea. So I'm thinking:
`git pull` --> `python index.py --pull`

for actually changing the data (deleting data, downloading data).
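The mappings above could be wired up with ordinary git hooks. A rough sketch of what `git datasets track` might install - the hook names are standard git, the flags are the ones proposed above:

```python
import stat
from pathlib import Path

# Standard git hooks delegating to the proposed index.py entry points.
HOOKS = {
    "pre-commit": "--pre-commit",
    "pre-push": "--push",
    "post-checkout": "--post-checkout",
}

def install_hooks(repo_root: str = ".") -> None:
    """Write small shell hooks that hand control over to index.py."""
    hooks_dir = Path(repo_root) / ".git" / "hooks"
    for hook_name, flag in HOOKS.items():
        hook_path = hooks_dir / hook_name
        hook_path.write_text(f"#!/bin/sh\npython index.py {flag}\n")
        hook_path.chmod(hook_path.stat().st_mode | stat.S_IEXEC)
```

One wrinkle: git has no dedicated pull hook, so the `--pull` step would likely have to hang off `post-merge` or a small wrapper command.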
Local duplication can be avoided by keeping files in a cache folder and symlinking them.
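A sketch of that cache-and-symlink idea (the cache location and helper name are made up for illustration): files live exactly once in a content-addressed cache, and the dataset folders only hold symlinks.

```python
import hashlib
import shutil
from pathlib import Path

CACHE = Path(".git-datasets-cache")  # hypothetical cache location

def add_to_cache(src: Path, dest_dir: Path) -> Path:
    """Copy src into the cache under its hash and expose it via a symlink in dest_dir."""
    CACHE.mkdir(exist_ok=True)
    dest_dir.mkdir(exist_ok=True)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    cached = CACHE / f"{digest}{src.suffix}"
    if not cached.exists():
        shutil.copy2(src, cached)
    link = dest_dir / src.name
    if not link.exists():
        link.symlink_to(cached.resolve())
    return link
```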
1. Create the `index.py` file and write:

```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    image: File[png | jpg]
    segmentation: File[png]
```
2. Track the dataset
```bash
git datasets track index.py
git commit index.py
```
This results in the creation of an SQLite database with the given schema, and since it has two file fields, two new folders:
```
| image/
| segmentation/
| dataset.sqlite
| index.py
```
Then, add an `@ingest` method to the schema:

```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    image: File[png | jpg]
    segmentation: File[png]

    @ingest
    def get_dataset_from_web():
        ...  # perform some web requests, save the files
        return list_of_image, list_of_segmentation
```
```bash
git commit index.py
```
This will execute the `get_dataset_from_web` method and back up the files in a cache.
Next, add a transformation:

```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    image: File[png | jpg]
    segmentation: File[png]

    def segmentation_resized(segmentation: File[png]) -> File[png]:
        ...
        return file
```
and then
```bash
git commit index.py
```
will alter the schema and perform the transformation, resulting in
```
| image/
| segmentation/
| segmentation_resized/
| dataset.sqlite
| index.py
```
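A rough idea of what the transformation pass could do under the hood - iterate the existing rows, materialise the new folder, and record the new column. Names are taken from the example above, and the file copy is only a placeholder for the user-defined resize:

```python
import shutil
import sqlite3
from pathlib import Path

def run_transformation(db_path: str = "dataset.sqlite") -> None:
    """Fill segmentation_resized/ from every existing segmentation file and record it."""
    con = sqlite3.connect(db_path)
    with con:
        con.execute("ALTER TABLE SegmentationDataset ADD COLUMN segmentation_resized TEXT")
        out_dir = Path("segmentation_resized")
        out_dir.mkdir(exist_ok=True)
        rows = con.execute("SELECT rowid, segmentation FROM SegmentationDataset").fetchall()
        for rowid, seg_path in rows:
            out_path = out_dir / Path(seg_path).name
            shutil.copy(seg_path, out_path)  # placeholder for segmentation_resized()
            con.execute(
                "UPDATE SegmentationDataset SET segmentation_resized = ? WHERE rowid = ?",
                (str(out_path), rowid),
            )
    con.close()
```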
Removing the `image` field,

```python
@dataset(remote="awsbucket")
class SegmentationDataset:
    segmentation: File[png]

    def segmentation_resized(segmentation: File[png]) -> File[png]:
        ...
        return file
```
and,
```bash
git commit index.py
```
results in
```
| segmentation/
| segmentation_resized/
| dataset.sqlite
| index.py
```
```bash
git reset --hard HEAD~1
```
resulting in
```
| image/
| segmentation/
| segmentation_resized/
| dataset.sqlite
| index.py
```
```bash
git push
```
The message here is: the only thing you truly need to care about is the `index.py` file. A committed `index.py` file is guaranteed to describe the current state of `dataset.sqlite` and the data folders.
high level stuff
database
Main use cases for this version:
points to think about/test when this version is finished
- how will deletion be handled (data deletion, file deletion)? ~~dvc~~ - will write my own versioning system
- ~~deleting columns from sqlite3 is not possible~~ - recent versions do, added apsw as a dependency
- ~~dvc might be a good choice here~~ - will write my own versioning system

key decisions

- ~~no dependencies, only python std lib - ease of adoption, lightweight implementation, control~~
- ~~only one dependency (apsw for a version of sqlite3 that supports dropping columns)~~
- two dependencies: apsw for sqlite3 with column deletion and dvc for version control (see the sketch below)

1.x.x
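For reference on the apsw decision above: dropping a column requires SQLite 3.35+, which is what the dependency buys. A minimal sketch, with the table and column names taken from the earlier example:

```python
import apsw

con = apsw.Connection("dataset.sqlite")
# e.g. when the `image` field disappears from the schema class between commits
con.cursor().execute("ALTER TABLE SegmentationDataset DROP COLUMN image")
con.close()
```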