A2-ai / dvs

Use

dvs (data versioning system) is a file linker that allows teams to version files under Git without directly tracking them.

This R package allows teams to collaborate without uploading large or sensitive files to Git.

How it works

Instead of uploading data files to Git, a user runs dvs, which copies the files to a shared storage directory and generates a metadata file for each one. The user commits these metadata files to Git so that collaborators can retrieve the versioned files.

dvs also generates a .gitignore in the immediate directory of each versioned file. It excludes the versioned file from Git and keeps the corresponding metadata file tracked.

When collaborators pull from Git, they can use dvs to parse the metadata files, locate each corresponding copy in the storage directory, and copy it back to the project directory.

Initialization generates a dvs.yaml file in the project directory. dvs reads the location of the storage directory from this file.

A .dvs metadata file is generated for each versioned file, in that file's directory. The metadata file contains a hash of the versioned file's contents, computed with the blake3 algorithm. This hash is used both to track the most current version of the file and to build the path of the file's copy in the storage directory.
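
The mechanism described above (hash the contents, copy to a hash-derived path, record the hash in a metadata file, then reverse the process) can be illustrated with a minimal base R sketch. This is not the dvs implementation: `sketch_add`, `sketch_get`, the `.meta` suffix, and the storage layout are hypothetical, and MD5 from `tools::md5sum` stands in for blake3 only because it ships with R.

```r
# Conceptual sketch only: NOT the dvs implementation.
# MD5 (tools::md5sum, shipped with R) stands in for blake3, and the
# storage layout and metadata format below are assumptions.

# Copy a file into storage under a path derived from its content hash,
# then write a small metadata file next to the original.
sketch_add <- function(file, storage_dir) {
  hash <- unname(tools::md5sum(file))
  # Hypothetical layout: <storage>/<first 2 hash chars>/<remaining chars>
  dest <- file.path(storage_dir, substr(hash, 1, 2), substring(hash, 3))
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  file.copy(file, dest, overwrite = TRUE)
  writeLines(hash, paste0(file, ".meta"))  # hypothetical metadata file
  invisible(dest)
}

# Read the hash from the metadata file, rebuild the storage path,
# and copy the stored version back into the project.
sketch_get <- function(file, storage_dir) {
  hash <- readLines(paste0(file, ".meta"), n = 1)
  src <- file.path(storage_dir, substr(hash, 1, 2), substring(hash, 3))
  file.copy(src, file, overwrite = TRUE)
}

# Usage: round-trip a file through a temporary "storage directory".
project <- file.path(tempdir(), "project")
storage <- file.path(tempdir(), "storage")
dir.create(project, showWarnings = FALSE)
dir.create(storage, showWarnings = FALSE)

data_file <- file.path(project, "data.csv")
writeLines(c("id,value", "1,10"), data_file)

stored_copy <- sketch_add(data_file, storage)
original <- readLines(data_file)
file.remove(data_file)            # simulate a fresh clone without the data
sketch_get(data_file, storage)
restored <- readLines(data_file)  # same lines as `original`
```

Because the storage path depends only on the file's contents, only the small metadata file needs to travel through Git; anyone with access to the storage directory can rebuild the path from the recorded hash.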

Tutorial

See a detailed tutorial here.

Example Workflow

To add files to dvs:

Step 1: Initialize with dvs_init to set an accessible storage directory outside the Git repository.

dvs_init("/data/dvs/storage_directory")

Output: a data frame (screenshot not reproduced here).

Step 2: Add files to the storage directory with dvs_add.

dvs_add("data.csv")

Output: a data frame (screenshot not reproduced here).

Step 3: Push to Git.
To get files from dvs:

Step 1: Pull from Git.

Step 2: Generate a report with dvs_status to view versioned files.

dvs_status()

Output: a data frame (screenshot not reproduced here).

Step 3: Get files from the storage directory with dvs_get.

dvs_get("data.csv")

Output: a data frame (screenshot not reproduced here).