
[Question] Deduplicate/compress data across datasets in a RIA store #46

Open mlell opened 1 year ago

mlell commented 1 year ago

What is the problem?

My datasets include the complete computation environment, that is, a container and the R package library, because re-installing the packages every time I clone a dataset takes a long time. However, the package library is about 500 MB and is largely (but not completely) the same across datasets. I know from your help in the chat that a RIA store can access files in annex/objects/(hash1)/(hash2)/ as well as in archives/archive.7z/(hash1)/(hash2)/. This allows deduplication of data within a dataset.
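For reference, this is the layout I mean (a sketch; the store path and hash directories are placeholders):

<store>/<dataset-id>/
  annex/objects/(hash1)/(hash2)/...   # individual annexed files
  archives/archive.7z                 # packs the same (hash1)/(hash2)/ paths, deduplicated by 7z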

Is there a way to deduplicate across datasets? For example, I know that git-annex supports the bup special remote, which saves files in a dedicated git repository, but first splits each file into small chunks that it connects via a git tree (a directory, if checked out). Is it possible (and advisable) to use a bup special remote with a RIA store, or is there a simpler solution (because with bup, the safety of the files would rely on yet another program)?
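For illustration, initializing such a bup special remote would look roughly like this (the repository path is a placeholder; parameters as in the git-annex bup documentation):

$ git annex initremote mybup type=bup encryption=none buprepo=server:/path/to/bup-repo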

What steps will reproduce the problem?

No response

DataLad information

No response

Additional context

No response

Have you had any success using DataLad before?

I can manage my results quite nicely with it, even though I am always scared that I might lose files because I do not understand the complexity brought in by git-annex...

mlell commented 1 year ago

I learned from the documentation that a RIA upload works by chaining two remotes together via a publication dependency: remote 1 is a normal git push to a bare git repository, and remote 2, which stores the large files, is a git-annex special remote of type "ora", an extension to git-annex provided by DataLad. Therefore I figured that if I created an ORA special remote manually, I should be able to set it manually as a publication dependency of a RIA remote created via create-sibling-ria --no-storage-sibling. I could then simply use the same ORA remote for all my datasets and voilà, file-level deduplication.
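A sketch of what I had in mind (the sibling name and store URL are placeholders; this assumes an ORA remote named "ora-storage" already exists, which is exactly the step that fails below):

$ datalad create-sibling-ria -s ria --no-storage-sibling "ria+ssh://server/path/to/store"
$ datalad siblings configure -s ria --publish-depends ora-storage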

However, I cannot create such a remote. In DataLad I did not find a command to do so (the only command with "ora" in its name is export-archive-ora), and with git-annex I got

$ git annex initremote test type=ora
git-annex: Unknown remote type ora (pick from: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external)

But I checked using which: the command git-annex-remote-ora is in $PATH, and I can also call it by its name alone:

$ git-annex-remote-ora
VERSION 1
^C
Traceback (most recent call last):
....
KeyboardInterrupt

yarikoptic commented 1 year ago

FWIW, for a collection of code and/or containers which you reuse across different "data datasets", you can create a dedicated repository/dataset with those and include it as a subdataset within your "data datasets". That is e.g. how we do/recommend it with the use of https://github.com/ReproNim/containers .
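A minimal sketch of that approach (the URL and paths are placeholders):

# a dedicated dataset holds the container and the R library
$ datalad create env
# register it as a subdataset in each "data dataset"
$ cd my-data-dataset
$ datalad clone -d . https://example.com/env code/env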

As for initializing an existing external ORA remote, try type=external externaltype=ora. But I guess you might also want/need to specify its UUID. We do something like that for sharing an rclone-based Dropbox remote. Here is the prototype bash script: https://github.com/dandi/dandisets/blob/10225719f0bf79758f4ce629ef23d098cf01380c/tools/cfg_dandi_backup#L11 and then the code uses the spec picked up from this YAML there: https://github.com/dandi/dandisets/blob/ad539eb9c827400a332c7619ec6eaf894f65d8aa/tools/backups2datalad.cfg.yaml#L5 .
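An untested sketch of that initremote call (the url= parameter mirrors what create-sibling-ria configures for ORA remotes; the store URL is a placeholder, and pinning the UUID is left out here, see the linked script):

$ git annex initremote ora-storage type=external externaltype=ora \
      encryption=none url="ria+ssh://server/path/to/store"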