Open mlell opened 1 year ago
I learned from the documentation that the RIA upload works by chaining two remotes together via a publication dependency. Remote 1 is a normal git push to a bare git repo, and Remote 2, which stores the large files, is a git-annex special remote of type "ora", a git-annex extension provided by DataLad. Therefore I figured that if I created an ORA special remote manually, I should be able to set it manually as a publication dependency for a RIA sibling created via `create-sibling-ria --no-storage-sibling`. I could then simply use the same ORA remote for all my datasets and voilà, file-level deduplication.
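A minimal sketch of the intended setup (store URL and sibling names are illustrative; the manual ORA step is the part that fails below):

```
# create only the git sibling, without the ORA storage sibling
datalad create-sibling-ria -s ria --no-storage-sibling "ria+ssh://server/path/to/store"
# ...then manually initialize an ORA special remote pointing at the same
# store (the step I cannot get to work, see below) and chain them:
datalad siblings configure -s ria --publish-depends ora-storage
```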
However, I cannot create such a remote. In DataLad I did not find a command to do so (the only command with `ora` in its name is `export-archive-ora`), and with git-annex I got:
```
$ git annex initremote test type=ora
git-annex: Unknown remote type ora (pick from: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external)
```
But I checked using `which`: the command `git-annex-remote-ora` is in the `$PATH`, and I can also call it by its name alone:
```
$ git-annex-remote-ora
VERSION 1
^C
Traceback (most recent call last):
....
KeyboardInterrupt
```
FWIW, for a collection of code and/or containers that you reuse across different "data datasets", you can create a dedicated repository/dataset with those and include it as a subdataset within your "data datasets". That is, e.g., how we do it and recommend doing it with https://github.com/ReproNim/containers (see the sketch below).
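A minimal sketch of that approach (the target path is just an example):

```
# register the containers collection as a subdataset of the current dataset
datalad clone -d . https://github.com/ReproNim/containers code/containers
```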
As for initializing an existing external ORA remote, try `type=external externaltype=ora`. But I guess you might also want/need to specify its UUID. We do something like that for sharing an rclone-based Dropbox remote. Here is a prototype bash script: https://github.com/dandi/dandisets/blob/10225719f0bf79758f4ce629ef23d098cf01380c/tools/cfg_dandi_backup#L11 and the code then uses the spec picked up from this YAML: https://github.com/dandi/dandisets/blob/ad539eb9c827400a332c7619ec6eaf894f65d8aa/tools/backups2datalad.cfg.yaml#L5 .
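An untested sketch of that suggestion, assuming a store reachable at `ria+ssh://server/path/to/store` (the remote name, URL, and the publication-dependency step are illustrative, not confirmed against this setup):

```
# register the ORA helper through git-annex's generic "external" remote type
git annex initremote ora-storage type=external externaltype=ora \
    encryption=none url="ria+ssh://server/path/to/store"
# make pushes to the RIA git sibling depend on the storage remote
datalad siblings configure -s ria --publish-depends ora-storage
```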
What is the problem?
My datasets include the complete computation environment, that is, a container and the R package library, since re-installing the packages every time I clone the dataset takes a long time. However, the package library is about 500 MB and is largely (but not completely) the same across datasets. I know from your help in the chat that a RIA store can access files in `annex/objects/(hash1)/(hash2)/` as well as in `archives/archive.7z/(hash1)/(hash2)/`. This allows deduplication of data within a dataset. Is there a way to have deduplication across datasets? For example, I know that git-annex supports the `bup` special remote, which saves files in a dedicated git repository but first splits each file into small chunks, connected via a git tree (a directory, if checked out). Is it possible (advisable) to use a bup special remote with a RIA store, or is there an easier solution (because otherwise the safety of the files would rely on yet another program)?
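For reference, this is roughly how a bup special remote is initialized per the git-annex documentation (the repository location is illustrative, and I have not tried combining it with a RIA store):

```
# store annexed content, chunked and deduplicated, in a bup repository
git annex initremote mybup type=bup encryption=none buprepo=example.com:/big/mybup
```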
What steps will reproduce the problem?
No response
DataLad information
No response
Additional context
No response
Have you had any success using DataLad before?
I can manage my results quite nicely with it, even though I am always scared that I might lose files because I do not understand the complexity brought in by git-annex.