datalad / datalad-ria

Adds functionality for RIA stores to DataLad
http://datalad.org

Concepts and terminology #98

Open mih opened 9 months ago

mih commented 9 months ago

This has never been defined properly. I think we should do it and put it in the docs. At minimum this needs to cover:

RIA stands for "Remote Indexed Archive"; that much we know. I cannot remember what ORA stands for. The term "RIA store" typically refers to the particular (filesystem) structure described in the FAIRly big paper and the handbook chapter.

Broader perspective

RIA is a data structure in which the location of every component is derived from identifiers. It relies on identifiers that are available in any DataLad dataset: the DataLad dataset ID and the annex key.

Things that can be put into a RIA store are: bare Git repositories, annex keys/objects.

A RIA store takes the form of a directory tree on the file system, in which some path components are the respective identifiers.

The directory tree is organized (at the top level) as a collection of per-dataset subtrees. This is done to simplify server-side maintenance tasks (e.g., deleting a dataset is just deleting its directory, with no need to sift through a joint key store to find all the pieces).
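To make the identifier-based addressing concrete, here is a minimal sketch of how the two paths could be computed. It assumes the split discussed further down in this thread (first three characters of the dataset ID, then the rest) and the annex/objects/<dirhash>/<key>/<key> object layout. The function names and the example ID are made up for illustration, and dirhash is taken as an input rather than computed.

```python
from pathlib import PurePosixPath


def dataset_dir(store_root: str, dataset_id: str) -> PurePosixPath:
    """Return the per-dataset subtree: <root>/<first 3 chars>/<remaining chars>."""
    if len(dataset_id) <= 3:
        raise ValueError("dataset ID too short to split into two levels")
    return PurePosixPath(store_root) / dataset_id[:3] / dataset_id[3:]


def annex_object_path(
    store_root: str, dataset_id: str, dirhash: str, key: str
) -> PurePosixPath:
    """Return the location of one annex key inside the dataset's subtree.

    `dirhash` is passed in as-is; git-annex derives it from the key,
    and that derivation is deliberately not reproduced here.
    """
    return (
        dataset_dir(store_root, dataset_id)
        / "annex" / "objects" / dirhash / key / key
    )


if __name__ == "__main__":
    dsid = "8ac72b6b-5e7a-4d2f-9c1e-0123456789ab"  # made-up example ID
    print(dataset_dir("/data/ria-stores/store-1", dsid))
```

Because everything that belongs to one dataset lives below dataset_dir(), removing a dataset from the store amounts to removing that one directory.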

Conceptually, there is no reason for this file-system representation to be the only way to materialize a RIA store. Any object store could do per-dataset-per-key object addressing. It would not be as straightforward to put bare repositories into such an alternative "backend". But with the invention of datalad-annex::, this is no longer a fundamental issue. Bare repositories can also be "just" annex keys.

All we know about "ORA" is that this is the name of the special remote implementation that can talk to RIA stores.

The rest has been bolted on (storage of annex keys in archives to trim inode consumption; aliases).

Relationship of ora special remote to the uncurl special remote

The uncurl special remote could be considered a strict superset of ora. It also allows for an identifier-based organization, and supports the identifiers used by ora. uncurl uses a different IO abstraction layer, which already comes with some implementations (ssh, http, file).

It would be useful to compare the implementations in detail. We may find advantages and disadvantages on either side and could use the outcome for improving uncurl.

What uncurl does not do "out of the box" is retrieval of archive members. At the moment, only the archivist special remote does that, and only for archives that are available locally. This missing functionality could be implemented via a dedicated UrlOperations implementation. We could implement RiaSshUrlOperations, which is just SshUrlOperations, except for a fallback in the download operation: on access failure it would look for the archive and check whether the archive can provide the desired key.
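The fallback idea can be sketched independently of any concrete handler class. The snippet below is only an illustration of the control flow: ResourceUnavailable, fetch_direct, and fetch_from_archive are hypothetical stand-ins, not names from datalad-next, and a real RiaSshUrlOperations would put this logic into its download method.

```python
from pathlib import Path
from typing import Callable


class ResourceUnavailable(Exception):
    """Stand-in for whatever error a real handler raises on access failure."""


def download_with_archive_fallback(
    key: str,
    to_path: Path,
    fetch_direct: Callable[[str, Path], None],
    fetch_from_archive: Callable[[str, Path], bool],
) -> str:
    """Try the regular download; on access failure, ask the archive for the key.

    Returns "direct" or "archive" to report which route succeeded, and
    re-raises the original error if the archive cannot provide the key.
    """
    try:
        fetch_direct(key, to_path)
        return "direct"
    except ResourceUnavailable:
        # fetch_from_archive returns True only if it found and wrote the key
        if fetch_from_archive(key, to_path):
            return "archive"
        raise
```

Keeping the two fetchers as plain callables means the same wrapper would fit an SSH-based and an HTTP-based variant alike.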

If we go down this path, we could also support HTTP-based partial archive access with a dedicated RiaHttpUrlOperations (using the implementation draft that @christian-monch started).

Taken together, the ora special remote could be reimplemented as a plain uncurl remote (derived class), that employs a dedicated URL handler configuration for its internal AnyUrlOperations.

christian-monch commented 9 months ago

I cannot remember what ORA stands for

I think it is defined in the datalad-context as "git-annex optional remote access". IIUC, it refers to git-annex special remotes.

christian-monch commented 9 months ago

[...] Taken together, the ora special remote could be reimplemented as a plain uncurl remote (derived class), that employs a dedicated URL handler configuration for its internal AnyUrlOperations.

With regard to using uncurl, there is a "simple" way to simulate the behavior of a RIA store remote with uncurl. For example, the RIA store ria+file:///data/ria-stores/store-1 could be accessed by uncurl via the following git annex remote initialization:

$ git annex initremote uncurl-ria type=external externaltype=uncurl encryption=none \
    match='ria\+(?P<scheme>[^:]+)://(?P<site>[^/]*)/(?P<path>.*)$' \
    'url=file:///data/ria-stores/store-1/{datalad_dsid[0]}{datalad_dsid[1]}{datalad_dsid[2]}/{datalad_dsid[3]}{datalad_dsid[4]}{datalad_dsid[5]}{datalad_dsid[6]}{datalad_dsid[7]}{datalad_dsid[8]}{datalad_dsid[9]}{datalad_dsid[10]}{datalad_dsid[11]}{datalad_dsid[12]}{datalad_dsid[13]}{datalad_dsid[14]}{datalad_dsid[15]}{datalad_dsid[16]}{datalad_dsid[17]}{datalad_dsid[18]}{datalad_dsid[19]}{datalad_dsid[20]}{datalad_dsid[21]}{datalad_dsid[22]}{datalad_dsid[23]}{datalad_dsid[24]}{datalad_dsid[25]}{datalad_dsid[26]}{datalad_dsid[27]}{datalad_dsid[28]}{datalad_dsid[29]}{datalad_dsid[30]}{datalad_dsid[31]}{datalad_dsid[32]}{datalad_dsid[33]}{datalad_dsid[34]}{datalad_dsid[35]}/annex/objects/{annex_dirhash}/{annex_key}/{annex_key}'

(That is a rather long command line. This is mainly due to the construction of the 2-level dataset-id-based directory structure. A RiaRemote that is derived from UncurlRemote might define a template-identifier named, e.g., dsid_dirs that contains the directories for the 2-level directory hierarchy derived from the dataset id, i.e. dataset_id[:3] + '/' + dataset_id[3:] + '/'.)
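As a stopgap until such a template identifier exists, the long url= value can be generated rather than typed. This is a small sketch under the assumption that the dataset ID is a canonical 36-character UUID string; the function names are made up, and str.format is used in the usage example only to check the expansion.

```python
UUID_LENGTH = 36  # canonical textual UUID: 32 hex digits plus 4 hyphens


def dsid_dirs_template(var: str = "datalad_dsid") -> str:
    """Build the per-character placeholders for <dsid[:3]>/<dsid[3:]>."""

    def chars(start: int, stop: int) -> str:
        # yields e.g. "{datalad_dsid[0]}{datalad_dsid[1]}{datalad_dsid[2]}"
        return "".join(f"{{{var}[{i}]}}" for i in range(start, stop))

    return f"{chars(0, 3)}/{chars(3, UUID_LENGTH)}"


def uncurl_url_template(store_url: str) -> str:
    """Assemble the full url= value for a store root such as file:///..."""
    return (
        f"{store_url.rstrip('/')}/{dsid_dirs_template()}"
        "/annex/objects/{annex_dirhash}/{annex_key}/{annex_key}"
    )


if __name__ == "__main__":
    print(uncurl_url_template("file:///data/ria-stores/store-1"))
    dsid = "8ac72b6b-5e7a-4d2f-9c1e-0123456789ab"  # made-up example ID
    print(dsid_dirs_template().format(datalad_dsid=dsid))
```

The first print reproduces the url= argument of the command above; the second shows that the placeholders expand to the two-level directory for the example ID.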

A datalad push --to uncurl-ria builds a RIA-store-compatible annex object-file structure in the location described by the destination URL. (It is not a complete RIA store: it only contains the annexed content, and it contains neither bare git repositories nor a ria-layout-version file. If one adds the missing pieces, datalad can clone a dataset with the original RIA implementation.)

Similarly, a ria+ssh store could be accessed via uncurl.

Obviously, this does not include access to the archive files (which are stored in <store-root>/<ds_id[0:3]>/<ds_id[3:]>/archives/archive.7z). It should be possible to access them if fsspec supports 7z.

While the above indicates that RIA remotes might be implemented by inheriting from UncurlRemote and just providing a proper match and url configuration, that might only be true for reading from RIA stores. Storing data in git-annex remotes has to ensure that elements are not reported as present while they are still being copied to the remote. That usually involves some temporary storage and an atomic move or rename operation. As far as I can see, that requires more logic than single uncurl operations.
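For a store on a local file system, the pattern described above could look like the following sketch. It is not taken from the ora implementation; store_atomically and the ".transfer-" prefix are made up for illustration, and a remote (ssh, http) variant would need an equivalent server-side rename.

```python
import os
import shutil
import tempfile
from pathlib import Path


def store_atomically(src: Path, dest: Path) -> None:
    """Copy src to dest such that dest only ever appears with full content."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Temporary file in the target directory, so the final rename stays
    # on one file system and can be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=".transfer-")
    try:
        with os.fdopen(fd, "wb") as tmp, open(src, "rb") as source:
            shutil.copyfileobj(source, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # A presence check on dest fails until this point and
        # sees the complete file afterwards.
        os.replace(tmp_name, dest)
    except BaseException:
        # Do not leave partial transfers behind
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A presence check that only looks at the final key path never reports an object that is still being transferred, which is the guarantee the paragraph above asks for.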