datalad / datalad-ria

Adds functionality for RIA stores to DataLad
http://datalad.org

Coordination: RIA annex remote requirements, Implementation alternatives, and status #107

Open christian-monch opened 2 months ago


This issue is intended to serve as a coordination hub for RIA annex remote requirements, a description of implementation alternatives, and the selection of implementation options. (I am using "RIA annex remote" instead of "ORA" here to reduce the name space a little).

Requirements for the annex remote

The following lists contain the identified functional and non-functional requirements. Check-marked requirements apply. Un-checked requirements are identified but do not need to be fulfilled. Add new requirements by editing this issue and leaving a notification about the changes in the changelog.

Functional requirements

Non-functional requirements

Implementation alternatives and status for the annex remote

IO abstraction vs multi-flavor RIA annex-remote implementation

In issue #99 we concluded that it is too restrictive to base the RIA annex-remote implementation on a file-system paradigm. It turned out that this abstraction layer is a logical bottleneck that works well for file-based access but does not translate easily to HTTP-based access. It is also unlikely to work for general object stores (it would require extending the abstraction layer with object-store-specific operations and switching between them in the higher-level implementation). See also #30.

The chosen alternative is an implementation that uses object-store-specific handlers to implement the basic annex-remote operations, e.g. TRANSFER RETRIEVE, TRANSFER STORE, CHECKPRESENT, and REMOVE.

This is currently done in PR #106. An abstract base class defines transfer_store, transfer_retrieve, checkpresent, and remove. ssh-, file-, and http-specific subclasses implement the abstract methods for the respective store.
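To make the structure concrete, here is a minimal sketch of what such a class hierarchy could look like. It is not the code from PR #106: the class names `RiaHandler` and `FileRiaHandler`, the method signatures, and the flat key layout are assumptions made for illustration only. Only the four method names are taken from the description above.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class RiaHandler(ABC):
    """Hypothetical base class: one abstract method per basic annex-remote operation."""

    @abstractmethod
    def transfer_store(self, key: str, local_file: str) -> None:
        """Copy `local_file` into the store under `key`."""

    @abstractmethod
    def transfer_retrieve(self, key: str, local_file: str) -> None:
        """Copy the content stored under `key` to `local_file`."""

    @abstractmethod
    def checkpresent(self, key: str) -> bool:
        """Return True if `key` is present in the store."""

    @abstractmethod
    def remove(self, key: str) -> None:
        """Remove `key` from the store; a missing key is not an error."""


class FileRiaHandler(RiaHandler):
    """Hypothetical file-flavor handler working on a local directory."""

    def __init__(self, base_dir: str) -> None:
        self.base_dir = Path(base_dir)

    def _key_path(self, key: str) -> Path:
        # Assumption: flat layout for brevity; a real RIA store uses a
        # nested directory layout for annex keys.
        return self.base_dir / key

    def transfer_store(self, key: str, local_file: str) -> None:
        self.base_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(local_file, self._key_path(key))

    def transfer_retrieve(self, key: str, local_file: str) -> None:
        shutil.copyfile(self._key_path(key), local_file)

    def checkpresent(self, key: str) -> bool:
        return self._key_path(key).is_file()

    def remove(self, key: str) -> None:
        self._key_path(key).unlink(missing_ok=True)
```

An ssh- or http-flavor handler would subclass `RiaHandler` in the same way and implement the four methods with store-specific means, so the higher-level annex-remote code only needs to select a handler based on the URL scheme.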

Current choice: multi-flavor RIA annex-remote implementation

URL-operations vs. individual implementations

Generally, URL-operations map nicely onto annex remote-operations, e.g. TRANSFER RETRIEVE maps onto download. So it seems natural to rely completely on UrlOperations to implement the RIA annex remote (for supported URL-schemes). But issue #102 (atomicity) and issue #103 (ensure_writable) highlight that UrlOperations might not yet fully support what an annex remote needs.

There might also be an efficiency issue, at least for SshUrlOperations. SshUrlOperations sets up a new ssh-connection for each operation. Therefore PR #106 uses the new persistent shell from datalad_next.shell (which is not yet merged into the main branch of datalad-next). The persistent shell supports arbitrary shell commands, which allows for efficient implementations of atomicity and ensure_writable (it also allows the remote execution of scripts, which can improve the efficiency of complex operations like ensure_writable).
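For readers unfamiliar with the two requirements, the following sketch shows what atomicity and ensure_writable could mean for a store on a local file system. It is an illustration under assumptions, not the approach of PR #106 (which runs shell commands through a persistent shell): the function names `ensure_writable` and `atomic_store` and the `.transfer-` prefix are hypothetical, and the sketch assumes that a store directory may be write-protected and that renaming within one directory is atomic on the underlying file system.

```python
import os
import shutil
import stat
import tempfile
from contextlib import contextmanager


@contextmanager
def ensure_writable(directory):
    """Temporarily add owner write permission to `directory`, then restore it."""
    original_mode = stat.S_IMODE(os.stat(directory).st_mode)
    os.chmod(directory, original_mode | stat.S_IWUSR)
    try:
        yield
    finally:
        # Restore the original permission bits even if the body failed.
        os.chmod(directory, original_mode)


def atomic_store(local_file, destination):
    """Copy to a temporary name next to `destination`, then rename into place.

    Readers of `destination` see either no file or the complete file,
    never a partially written one.
    """
    dest_dir = os.path.dirname(os.path.abspath(destination))
    with ensure_writable(dest_dir):
        # Hypothetical temp-file prefix; same directory keeps the rename
        # on one file system.
        fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".transfer-")
        os.close(fd)
        try:
            shutil.copyfile(local_file, tmp_path)
            os.replace(tmp_path, destination)
        except BaseException:
            # Do not leave a partial temporary file behind.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise
```

Over ssh, the same sequence (make writable, write to a temporary name, rename, restore permissions) takes several round trips if every step opens a new connection, which is why a persistent shell or a single remotely executed script is attractive there.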

Current choice: individual implementations, using UrlOperations and persistent shells

Requirements for datalad create-sibling-ria

The "datalad create-sibling-ria"-commands should move from datalad-core to datalad-ria. The commands currently use the io-abstraction. If we drop the io-abstraction (as argued above), the commands should probably be re-implemented without it.

Changelog

2024-04-12: @christian-monch: created