datalad / datalad-remake

Design `datalad remake-provision` #12

Open mih opened 2 months ago

mih commented 2 months ago

Presently blocked by: https://github.com/psychoinformatics-de/datalad-concepts/issues/174

This is about the first half of #10 -- a datalad-based data source or data provisioning helper. See https://github.com/datalad/datalad-remake/issues/13 for the other half.

Purpose

Materialize data in the form of files (in directories) on the filesystem. Data are obtained via some method involving datalad (datasets), e.g. clone, get, download, clone-from-metadata, etc. Creating a full clone or a checkout is not a necessity, nor is the result of a provisioning (necessarily) a Git repository.

Target use cases

(1) serves the standard use case of datalad run/rerun. (2) can be useful for composing workflows that do not require a particular directory layout. These could be executed without having to fiddle with checkouts of nested datasets. Instead, any required content (tracked via Git in some way) can be produced under any (fixed) given name, and fed to a workflow (which can run locally or remotely).

API

We need to be able to specify

(2) would not be needed if relative paths always had a version prefix (à la a Git tree-ish). (4) could be merged with (3) via a configurable and optional delimiter (e.g. NULL byte by default) that turns the content identifier into a source/dest pair (only relevant for the CLI).
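
To make the delimiter idea concrete, here is a minimal sketch of how a single CLI argument could be split into a version, a source path, and a destination name. Everything in it is an assumption for illustration: `ProvisionSpec` and `parse_spec` are hypothetical names and not part of datalad-remake, the `<tree-ish>:<path>` form is borrowed from `git show`, and falling back to the source path when no destination is given is just one possible default.

```python
from typing import NamedTuple, Optional


class ProvisionSpec(NamedTuple):
    """Hypothetical container for one provisioning request."""
    version: Optional[str]  # Git tree-ish prefix, or None if absent
    source: str             # path of the content within that version
    dest: str               # name under which the content is materialized


def parse_spec(spec: str, delimiter: Optional[str] = "\0") -> ProvisionSpec:
    """Split 'IDENTIFIER[<delimiter>DEST]' into its parts (assumed syntax)."""
    # The delimiter is optional: with delimiter=None the whole argument
    # is treated as the content identifier.
    if delimiter and delimiter in spec:
        identifier, dest = spec.split(delimiter, 1)
    else:
        identifier, dest = spec, None

    # Assumed convention: '<tree-ish>:<path>', as accepted by `git show`.
    version, sep, path = identifier.partition(":")
    if not sep:
        # No version prefix; the identifier is a plain relative path.
        version, path = None, identifier

    # Without an explicit destination, reuse the source path.
    return ProvisionSpec(version, path, dest or path)
```

With the default NULL-byte delimiter, `parse_spec("v1.0:data/raw.csv\0input.csv")` yields version `v1.0`, source `data/raw.csv`, and destination `input.csv`. A more shell-friendly delimiter, such as `=`, could be passed via `delimiter="="`. Note that a literal `:` in a plain path would be misread as a version separator under this assumed syntax.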

Related issues

mih commented 2 months ago

Quick Sunday downtime realization to be integrated above: