distribits / distribits-2024-hackathon


Explore how to make datalad and dtool datasets interoperable #10

Open · jotelha opened this issue 7 months ago

jotelha commented 7 months ago

We posed the idea as a file at https://github.com/distribits/distribits-2024-hackathon/tree/main/datalad-dtool-interoperability

In short, I initially saw two approaches:

Yesterday, Wolfgang Traylor, for example, pointed to the idea of looking at git-annex special remotes for that purpose.

After discussion with @matrss today, we came up with the following idea:

We can export a dtool dataset as a "snapshot" of a versioned datalad dataset, with the unique mapping

dtool dataset UUID <-> (datalad dataset UUID, commit)

The exported dtool dataset should be annotated with the UUID and commit of the source datalad dataset.

Neither dtool datasets nor datalad datasets carry much obligatory metadata beyond what's available via the file system (and the git log in the case of datalad).

The few obligatory dtool dataset fields

{
   "name": "2022-02-09-test-dataset-with-umlaut-items",  
   "created_at": 1644403564.250988, 
   "frozen_at": 1644403787.24659
}

can be filled with the (folder) name of the datalad repository, the date of the datalad dataset's initial commit, and the date of the datalad dataset's commit at the time of export.

Having figured this out, export to a dtool dataset is very simple and should work in analogy to the archive export, https://docs.datalad.org/en/stable/generated/man/datalad-export-archive.html

Therefore, as a first step we want to create a simple export-dtool extension derived from https://github.com/datalad/datalad-extension-template
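A minimal sketch of what the core of such an export-dtool command might do, assuming dtoolcore's create_proto_dataset/put_item/put_annotation/freeze API (the annotation names "datalad-uuid" and "datalad-commit" are just an illustration):

import os
import subprocess

import dtoolcore


def export_dtool(datalad_path, base_uri, name=None):
    name = name or os.path.basename(os.path.abspath(datalad_path))

    # identify the source datalad dataset: its UUID and the commit being exported
    uuid = subprocess.check_output(
        ["git", "config", "--file", ".datalad/config", "datalad.dataset.id"],
        cwd=datalad_path, text=True).strip()
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=datalad_path, text=True).strip()

    proto = dtoolcore.create_proto_dataset(
        name=name, base_uri=base_uri, readme_content="")

    # add the worktree content as items, skipping .git;
    # assumes annexed file content is locally present (e.g. after datalad get .)
    for root, dirs, files in os.walk(datalad_path):
        dirs[:] = [d for d in dirs if d != ".git"]
        for fname in files:
            fpath = os.path.join(root, fname)
            proto.put_item(fpath, os.path.relpath(fpath, datalad_path))

    # record the unique mapping: dtool dataset UUID <-> (datalad UUID, commit)
    proto.put_annotation("datalad-uuid", uuid)
    proto.put_annotation("datalad-commit", commit)

    proto.freeze()
    return proto.uri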

jotelha commented 7 months ago

The other way around, turning a dtool dataset into a datalad dataset, might be as simple as dumping a dtool dataset into a datalad dataset with a command like

datalad run dtool cp protocol://uri/of/source dataset .

and starting to work from there.

In the case of exporting such a dtool dataset again, the extension above could look for a hidden .dtool folder within the datalad dataset (for a dtool dataset on the local file system, this folder contains the dataset's metadata, https://peerj.com/articles/6562/#fig-1) and annotate the new snapshot with information on the dtool dataset from which the datalad dataset had initially been derived.
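For a dtool dataset on the local file system, the administrative metadata (including its UUID) is a small JSON file inside that hidden folder; a sketch of how the exporter could pick it up (the .dtool/dtool file name reflects the file storage broker's layout as we understand it, and the annotation name is invented):

import json
import os


def source_dtool_uuid(datalad_path):
    admin_metadata_path = os.path.join(datalad_path, ".dtool", "dtool")
    if not os.path.exists(admin_metadata_path):
        return None  # this datalad dataset was not derived from a dtool dataset
    with open(admin_metadata_path) as f:
        return json.load(f).get("uuid")


# e.g. proto.put_annotation("derived-from-dtool-uuid", source_dtool_uuid(path))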

A next step to think about would be a simple git annex special remote (read-only) for using dtool datasets as a source. This is possible since the dtool API allows fetching of single files from a dataset.

We could use https://github.com/matrss/datalad-cds/blob/main/src/datalad_cds/cds_remote.py as a template.
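Fetching a single item via the Python API is essentially a one-liner with dtoolcore; a sketch of what a read-only remote would build on:

import shutil

from dtoolcore import DataSet


def fetch_item(dataset_uri, item_id, target_path):
    dataset = DataSet.from_uri(dataset_uri)
    # item_content_abspath() fetches just this one item into a local cache
    shutil.copyfile(dataset.item_content_abspath(item_id), target_path)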

yarikoptic commented 7 months ago

If it is only for consumption, and you could produce a list of URLs (and metadata) per file in a dtool dataset, then you could use https://docs.datalad.org/en/stable/generated/man/datalad-addurls.html to quickly populate a dataset, or a hierarchy of them, while keeping it connected to the data in the original dtool dataset.
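A sketch of how such a table could be generated from a dtool dataset's manifest (the URL layout below is only an assumption and presupposes a dataset reachable over HTTP):

import csv

from dtoolcore import DataSet


def write_addurls_table(dataset_uri, csv_path):
    dataset = DataSet.from_uri(dataset_uri)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "url", "md5", "size"])
        for item_id in dataset.identifiers:
            props = dataset.item_properties(item_id)
            writer.writerow([
                props["relpath"],
                "{}/data/{}".format(dataset_uri, props["relpath"]),  # assumed layout
                props["hash"],
                props["size_in_bytes"],
            ])

# afterwards, e.g.: datalad addurls table.csv '{url}' '{filename}'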

jotelha commented 7 months ago

The result of today is the basis for a simple "datalad-dtool" extension, https://github.com/livMatS/datalad-dtool. It's derived from datalad's export_archive functionality, https://github.com/datalad/datalad/blob/35c5492469f53123d937b3da60f079912f749545/datalad/local/export_archive.py, but uses dtool's Python API to generate a valid dtool dataset.

jotelha commented 7 months ago

If it is only for consumption, and you could produce a list of URLs (and metadata) per file in a dtool dataset, then you could use https://docs.datalad.org/en/stable/generated/man/datalad-addurls.html to quickly populate a dataset, or a hierarchy of them, while keeping it connected to the data in the original dtool dataset.

Thanks, that's another good idea. It would only work, though, for dtool datasets actually published on the web and accessible via HTTP (https://dtool.readthedocs.io/en/latest/publishing_a_dataset.html#publishing-a-dataset), which is usually not the case.

matrss commented 7 months ago

Another thought: the export to dtool could include a git bundle so that it also archives the history of how the dataset came to be. An import into a DataLad dataset could use this to import the git history in addition to the data.
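A sketch of how such a bundle could be added during export (assuming dtoolcore's put_item copies the file right away; the relpath ".git.bundle" is an arbitrary choice):

import os
import subprocess
import tempfile


def add_history_bundle(proto_dataset, datalad_path):
    with tempfile.TemporaryDirectory() as tmpdir:
        bundle_path = os.path.join(tmpdir, "history.bundle")
        # a bundle with all refs can later be cloned from or fetched directly
        subprocess.check_call(
            ["git", "bundle", "create", bundle_path, "--all"],
            cwd=datalad_path)
        proto_dataset.put_item(bundle_path, ".git.bundle")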

jotelha commented 7 months ago

When talking with @candleindark about their datalad-registry during Saturday lunch, I learnt that they use the DataLad extension MetaLad (http://docs.datalad.org/projects/metalad/en/stable/index.html) to extract metadata from a DataLad dataset. That could be useful for importing/exporting dtool datasets with respect to dtool's README.yml free-form metadata as well.

yarikoptic commented 7 months ago

yes -- getting a metadata extractor for such datalad datasets (where dtool datasets are just dumped into datalad as-is) would be useful, I think. BTW -- how can one tell which README.yml among all of these: https://github.com/search?q=path%3A/README.yml&type=code is for dtool?

jotelha commented 7 months ago

There is no fixed schema; a user is basically free to put whatever they like into a dataset's README.yml, and I think that had a certain appeal to me when I started using dtool datasets. There is, however, the possibility to distribute templates (like this one, https://github.com/livMatS/dtool-demo/blob/059539de27447ee8c892b280b1757ba1c2287e4e/010-dataset-creation/dtool_readme_template.yml) among the users in a group, and that template is then used to populate the README.yml for every new dataset. This terminal session demonstrates this briefly: https://asciinema.org/a/511462.

jotelha commented 1 month ago

At the Hackathon, we drafted a simple datalad extension for exporting datalad datasets as dtool datasets (https://github.com/livMatS/datalad-dtool/blob/2024-08-29-special-remote/datalad_dtool/export.py)

With that, we can export a datalad dataset as a dtool dataset as follows:

TMP="$(mktemp -d "${TMPDIR:-/tmp}/gar-XXXXXXX")"
BASE_URI="file://$TMP"

cd "$TMP"

datalad create my-datalad-dataset

cd my-datalad-dataset
echo "This is a test file" > testfile.txt
datalad save -m "Added a test file"
cd ..

datalad export-dtool --dataset my-datalad-dataset --name my-dtool-dataset "${BASE_URI}"

and inspect it with dtool:

$ dtool ls my-dtool-dataset
393110d4a48dd1d4f5b75558ca1d4c985b3ae836    .datalad/.gitattributes
899067649b874df60c050d1a1d6b7312397acbdf    .datalad/config
24139dae656713ba861751fb2c2ac38839349a7a    .gitattributes
147a61012231fd1a7bfe0c57c88a972e93817ace    testfile.txt

$ dtool summary my-dtool-dataset
name: my-dtool-dataset
uuid: f477497b-db95-4b86-a85d-779ce43af22a
creator_username: jotelha
number_of_items: 4
size: 170.0B
frozen_at: 2024-10-24

$ dtool annotation ls my-dtool-dataset
datalad-commit  e460e8e70e6fb1ce24e4821f83a6d8a692bc6e7f
datalad-uuid    6c1aa86a-0e53-41c0-988b-4080ad96cd49

The annotations above track rudimentary provenance information about the source datalad dataset UUID and git commit:

$ cd my-datalad-dataset/

$ cat .datalad/config 
[datalad "dataset"]
    id = 6c1aa86a-0e53-41c0-988b-4080ad96cd49

$ git log
commit e460e8e70e6fb1ce24e4821f83a6d8a692bc6e7f (HEAD -> master)
Author: Johannes Hörmann <jotelha@jotelha-fujitsu-ubuntu-20.04>
Date:   Thu Oct 24 14:26:18 2024 +0200

    Added a test file

For the other way around - turning a dtool dataset into a datalad dataset - we envision a similar extension and workflow, e.g.

echo "This is a test file." > testfile.txt

dtool create my-dtool-dataset

# put item into dataset at git annex-expected key
dtool add item testfile.txt my-dtool-dataset
dtool freeze my-dtool-dataset

datalad create my-datalad-dataset
datalad import-dtool --dataset my-datalad-dataset my-dtool-dataset

or similar to populate the datalad dataset with the content from the dtool dataset.
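In its simplest form, such an import-dtool command might boil down to something like the following sketch, using dtoolcore together with datalad's Python API (the command itself does not exist yet; all names are illustrative):

import os
import shutil

import datalad.api as dl
from dtoolcore import DataSet


def import_dtool(dtool_uri, datalad_path):
    dataset = DataSet.from_uri(dtool_uri)
    for item_id in dataset.identifiers:
        props = dataset.item_properties(item_id)
        dest = os.path.join(datalad_path, props["relpath"])
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        # copy the item content under its original relative path
        shutil.copyfile(dataset.item_content_abspath(item_id), dest)
    dl.save(dataset=datalad_path,
            message="Imported dtool dataset {}".format(dataset.uuid))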

At the Hackathon, the idea arose to create a dtool special remote (see post above, https://github.com/distribits/distribits-2024-hackathon/issues/10#issuecomment-2041022961) to consume dtool datasets as a source for files in git annex'ed repositories.

This would track provenance properly, since the source dtool dataset and the item within it would be recorded per imported file by means of the dtool special remote.

@SickSmile1 and I have now started to look into the implementation of a git annex special remote based on https://github.com/Lykos153/AnnexRemote.

We started looking at the DirectoryRemote example (https://github.com/Lykos153/AnnexRemote/blob/7120e593c4f15d377481935dbef4f054535ee645/examples/git-annex-remote-directory) and the CDSRemote (https://github.com/matrss/datalad-cds/blob/f05de48482fdd06bea04da38d2b4810fe81909d8/src/datalad_cds/cds_remote.py) as a reference.

Later, we also found the RIA remote (https://github.com/datalad/git-annex-ria-remote/blob/master/ria_remote/remote.py) as another example of a complete implementation of a special remote based on AnnexRemote. This remote apparently also made its way into the core datalad code at https://github.com/datalad/datalad/tree/maint/datalad/customremotes.

Based on these examples, we managed to create a simple draft for a DtoolRemote (https://github.com/livMatS/datalad-dtool/blob/2024-08-29-special-remote/datalad_dtool/dtool_remote.py) that allows running the read-only portion of the git annex testremote routine (see https://github.com/livMatS/datalad-dtool/blob/dd8813c58576bdc47a7f20ac95a648746fdf4a71/examples/test_readonly_git-annex-remote-dtool).
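Stripped down, such a read-only remote follows this pattern (a sketch rather than the actual draft code; the dtool:<dataset-uri>/<item-id> URL scheme follows the addurl example further below):

import shutil

from annexremote import Master, RemoteError, SpecialRemote
from dtoolcore import DataSet


class DtoolRemote(SpecialRemote):
    def initremote(self):
        pass  # nothing to set up, the remote is read-only

    def prepare(self):
        pass

    def transfer_retrieve(self, key, filename):
        # ask git-annex which dtool: URLs are recorded for this key
        for url in self.annex.geturls(key, "dtool:"):
            dataset_uri, _, item_id = url[len("dtool:"):].rpartition("/")
            try:
                dataset = DataSet.from_uri(dataset_uri)
                shutil.copyfile(dataset.item_content_abspath(item_id), filename)
                return
            except Exception:
                continue
        raise RemoteError("could not retrieve {} from any dtool URL".format(key))

    def transfer_store(self, key, filename):
        raise RemoteError("this special remote is read-only")

    def checkpresent(self, key):
        return bool(self.annex.geturls(key, "dtool:"))

    def remove(self, key):
        raise RemoteError("this special remote is read-only")


def main():
    master = Master()
    master.LinkRemote(DtoolRemote(master))
    master.Listen()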

Several questions arose when trying to implement the desired behavior:

  1. If we have that read-only dtool special remote, would an import extension implemented in the same manner as the CDS downloader at https://github.com/matrss/datalad-cds/blob/f05de48482fdd06bea04da38d2b4810fe81909d8/src/datalad_cds/download_cds.py be in the spirit of the datalad ecosystem?

  2. In the comments above, we discussed a read-only remote for using dtool datasets and their contents as a source for files in a datalad dataset (or, one level lower, a git annex repo). Git annex itself, however, provides import and export interfaces. There are, for example, the importtree=yes and exporttree=yes properties when initializing a special remote. Would there be any benefit in implementing the export functionality, tackled above at the datalad level, one level lower at the git-annex level, by writing an "export remote" instead of a read-only remote? Would there be any benefit from the datalad point of view?

The next question follows directly from the previous one:

  3. The difference between an ExportRemote and a standard (bi-directional, read-write) SpecialRemote is not entirely clear to us. With the aim of eventually providing the envisioned datalad import-dtool interface, is it necessary to implement both the standard remote methods listed at https://github.com/Lykos153/AnnexRemote/tree/master?tab=readme-ov-file#usage and the export remote methods listed at https://github.com/Lykos153/AnnexRemote/tree/master?tab=readme-ov-file#export-remotes ?

  4. It's not entirely clear to us how the key handling in the core methods transfer_store and transfer_retrieve of a special remote is supposed to work. We read the git annex documentation at https://git-annex.branchable.com/backends/ as saying that the specific method for generating keys is configurable by the user and not up to the implementation of the special remote. With that, we understand that the key provided to these methods can be anything following the format

    BACKEND[-sNNNN][-mNNNN][-SNNNN-CNNNN]--NAME

mentioned on https://git-annex.branchable.com/internals/key_format . These keys can be generated from the file content (cryptographically secure) or from other information, e.g. from the URL only when using the URL backend, by adding a file with

    git annex addurl --file testfile.txt dtool:file:///tmp/gar-59CvaJA/test-dataset/147a61012231fd1a7bfe0c57c88a972e93817ace

On the other hand, we have seen remotes that only handle very specific mappings, e.g. only key to URL in the case of the CDSRemote (https://github.com/matrss/datalad-cds/blob/f05de48482fdd06bea04da38d2b4810fe81909d8/src/datalad_cds/cds_remote.py#L62-L72) or only key to path in the case of the RIA remote (https://github.com/datalad/git-annex-ria-remote/blob/e8e05edad03bb6a0f85314a353d1f5ef1d0b75f4/ria_remote/remote.py#L899).

It would, for example, be possible to create a unique mapping between items in a dtool dataset and git-annex MD5-based keys, as the MD5 checksums are stored in the dataset's manifest when freezing a dataset. Compare the git annex symbolic link and the content of the dtool manifest below for testfile.txt.

$ ls -lht my-datalad-dataset/
total 4,0K
lrwxrwxrwx 1 jotelha jotelha 118 Okt 24 14:26 testfile.txt -> .git/annex/objects/gv/PZ/MD5E-s20--5dd39cab1c53c2c77cd352983f9641e1.txt/MD5E-s20--5dd39cab1c53c2c77cd352983f9641e1.txt

$ cat my-dtool-dataset/.dtool/manifest.json 
{
  "dtoolcore_version": "3.18.2",
  "hash_function": "md5sum_hexdigest",
  "items": {
    "147a61012231fd1a7bfe0c57c88a972e93817ace": {
      "hash": "5dd39cab1c53c2c77cd352983f9641e1",
      "relpath": "testfile.txt",
      "size_in_bytes": 20,
      "utc_timestamp": 1729772779.628223
    },
    "24139dae656713ba861751fb2c2ac38839349a7a": {
      "hash": "5e802cfcfc878bb1dd91ddf4a10d9538",
      "relpath": ".gitattributes",
      "size_in_bytes": 55,
      "utc_timestamp": 1729772779.628223
    },
    "393110d4a48dd1d4f5b75558ca1d4c985b3ae836": {
      "hash": "82a20711e7f45f4d6a0fa53c9fb3f811",
      "relpath": ".datalad/.gitattributes",
      "size_in_bytes": 32,
      "utc_timestamp": 1729772779.628223
    },
    "899067649b874df60c050d1a1d6b7312397acbdf": {
      "hash": "d0086fbad1a559aea9029a0ba99d3fca",
      "relpath": ".datalad/config",
      "size_in_bytes": 63,
      "utc_timestamp": 1729772779.628223
    }
  }
}

This fact could be used in the special remote to give MD5-backend-generated keys special treatment and a robust mapping, as attempted in https://github.com/livMatS/datalad-dtool/blob/dd8813c58576bdc47a7f20ac95a648746fdf4a71/datalad_dtool/dtool_remote.py#L81-L96, but would that actually make sense?
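For illustration, such a mapping from an MD5-based key to a dtool item identifier could look roughly like this (a sketch; it relies on the manifest hash function being md5sum_hexdigest):

from dtoolcore import DataSet


def item_id_for_key(dataset_uri, key):
    # keys look like BACKEND-sNNNN--NAME, e.g. MD5E-s20--5dd39...41e1.txt
    backend = key.split("-", 1)[0]
    if backend not in ("MD5", "MD5E"):
        return None  # only MD5-based keys embed a checksum we can match
    name = key.rsplit("--", 1)[1]
    md5 = name.split(".", 1)[0]  # MD5E keys may carry an extension suffix
    dataset = DataSet.from_uri(dataset_uri)
    for item_id in dataset.identifiers:
        if dataset.item_properties(item_id)["hash"] == md5:
            return item_id
    return None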

Thanks a lot for reading and helping!

matrss commented 1 month ago

Re: importtree and exporttree: my understanding is that those are most useful when the remote does not store key-value pairs (i.e. annexed files with a key and content) but is still mutable. A directory special remote with importtree=yes, for example, can be used to periodically import the content of that directory, and if some other process outside of git-annex makes changes to the directory, those will be reflected in the repository after the import. Likewise, a directory special remote with exporttree=yes can be used to export the contents of a git tree to it, e.g. everything on the main branch. Then, when the main branch gets a few new commits, you can export to the same remote again and the directory you exported to will have those new changes reflected as well.

If this maps to some workflow you would want to support with dtool then it would make sense to implement them, but my understanding is that dtool datasets aren't really supposed to be mutable in this way (at least once frozen).

The fundamental difference between standard special remotes and export/import remotes is that the former have an understanding of git annex keys and how to store/retrieve their content, detached from any kind of directory structure or the filenames in the repository, while export remotes have an associated git tree hash that was exported to them and import remotes are essentially an "inbox" of new changes that can be periodically fetched. This is probably an oversimplification (e.g. export remotes can also contain annex objects by using the relatively new annexobjects=yes option), but this should be the general idea.


It's not entirely clear to us how the key handling in the core methods transfer_store and transfer_retrieve of a special remote is supposed to work.

The core idea is that it is just a key-value store: transfer_store gets a key (this can be one from any of the supported key backends, e.g. SHA256(E), MD5(E), URL, WORM, etc., the special remote has no influence on that) and a file whose content is to be stored, and is expected to put that key-content pair into the underlying storage it represents. Likewise, transfer_retrieve is expected to fetch the content of a key that was previously stored on the remote.
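A toy directory-backed remote illustrates this contract (a sketch on top of the AnnexRemote base class, nothing dtool-specific):

import os
import shutil

from annexremote import RemoteError, SpecialRemote


class ToyDirectoryRemote(SpecialRemote):
    def initremote(self):
        self.directory = self.annex.getconfig("directory")
        if not self.directory:
            raise RemoteError("you need to set directory=")
        os.makedirs(self.directory, exist_ok=True)

    def prepare(self):
        self.directory = self.annex.getconfig("directory")

    def transfer_store(self, key, filename):
        # file the content under the key, whatever backend the key comes from
        shutil.copyfile(filename, os.path.join(self.directory, key))

    def transfer_retrieve(self, key, filename):
        # hand back exactly what was stored under this key
        shutil.copyfile(os.path.join(self.directory, key), filename)

    def checkpresent(self, key):
        return os.path.exists(os.path.join(self.directory, key))

    def remove(self, key):
        path = os.path.join(self.directory, key)
        if os.path.exists(path):
            os.remove(path)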

The CDSRemote for example treats a json object describing the request to be sent to the Climate Data Store (which is stored as a URL associated to a key in git-annex) as the key for this "key-value store" (not to be confused with the key of the git-annex object), and will return the CDS' response to the request as its content. It is expected that the same request will always get the same response, so it loosely fits as a "key-value store", albeit read-only.

On the other hand, we have seen remotes that only handle very specific mappings, e.g. only key to URL in the case of the CDSRemote (https://github.com/matrss/datalad-cds/blob/f05de48482fdd06bea04da38d2b4810fe81909d8/src/datalad_cds/cds_remote.py#L62-L72)

It is important to make a distinction between the key (backends) and URLs. A key can have any number of URLs associated with it, and it doesn't matter to what backend the key belongs. E.g. a SHA256E key can have multiple associated URLs, which are assumed to be usable to fetch its content.

In the case of the CDSRemote this means that it still is agnostic to the key backend, it can handle them all. It takes whatever key it gets, asks git-annex what URLs are associated with that key, and then retrieves one of those. It doesn't have to deal with different keys in transfer_store, since it is read-only.

If you use git annex addurl without the --fast option then git-annex will fetch the content from the URL and store it using SHA256E (unless there is a different backend configured, e.g. DataLad changes it to MD5E for all files), while also associating the key with the source URL. If you use git annex addurl --fast, then git-annex won't fetch the content and therefore can't generate a key based on the content. In this case it will use the URL backend to generate a unique key from the URL, and associate the source URL with that URL key. These are separate things though, and I think you could use unregisterurl and registerurl to remove the associated URL from the key and replace it with something entirely different, without affecting the key (since that is just the unique identifier for this piece of content, and it would break e.g. symlinks in the repository if it were changed after the fact).

If a key uses one of the checksum backends, then git-annex will check that the content it gets matches the checksum, and throw an error if not. In the case of non-checksum backends it obviously can't do that (VURL is a special case, but so far I had trouble using it with a special remote).


I think what to do depends on some architectural decisions in dtool that I don't know about, but here are my thoughts anyway:


This got a bit long, but I hope it is at least vaguely helpful to you.

yarikoptic commented 1 month ago

good stuff. Are there exceptionally large (in number of files) "dtool datasets"? I wonder if there is a need to provision automated partitioning into "subdatasets" (git submodules)? E.g. that's what we made easier with datalad addurls, where // in the path specifies a dataset boundary. Also, in addurls we provide a convenience to "synthesize" annex keys based on metadata, e.g. MD5 checksums. To a degree, addurls could be considered a "poor man's one-way import-tree" based solely on metadata, and potentially from different sources/directories (that actually was the initial motivator/use-case).
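For illustration (the partitioning rule here is just an example): with addurls, a // in the filename value marks a subdataset boundary, so a large dtool dataset could e.g. be split by first path component when generating the table:

def addurls_filename(relpath):
    if "/" not in relpath:
        return relpath
    top, rest = relpath.split("/", 1)
    return "{}//{}".format(top, rest)  # "top" becomes its own subdataset


assert addurls_filename("simulations/run-001/out.nc") == "simulations//run-001/out.nc"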