distribits / distribits-2024-hackathon

Explore how to make datalad and dtool datasets interoperable #10

Open jotelha opened 5 months ago

jotelha commented 5 months ago

We described the idea in a file at https://github.com/distribits/distribits-2024-hackathon/tree/main/datalad-dtool-interoperability

In short, I initially saw two approaches (outlined in the file linked above).

Yesterday, Wolfgang Traylor, for example, pointed to the idea of looking at git-annex special remotes for this purpose.

After discussion with @matrss today, we came up with the following idea:

We can export a dtool dataset as a "snapshot" of a versioned datalad dataset, with the unique mapping

dtool dataset UUID <-> (datalad dataset UUID, commit)

The exported dtool dataset should be annotated with the UUID and commit of the source datalad dataset.

Neither dtool datasets nor datalad datasets have much obligatory metadata beyond what's available via the file system (and the .git logs in the case of datalad).

The few obligatory dtool dataset fields

{
   "name": "2022-02-09-test-dataset-with-umlaut-items",  
   "created_at": 1644403564.250988, 
   "frozen_at": 1644403787.24659
}

can be filled with the (folder) name of the datalad repository, the date of the datalad dataset's initial commit, and the date of the datalad dataset's commit at the time of export.
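A minimal sketch of how these fields could be derived, assuming GitPython is available (the function name is illustrative and not part of any existing extension):

import os
import git  # GitPython

def dtool_admin_fields(dataset_path):
    repo = git.Repo(dataset_path)
    commits = list(repo.iter_commits())  # newest first
    return {
        "name": os.path.basename(os.path.abspath(dataset_path)),
        "created_at": float(commits[-1].committed_date),  # initial commit
        "frozen_at": float(commits[0].committed_date),    # commit at export time
    }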

Having figured this out, exporting to a dtool dataset is very simple and should work analogously to the archive export, https://docs.datalad.org/en/stable/generated/man/datalad-export-archive.html

Therefore, as a first step, we want to create a simple export-dtool extension derived from https://github.com/datalad/datalad-extension-template

jotelha commented 5 months ago

The other way around, turning a dtool dataset into a datalad dataset, might be as simple as dumping the dtool dataset into a datalad dataset with a command like

datalad run dtool cp protocol://uri/of/source-dataset .

and starting to work from there.

When exporting back to a dtool dataset, the extension above could look for a hidden .dtool folder within the datalad dataset; for a dtool dataset on the local file system, this folder contains the dataset's metadata (https://peerj.com/articles/6562/#fig-1). The new snapshot could then be annotated with information on the dtool dataset the datalad dataset was originally derived from.
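For instance, recovering the source dataset's UUID could look like the following sketch, assuming the local-disk layout in which .dtool/dtool holds the administrative metadata as JSON (the helper name is hypothetical):

import json
from pathlib import Path

def source_dtool_uuid(dataset_path):
    # .dtool/dtool holds administrative metadata (uuid, name, ...) as JSON
    admin_file = Path(dataset_path) / ".dtool" / "dtool"
    if admin_file.is_file():
        return json.loads(admin_file.read_text()).get("uuid")
    return None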

A next step to think about would be a simple read-only git-annex special remote for using dtool datasets as a source. This is possible since the dtool API allows fetching single files from a dataset.

We could use https://github.com/matrss/datalad-cds/blob/main/src/datalad_cds/cds_remote.py as a template.
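A rough sketch of such a read-only remote, assuming the annexremote helper library and dtoolcore (the class name and the dtool: URL convention are made up for illustration):

import shutil

import dtoolcore
from annexremote import Master, RemoteError, SpecialRemote

class DtoolRemote(SpecialRemote):
    def initremote(self):
        pass

    def prepare(self):
        pass

    def transfer_retrieve(self, key, filename):
        # assumed URL convention: dtool:<dataset-uri>#<item-identifier>
        for url in self.annex.geturls(key, "dtool:"):
            uri, identifier = url[len("dtool:"):].rsplit("#", 1)
            dataset = dtoolcore.DataSet.from_uri(uri)
            shutil.copyfile(dataset.item_content_abspath(identifier), filename)
            return
        raise RemoteError("no dtool URL registered for key " + key)

    def checkpresent(self, key):
        return bool(self.annex.geturls(key, "dtool:"))

    def transfer_store(self, key, filename):
        raise RemoteError("dtool special remote is read-only")

    def remove(self, key):
        raise RemoteError("dtool special remote is read-only")

def main():
    master = Master()
    master.LinkRemote(DtoolRemote(master))
    master.Listen()

if __name__ == "__main__":
    main()

git-annex would talk to such a program over its special remote protocol; transfer_store and remove refuse writes, which keeps the remote read-only.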

yarikoptic commented 5 months ago

If it is only for consumption, and you could produce a list of URLs (and metadata) per file in a dtool dataset, then you could use https://docs.datalad.org/en/stable/generated/man/datalad-addurls.html to quickly populate a dataset, or a hierarchy of them, while staying connected to the data in the original dtool dataset.
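For illustration, assuming a CSV listing generated from a dtool dataset's manifest (file name and columns are made up):

# urls.csv:
#   url,relpath
#   https://example.org/my-dataset/data/item1.txt,data/item1.txt
#   https://example.org/my-dataset/data/item2.txt,data/item2.txt

datalad addurls urls.csv '{url}' '{relpath}'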

jotelha commented 5 months ago

Today's result is the basis for a simple "datalad-dtool" extension, https://github.com/livMatS/datalad-dtool. It's derived from datalad's export_archive functionality, https://github.com/datalad/datalad/blob/35c5492469f53123d937b3da60f079912f749545/datalad/local/export_archive.py, but uses dtool's Python API to generate a valid dtool dataset.
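Reduced to a minimal sketch with dtoolcore (file discovery and all datalad specifics omitted; the function is illustrative, not the extension's actual code), the core of such an export looks roughly like:

import dtoolcore

def export_to_dtool(name, base_uri, files):
    # files: iterable of (absolute_path, relative_path) pairs
    proto = dtoolcore.create_proto_dataset(name=name, base_uri=base_uri)
    for fpath, relpath in files:
        proto.put_item(fpath, relpath)  # copies the file and records its hash
    proto.freeze()  # finalizes a valid, immutable dtool dataset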

jotelha commented 5 months ago

> If it is only for consumption, and you could produce a list of URLs (and metadata) per file in a dtool dataset, then you could use https://docs.datalad.org/en/stable/generated/man/datalad-addurls.html to quickly populate a dataset, or a hierarchy of them, while staying connected to the data in the original dtool dataset.

Thanks, that's another good idea. It would only work, though, for dtool datasets actually published on the web and accessible via HTTP (https://dtool.readthedocs.io/en/latest/publishing_a_dataset.html#publishing-a-dataset), which is usually not the case.

matrss commented 5 months ago

Another thought: the export to dtool could include a git bundle so that it also archives the history of how the dataset came to be. An import into a DataLad dataset could use this to import the git history in addition to the data.
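For example (file names are illustrative), the history could be bundled on export and restored on import like this:

# on export, inside the datalad dataset: bundle all refs and their history
git bundle create history.bundle --all
# ship history.bundle as an item of the exported dtool dataset

# on import: recreate the repository, history included, from the bundle
git clone history.bundle imported-dataset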

jotelha commented 5 months ago

When talking with @candleindark about their datalad-registry during Saturday lunch, I learnt that they use the DataLad extension MetaLad (http://docs.datalad.org/projects/metalad/en/stable/index.html) to extract metadata from a DataLad dataset. That could be useful for importing/exporting dtool datasets with respect to dtool's free-form README.yml metadata as well.
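For reference, MetaLad drives extraction through named extractors, e.g. (assuming datalad-metalad is installed; a dtool-specific extractor for README.yml would still have to be written):

datalad meta-extract -d . metalad_core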

yarikoptic commented 5 months ago

Yes -- getting a metadata extractor for such datalad datasets (where a dtool dataset is just dumped as-is into datalad) would be useful, I think. BTW -- how does one tell which README.yml among all of these, https://github.com/search?q=path%3A/README.yml&type=code, is for dtool?

jotelha commented 5 months ago

There is no fixed schema; a user is basically free to put whatever they like into a dataset's README.yml, and I think that had a certain appeal to me when I started using dtool datasets. There is, however, the possibility to distribute templates (like this one, https://github.com/livMatS/dtool-demo/blob/059539de27447ee8c892b280b1757ba1c2287e4e/010-dataset-creation/dtool_readme_template.yml) among the users in a group; such a template is then used to populate the README.yml of every new dataset. This terminal session demonstrates it briefly: https://asciinema.org/a/511462.