Open jotelha opened 5 months ago
The other way around, turning a dtool dataset into a datalad dataset, might be as simple as dumping a dtool dataset into a datalad dataset with a command like
datalad run dtool cp protocol://uri/of/source-dataset .
and starting to work from there.
In the case of exporting the dtool dataset again, the extension above could look for a hidden .dtool folder within the datalad dataset, since for a dtool dataset on the local file system this folder contains the dataset's metadata (https://peerj.com/articles/6562/#fig-1), and annotate the new snapshot with information on the dtool dataset the datalad dataset had initially been derived from.
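For illustration, a minimal sketch of reading that hidden folder, assuming dtoolcore's on-disk layout for local datasets (a `.dtool` directory holding a JSON file named `dtool` with the administrative metadata; the exact key set is dtoolcore's internal business, so a real implementation should rather go through the dtool API):

```python
import json
from pathlib import Path

def read_dtool_admin_metadata(dataset_root):
    """Read the JSON administrative metadata of a local dtool dataset.

    Assumes dtoolcore's on-disk layout: a hidden .dtool directory at
    the dataset root holds a file named 'dtool' containing JSON
    (UUID, name, timestamps, ...).  Parsing it directly is a sketch;
    the dtool Python API is the stable way to get at these values.
    """
    admin_file = Path(dataset_root) / ".dtool" / "dtool"
    with open(admin_file) as f:
        return json.load(f)
```

The export command could then record e.g. the dataset's UUID from this metadata as provenance on the newly exported snapshot.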
A next step to think about would be a simple read-only git-annex special remote for using dtool datasets as a source. This is possible since the dtool API allows fetching single files from a dataset.
We could use https://github.com/matrss/datalad-cds/blob/main/src/datalad_cds/cds_remote.py as a template.
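The retrieval side of such a remote is conceptually small. The sketch below is a stdlib stand-in, not the real thing: the key-to-path mapping stands in for what the dtool API provides for items of a dataset (dtoolcore's DataSet.item_content_abspath(), for instance), and the two methods mirror the transfer_retrieve/checkpresent hooks that a special remote built on the annexremote library (as in the datalad-cds template above) would implement:

```python
import shutil

class ReadOnlyDtoolRemote:
    """Sketch of the retrieval side of a read-only special remote.

    In a real remote these methods would be the transfer_retrieve and
    checkpresent hooks of an annexremote.SpecialRemote subclass, and
    the mapping would be resolved through the dtool API rather than
    passed in as a plain dict.
    """

    def __init__(self, item_paths):
        # mapping: git-annex key -> local path of the item's content
        self._item_paths = item_paths

    def checkpresent(self, key):
        return key in self._item_paths

    def transfer_retrieve(self, key, filename):
        if not self.checkpresent(key):
            raise KeyError(f"unknown key: {key}")
        # hand the item's content to git-annex at the requested path
        shutil.copyfile(self._item_paths[key], filename)
```

Since dtool datasets are frozen once created, a read-only remote sidesteps all of the store/remove complexity.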
If it is only for consumption, and you could produce a list of URLs (and metadata) per file in a dtool dataset, then you could use https://docs.datalad.org/en/stable/generated/man/datalad-addurls.html to quickly populate a dataset, or a hierarchy of them, while staying connected to the data in the original dtool dataset.
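A sketch of what producing that list could look like: turn a dtool-style item listing into the CSV table that addurls consumes. The `relpath` item field is an assumption about dtool's manifest layout, and the `<base_url>/data/<relpath>` URL scheme is an assumption about how an HTTP-published dataset is laid out; the `url`/`filename` columns follow the addurls documentation.

```python
import csv
import io

def addurls_table(base_url, items):
    """Build a CSV table for `datalad addurls` from a dtool-style
    item listing.

    `items` maps item identifiers to property dicts; the 'relpath'
    field and the <base_url>/data/<relpath> download URL are
    assumptions about dtool's manifest and HTTP publishing layout.
    """
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["url", "filename"])
    writer.writeheader()
    for identifier, props in sorted(items.items()):
        relpath = props["relpath"]
        writer.writerow({
            "url": f"{base_url}/data/{relpath}",
            "filename": relpath,
        })
    return out.getvalue()
```

One would then feed the table to addurls along the lines of `datalad addurls table.csv '{url}' '{filename}'`.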
The result of today is the basis for a simple "datalad-dtool" extension, https://github.com/livMatS/datalad-dtool. It is derived from datalad's export_archive functionality, https://github.com/datalad/datalad/blob/35c5492469f53123d937b3da60f079912f749545/datalad/local/export_archive.py, but uses dtool's Python API to generate a valid dtool dataset.
> If only for consumption, and you could produce a list of urls (and metadata) per file in dtool dataset then you could use https://docs.datalad.org/en/stable/generated/man/datalad-addurls.html to quickly populate dataset or hierarchy of them while keeping connected to data on original dtool dataset
Thanks, that's another good idea. It would only work, though, for dtool datasets actually published on the web and accessible via HTTP (https://dtool.readthedocs.io/en/latest/publishing_a_dataset.html#publishing-a-dataset), and that's usually not the case.
Another thought: the export to dtool could include a git bundle so that it also archives the history of how the dataset came to be. An import into a DataLad dataset could use this to import the git history in addition to the data.
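To make the bundle idea concrete, a sketch using plain git via subprocess (no DataLad specifics; git bundles are a standard git feature for shipping a repository, including its full history, as a single file that could travel inside the dtool dataset):

```python
import subprocess

def export_history_bundle(repo_dir, bundle_path):
    """Pack the complete git history of a (DataLad) dataset into a
    single bundle file that can be shipped inside a dtool dataset."""
    subprocess.run(
        ["git", "-C", repo_dir, "bundle", "create", bundle_path, "--all"],
        check=True,
    )

def import_history_bundle(bundle_path, target_dir):
    """Recreate the repository, including its history, from a bundle.
    A bundle is a valid clone source for git."""
    subprocess.run(["git", "clone", bundle_path, target_dir], check=True)
```

On import, the extension could clone from the bundle first and then reconcile the data files against the dtool items, so the resulting datalad dataset keeps the full provenance chain.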
When talking with @candleindark about their datalad-registry during Saturday lunch, I learnt that they use the DataLad extension MetaLad (http://docs.datalad.org/projects/metalad/en/stable/index.html) to extract metadata from a DataLad dataset. That could be useful for importing/exporting dtool datasets with respect to dtool's free-form README.yml metadata as well.
yes -- getting a metadata extractor for such datalad datasets (where dtool datasets are just dumped into datalad as-is) would be useful, I think. BTW -- how does one tell which README.yml among all of those, https://github.com/search?q=path%3A/README.yml&type=code, is for dtool?
There is no fixed schema; a user is basically free to put whatever they like into a dataset's README.yml, and I think that had a certain appeal to me when I started using dtool datasets. There is, however, the possibility to distribute templates (like this one, https://github.com/livMatS/dtool-demo/blob/059539de27447ee8c892b280b1757ba1c2287e4e/010-dataset-creation/dtool_readme_template.yml) among the users in a group; that template is then used to populate the README.yml of every new dataset. This terminal session demonstrates this briefly: https://asciinema.org/a/511462.
We posed the idea as a file at https://github.com/distribits/distribits-2024-hackathon/tree/main/datalad-dtool-interoperability
In short, I initially saw two approaches:
Yesterday, Wolfgang Traylor, for example, pointed to the idea of looking at git-annex special remotes for that purpose.
After discussion with @matrss today, we came up with the following idea:
We can export a dtool dataset as a "snapshot" of a versioned datalad dataset, with a unique mapping between the two.
The exported dtool dataset should be annotated with the UUID and commit of the source datalad dataset.
Both dtool datasets and datalad datasets have little obligatory metadata beyond what's available via the file system (and the .git logs in the case of datalad).
The few obligatory dtool dataset fields can be filled with the (folder) name of the datalad repository, the date of the datalad dataset's initial commit, and the date of the datalad dataset's commit at the time of export.
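A sketch of deriving those values with plain git (the git flags are standard; the `created_at`/`frozen_at` field names are an assumption about how they map onto dtool's administrative metadata):

```python
import os
import subprocess

def git_timestamps(repo_dir):
    """Return (initial_commit_ts, latest_commit_ts) as Unix epochs."""
    # oldest commit first thanks to --reverse
    first = subprocess.run(
        ["git", "-C", repo_dir, "log", "--reverse", "--format=%ct"],
        check=True, capture_output=True, text=True,
    ).stdout.splitlines()[0]
    last = subprocess.run(
        ["git", "-C", repo_dir, "log", "-1", "--format=%ct"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return float(first), float(last)

def dtool_fields_from_datalad_dataset(repo_dir):
    """Fill dtool's few obligatory fields from a datalad dataset:
    folder name, date of the initial commit, and date of the commit
    at the time of export (field names assumed to correspond to
    dtool's created_at/frozen_at admin metadata)."""
    created_at, frozen_at = git_timestamps(repo_dir)
    return {
        "name": os.path.basename(os.path.abspath(repo_dir)),
        "created_at": created_at,
        "frozen_at": frozen_at,
    }
```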
With this figured out, exporting to a dtool dataset is very simple and should work in analogy to the archive export, https://docs.datalad.org/en/stable/generated/man/datalad-export-archive.html
Therefore, as a first step, we want to create a simple export-dtool extension derived from https://github.com/datalad/datalad-extension-template
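The snapshot annotation mentioned above could be assembled as follows; a sketch in which the annotation keys are made up for illustration (dtool supports attaching such key-value annotations to a dataset, e.g. via `dtool annotation set` or the Python API):

```python
import subprocess

def source_annotation(repo_dir, dataset_uuid):
    """Describe the datalad dataset a dtool export was derived from.

    `dataset_uuid` is the datalad dataset's UUID; the commit is taken
    from git HEAD.  The key names here are illustrative only, not a
    fixed schema.
    """
    commit = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return {
        "derived_from_datalad_dataset_id": dataset_uuid,
        "derived_from_commit": commit,
    }
```

With UUID and commit recorded on the dtool side, a later re-import can unambiguously identify which snapshot of the datalad dataset it is looking at.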