Brain storming: datalad (run) metadata and description rendering

mih commented 5 years ago

Here is a reproducible analysis made by @adswa using open source tools and public data: https://github.com/adswa/multimatch_forrest

Pretty much all steps were captured via datalad, but the presentation is suboptimal, because it is solely based on a README that does not contain much information.

We should be able to compose suitable README content from the dataset (and its metadata) for such an analysis:

purpose
data/code dependencies
performed/captured data transformations (i.e. analysis steps)
summary of data component semantics (this is a YODA-style dataset)
pointer to datalad and information on how to work with such a dataset (could be a simple link to an info page on datalad.org; not something scattered around the docs, but a dedicated page à la "you came here because you found a datalad dataset, but don't know yet what it is and what you can do with it")

Here are a few more datasets that "look" better, but the looks are also just based on a manually composed README:

Of course not everything can be inferred and automated, but being able to generate valid and informative description snippets would substantially lower the bar for having nice READMEs.

adswa commented 5 years ago

I would be more than happy to help wherever I can.

yarikoptic commented 5 years ago

Here is a dump of ideas for what I might have found useful if I get to some dataset I don't know.

From datalad run records it might be feasible and valuable to compile the list of commands used.
If we add some kind of tagging for executed commands (preprocessing, mvpa, localizer, etc.), it might be worth adding graphs with stats on how many files were generated and from how many inputs (when those were provided).
List registered containers.
If containers-run wasn't just preparing the full command to run, but also left some metadata field with the name of the container, we could add references to the corresponding commits.
The ultimate goal with all of those would then be to be able to jump to the corresponding commit for any action of interest (e.g. preprocessing).
Automatically adding basic stats to the README (# of subdatasets, sizes of files under annex, size of generated files that have recorded commands to generate them, i.e. are reproducible, etc.) would also be useful IMHO.
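
To make the first idea a bit more concrete, here is a minimal sketch of compiling the list of commands from run records. It assumes that each `datalad run` commit message carries a JSON record between the "Do not change lines below/above" marker lines, with `cmd`, `inputs` and `outputs` keys. The function names (`read_commits`, `extract_run_record`, `render_command_list`) and the output line format are made up for illustration and are not part of datalad.

```python
import json
import re
import subprocess

# Marker lines assumed to surround the JSON run record in a commit message.
RECORD_RE = re.compile(
    r"=== Do not change lines below ===\n(.*?)\n"
    r"\^\^\^ Do not change lines above \^\^\^",
    re.DOTALL,
)


def read_commits(repo_path="."):
    """Return (sha, message) pairs from `git log` (hypothetical helper)."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%H%x00%B%x01"],
        check=True, capture_output=True, text=True,
    ).stdout
    commits = []
    for chunk in out.split("\x01"):
        chunk = chunk.strip("\n")
        if chunk:
            sha, _, message = chunk.partition("\x00")
            commits.append((sha, message))
    return commits


def extract_run_record(commit_message):
    """Return the embedded run record as a dict, or None if there is none."""
    match = RECORD_RE.search(commit_message)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def render_command_list(commits):
    """Turn (sha, message) pairs into README bullet lines, one per run record."""
    lines = []
    for sha, message in commits:
        record = extract_run_record(message)
        if record is None:
            continue  # not a run commit
        cmd = record.get("cmd", "")
        if isinstance(cmd, list):  # tolerate a command stored as a list
            cmd = " ".join(cmd)
        n_in = len(record.get("inputs") or [])
        n_out = len(record.get("outputs") or [])
        lines.append(
            f"- `{cmd}` ({n_in} inputs, {n_out} outputs, commit {sha[:7]})"
        )
    return lines
```

Something like `print("\n".join(render_command_list(read_commits("."))))` would then give a snippet to paste into a README. Keeping the short commit hash on each line is what would allow jumping from a listed action to its commit.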

yarikoptic commented 2 years ago

The transfer alerted me to this nice old idea from @mih. I think the issue might move, but it should not be deprecated! Eventually we should arrive at such a metadata representation. I wonder, though, whether we should first take some specific domain/standard to see what is missing, e.g. BIDS and neuroimaging data.

datalad / datalad-deprecated

Brain storming: datalad (run) metadata and description rendering #76