LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.
https://childproject.readthedocs.io
MIT License

Tracking dataset versions #167

Open lucasgautheron opened 3 years ago

lucasgautheron commented 3 years ago

Is your feature request related to a problem? Please describe.

Standards may change, and it is important to keep track of which versions of the package a dataset is compatible with. For instance, this would allow upgrade scripts (e.g., a script that upgrades a dataset from version A to a later version B).
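To make the idea concrete, here is a minimal sketch of what such tracking could look like: the dataset records the package version it was last upgraded to, and an upgrade tool compares it against the installed package before applying migrations. The file location, helper names, and version numbers are illustrative assumptions, not part of ChildProject.

```python
# Hypothetical sketch of dataset version tracking; the file layout and helpers
# below are illustrative assumptions, not part of ChildProject.
from pathlib import Path

from packaging.version import Version  # pip install packaging

CURRENT_PACKAGE_VERSION = Version("0.0.2")  # e.g. taken from the installed package
VERSION_FILE = Path("metadata") / "package_version.txt"  # assumed location


def dataset_version(dataset_path: Path) -> Version:
    """Return the package version recorded in the dataset (0.0.0 if absent)."""
    version_file = dataset_path / VERSION_FILE
    if not version_file.exists():
        return Version("0.0.0")
    return Version(version_file.read_text().strip())


def upgrade_dataset(dataset_path: Path) -> None:
    """Bring the dataset up to the current package version, then stamp it."""
    recorded = dataset_version(dataset_path)
    if recorded >= CURRENT_PACKAGE_VERSION:
        print(f"{dataset_path}: already at {recorded}, nothing to do")
        return
    # ... apply the migration steps from `recorded` to CURRENT_PACKAGE_VERSION ...
    (dataset_path / VERSION_FILE).write_text(f"{CURRENT_PACKAGE_VERSION}\n")
    print(f"{dataset_path}: upgraded {recorded} -> {CURRENT_PACKAGE_VERSION}")
```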

Describe the solution you'd like

alecristia commented 3 years ago

this sounds great.

are you thinking of providing instructions in the docs, so that all of us data generators remember to do this every time we make a major change as well? For instance, imagine a living dataset like Marvin's, where additional data gets added every couple of weeks. Imagine next that the user makes a mistake (e.g. forgets to fill in a key field in the metadata) -- does data get validated during pushes? Could the user be reminded to tag data pushes with major or minor version names, so that they can revert to a version that, e.g., didn't have an error introduced by a push? It occurs to me that this may be a topic for a vignette/instructional video, so perhaps we don't need to discuss it explicitly in the docs.

as for the datasets in our team, are we already considering them when we do updates to packages, i.e. checking which used to validate but now break?

lucasgautheron commented 3 years ago

> imagine a living dataset like Marvin's, where additional data gets added every couple of weeks. Imagine next that the user makes a mistake (e.g. forgets to fill in a key field in the metadata) -- does data get validated during pushes?

This is not something that we enforce in general, but it is what we do for our own datasets, using continuous integration with Travis CI, as you can see here for instance: https://travis-ci.com/github/LAAC-LSCP/namibia-data/builds

Travis automatically runs the tests on each commit, so if a change breaks the dataset, we get notified. But if the package is upgraded in a way that breaks compatibility with older datasets, the tests may fail until the dataset is upgraded. So I set up automatic weekly tests on our datasets, to make sure that they still pass even with the latest version of the package.

However, the Travis tests currently only check the metadata, not the annotations or the recordings, because it would be too slow to download them. This is a direction for improvement.
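For reference, the check such a CI job runs could be as simple as the script below. It assumes the package exposes a ChildProject class whose validate() method returns lists of errors and warnings, roughly as in the documentation; the exact API may differ between versions, so treat this as a sketch rather than the actual test suite.

```python
# Sketch of a CI check that fails the build when the dataset metadata is invalid.
# Assumes ChildProject's Python API roughly as documented (a ChildProject class
# with read() and validate() returning lists of errors and warnings); adapt to
# the version actually installed.
import sys

from ChildProject.projects import ChildProject

project = ChildProject(".")  # the dataset is the repository being tested
project.read()               # load and parse the metadata

errors, warnings = project.validate()

for warning in warnings:
    print(f"warning: {warning}")
for error in errors:
    print(f"error: {error}", file=sys.stderr)

# A non-zero exit code makes Travis mark the commit as failing.
sys.exit(1 if errors else 0)
```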

> Could the user be reminded to tag data pushes with major or minor version names, so that they can revert to a version that, e.g., didn't have an error introduced by a push? It occurs to me that this may be a topic for a vignette/instructional video, so perhaps we don't need to discuss it explicitly in the docs.

Correct me if I'm wrong, but since datalad uses git, this is something that can already be achieved with tagging? E.g., you can run the tests, and if they pass, create a tag (such as v0.0.1) for the current tree.

But maybe we could provide tools to perform all these steps at once (run the tests, create a tag if they pass).
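A rough sketch of what such a tool might do, under the same assumptions about the validation API as above; the tag name is arbitrary, and plain git tagging is used here, which a datalad dataset supports since it sits on top of git.

```python
# Sketch of a "run the tests, tag if they pass" helper. The validation API is
# assumed as above; the tag name and the use of plain `git tag` are arbitrary
# choices for illustration.
import subprocess
import sys

from ChildProject.projects import ChildProject


def validate_and_tag(dataset_path: str, tag: str) -> None:
    project = ChildProject(dataset_path)
    project.read()
    errors, warnings = project.validate()

    if errors:
        print("validation failed, not tagging:", file=sys.stderr)
        for error in errors:
            print(f"  {error}", file=sys.stderr)
        sys.exit(1)

    # Tag the current tree so this known-good state can be recovered later.
    subprocess.run(
        ["git", "-C", dataset_path, "tag", "-a", tag, "-m", f"passed validation ({tag})"],
        check=True,
    )
    print(f"created tag {tag} on {dataset_path}")


if __name__ == "__main__":
    validate_and_tag(".", "v0.0.1")
```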

(Note that since we are using Travis CI, we can see directly from GitHub which commits in our datasets passed the tests and which didn't.)

> as for the datasets in our team, are we already considering them when we do updates to packages, i.e. checking which used to validate but now break?

As I said, yes, using weekly, automatic Travis tests.

alecristia commented 3 years ago

this sounds great. What do we need to implement it in our team? Is there anything you'd like me to do to stay compatible with this proposal?