leifdenby commented 3 years ago

to avoid breaking dependent code what do people think about versioning the catalog? It would be nice to be able to refer to a specific git-tagged version when fetching the catalog.

I think with what we've got now we could create a version v1.0.0. If we adopt semver (https://semver.org/) we could update the MAJOR version we if we remove/add an endpoint and update the MINOR version when we add new endpoints, and finally update the PATCH version when we make fixed to existing endpoints. Adding/removing endpoint arguments could also be considered a breaking change I think.

Thoughts?

d70-t commented 3 years ago

I'll try to write down some of my thoughts on that topic.

Requirements

We need a form of stable references to a catalog, as changes will happen, but users certainly want to re-run their old scripts exactly as they are with exactly the original data.
We need to be able to release updates to the catalog quickly, such that newly available datasets can be used properly without delay or workarounds.
We want people to use the newest available data by default.

Semver for the intake catalog

If a new dataset is added to the catalog, this should be a non-breaking change which adds functionality -> MINOR change
- except a user enumerates the catalog and operates on the result of that enumeration, then this could be a breaking change -> MAJOR change
If a dataset is updated, this is quite probably a breaking change for some users (even if it is only correcting a typo) -> MAJOR change
Datasets should not be removed. If they are -> MAJOR change
Datasets should ideally not be renamed, if they are -> MAJOR change
Currently there are substantial changes happening to files on the Aeris server (even if they already have a DOI allocated) these would be a MAJOR change for semver but can't be addressed as the catalog is not touched (to be fair, that's an issue which must be solved outside of the catalog)
I see very little opportunities for PATCH updates, but of course, that's not required 🤷

Current status

Currently there are some active efforts to fill the catalog with more and more data around HALO, so I'd expect quite a few more changes in the next days. I am also not yet sure if the catalog hierarchy is already settled firmly enough so that renaming isn't necessary anymore, but I hope that we can reach that state soon.

Thoughts

If we adopt semver, I think that we'll have to increase versions quite rapidly in order to be able to both comply with semver and provide new datasets quickly. At the current state, it is already possible to access the catalog via the commit hash (i.e. https://raw.githubusercontent.com/eurec4a/eurec4a-intake/b6efdf3c57df9cbea014989b51a1c956d27c136c/catalog.yml) so in principle there are persistent identifiers available already. However, directly referring to that url is cumbersome, probably we'd have to add support for this into get_intake_catalog. I agree that semver versions look prettier, but I am not yet convinced that they provide more benefit than additional maintainance cost.

observingClouds commented 3 years ago

You might want to read through the possibility of packaging as well

d70-t commented 1 year ago

Ok, I think I've to re-warm this thread. As @observingClouds and @fjansson mentionded, for some journals, it's now mandatory to provide DOIs to reference data. Although I still doubt the technical usefullness of having a DOI on specific versions of the intake catalog (e.g. because the data gets moved away and thus old versions of the catalog will become broken), we might need them because of those requirements. On the upside, we might also use just the collection DOI if we only need one DOI and this can then be updated to a newer version.

In order to get some DOIs (e.g. via zenodo) we need to make releases. And for making releases, it seems to be useful to have version numbers. These days, the rate of major changes (e.g. data version changed) seems to be much lower than when this issue was opened, so I'm now thinking that applying semver might become more useful. Would you agree?

I'd try to formalize some rules for changing version numbers, my try would be:

add a new endpoint -> minor update
delete a previously existing endpoint -> major update
change the content returned by an endpoint (e.g. data version updates) -> major update
change the source location of an endpoint (e.g. from one server to another) -> minor update (❗)
moving / renaming an endpoint should be considered as adding and deleting -> major update
changing endpoint metadata (e.g. description) -> patch update
changes to CI -> patch update
changes to requirements.txt (and similar) -> patch update (❓)

☝️ as an endpoint, I'd consider anything specifying some dataset in intake language, e.g.: cat["foo"]["bar"](arg="baz") would be an endpoint. A notable consequence would be, that adding arguments can be minor if the defaults would result in the same dataset being retrieved if the new arguments are not specified by the user.

❗ I'm wondering if moving the source location should be major update, if the dataset gets removed at the old location. But that's maybe undecidable, because removal could be at a later point in time... So probably this is just another variant of the general problem, that (without CIDs) old versions of the catalog will become broken over time due to link rot.

❓ this probably depends on how we see the requirements file. If it's a mostly internal thing (to drive the CI) I'd definitely go for the same level as other changes to CI, if we see it as a user facing API, this might be minor or even major...

Another question would be, how to decide when to do a release? I'd probably go for on-demand first, because weekly / monthly is probably too often in most cases and monthly / yearly is likey to sparse if people want to get things out... We might want to think about requiring successfull checks? 👈 I kind of like that, but that may lead to long standing blocking situations if servers are offline or we can't find some data quickly but don't want to remove their endpoints.

fjansson commented 1 year ago

Sounds good to me, I like the thought of having the catalog citable with a collection DOI. The semver rules above sound like a reasonable way of achieving that. Link rot of old catalog versions seems somewhat unavoidable, unless the data itself is also properly, permanently archived and given a DOI.

d70-t commented 1 year ago

unless the data itself is also properly, permanently archived and given a DOI.

Having a DOI on the data itself unfortunately is not a solution here. As by the DataCite documentation DOIs should resolve to a landing page (and not the content) and furthermore, "Humans should be able to reach the item being described from the landing page"... So by design there's no way to reference the content through a DOI in a machine readable way. Thus, to make the eurec4a-intake catalog work, we've had to circumvent the DOI system and had to go back to plain old links even for data which actually has a DOI.

In our setup, DOIs really are pretty useless things 🤷‍♂️ ...

observingClouds commented 1 year ago

The rules that @d70-t is suggesting seem reasonable to me as well. To keep track of the changes between versions and ultimately decide on its increment, we should start using a CHANGELOG or whatsnew.md and make it mandatory for all PRs. Otherwise we easily loose track of the changes and have a hard time to figure out the version of the release candidate.

Comments on the nuances of the rules

if we see it as a user facing API, this might be minor or even major

I argue that changes to the requirements (or similar) are no minor/major increase, because this catalog is not installable and we should assume that users take care of the dependencies as well. Practically, we will likely not even release a new version just because of a change to the requirements.

I'm wondering if moving the source location should be major update, if the dataset gets removed at the old location.

This depends if we trust the move. The past experience has shown that these moves are often not communicated and the maintainers of this project find out only afterwards due to failing tests. It might be the easiest/safest option to increase the major version in these cases.

how to decide when to do a release? I'd probably go for on-demand first

Me too and in case of major changes to prevent citations of an outdated catalog.

Usage of collection DOI for citation

This is twofold:

Ignoring the manifests here and encouraging the citation of the collection DOI might be a better practise in this particular (!) case from a user perspective. Users will be directed to the current and working catalog and will gain access to the data. If a dataset has been removed, both a version-DOI and the collection-DOI will return catalogs that fail.
For reproducibility, the user might want to access a particular version of the data. This information is only preserved in the version-DOI.

My suggestion would be to use the collection DOI whenever the DOI is given directly, e.g. in data availability statements: The data used in this paper and all its future versions can be accessed at doi.org/XX.XXX/XXXXXX. When the data is cited, e.g. in the body of the manuscript, I would tend to use the exact DOI: We use the XYZ-Dataset from the EUREC4A-Intake catalog (Kölling et al., 2022).

leifdenby commented 1 year ago

semver might become more useful. Would you agree?

Great! YES! I think we need to introduce a CHANGELOG though once we start versioning. Version numbers in them selves aren't very useful without a changelog. Although the commit history contains the same info it is much more convenient to have a text file. I tend do follow something like xarray (https://github.com/pydata/xarray/blob/main/doc/whats-new.rst), for example https://github.com/EUREC4A-UK/lagtraj/blob/master/CHANGELOG.md. This is also a useful reference: https://keepachangelog.com/en/1.0.0/

I'm wondering if moving the source location should be major update, if the dataset gets removed at the old location. But that's maybe undecidable, because removal could be at a later point in time... So probably this is just another variant of the general problem, that (without CIDs) old versions of the catalog will become broken over time due to link rot.

Another option would be add alias-links when moving things and then introduce a form of deprecation (removing aliases in the next major version). I would suggest if endpoints are moved we should view that in the same way as deletions. If something is no longer in the same place it will function equivalently to not being there.

changes to requirements.txt (and similar) -> patch update (question)

I don't see any harm in these being in patch updates.

Another question would be, how to decide when to do a release? I'd probably go for on-demand first, because weekly / monthly is probably too often in most cases and monthly / yearly is likey to sparse if people want to get things out... We might want to think about requiring successfull checks? point_left I kind of like that, but that may lead to long standing blocking situations if servers are offline or we can't find some data quickly but don't want to remove their endpoints.

I think based on demand for releases sounds like a good idea. But really it will come down to who has time to do this maintenance work. Similarly for requiring that tests pass. I would opt for tests always needing to pass, otherwise the maintenance burden can become very big.

observingClouds commented 1 year ago

Another option would be add alias-links when moving things and then introduce a form of deprecation (removing aliases in the next major version). (@leifdenby)

This is a great idea if a dataset is moved within the catalog e.g. from cat.X.Z to cat.X.Y. If the source location is moved though, I don't think it will be possible. Most of the time it seems that datasets just disappear and we just notice that afterwards when our tests fail. Because we do not have access to the hosts, we cannot influence the time of deletion or introduce some grace-period.

I would opt for tests always needing to pass (@leifdenby)

Ideally, yes, but what do we do with data sources that are unstable (e.g. #131 )? Shall we remove those datasets from the catalog after a grace-period? I think users might still benefit from those entries.

observingClouds commented 1 year ago

Regarding the removal of catalog entries, I thought we could also write an additional intake driver that allows us to add messages into the catalog. It is very much work in progress and I don't know if I have time to follow-up on this much further, but I'm curious what you guys think about the idea https://github.com/observingClouds/intake_warning

d70-t commented 1 year ago

I like the idea :+1:. We'd have to ensure though, that people will install the warning driver (but that should be possible and if they don't have it, the worst thing which could happen is, they's get a less useful warning).

observingClouds commented 12 months ago

Hi everyone (@leifdenby @d70-t),

What is hindering us to move this forward and publish our first version? I have a paper in the last stages before it gets published and I could start as an example. As long as we only have http-links in the catalog and don't have control over the linked datasets the version might be less meaningful, but it might be a step forward?! I don't see us to provide/link all datasets in an unmutable way in the near future. Those datasets in the catalog that do have a DOI I will also cite explicitly, something that we probably should encourage on e.g. the readme page and/or on howto.eurec4a.eu as well.

Any thoughts? Could we try to release a first version by the end of the week? This might help some other papers as well (e.g. publications of @fjansson,@PouriyaTUDelft)

Cheers, Hauke

fjansson commented 12 months ago

I'd like having a DOI for the intake catalog. The Cloud Botany paper is near the proof stage now, I'd happily cite the DOI there, if we can have it within a few days :)