eurec4a / eurec4a-intake

Intake catalogue for EUREC4A field campaign datasets

versioning catalog #30

Open · leifdenby opened this issue 3 years ago

leifdenby commented 3 years ago

To avoid breaking dependent code, what do people think about versioning the catalog? It would be nice to be able to refer to a specific git-tagged version when fetching the catalog.

I think with what we've got now we could create a version v1.0.0. If we adopt semver (https://semver.org/) we could update the MAJOR version if we remove or rename an endpoint, update the MINOR version when we add new endpoints, and finally update the PATCH version when we make fixes to existing endpoints. Adding/removing endpoint arguments could also be considered a breaking change I think.

Thoughts?

d70-t commented 3 years ago

I'll try to write down some of my thoughts on that topic.

Requirements

  • Semver for the intake catalog

Current status

Currently there are some active efforts to fill the catalog with more and more data around HALO, so I'd expect quite a few more changes in the coming days. I am also not yet sure whether the catalog hierarchy is settled firmly enough that renaming won't be necessary anymore, but I hope that we can reach that state soon.

Thoughts

If we adopt semver, I think that we'll have to increase versions quite rapidly in order to both comply with semver and provide new datasets quickly. At the current state, it is already possible to access the catalog via the commit hash (i.e. https://raw.githubusercontent.com/eurec4a/eurec4a-intake/b6efdf3c57df9cbea014989b51a1c956d27c136c/catalog.yml), so in principle persistent identifiers are already available. However, directly referring to that URL is cumbersome; we'd probably have to add support for this to get_intake_catalog. I agree that semver versions look prettier, but I am not yet convinced that they provide more benefit than additional maintenance cost.
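For illustration, pinning to a commit already works with plain intake today — a minimal sketch (the commit hash is the one from the URL above; any tag or branch name would work in its place):

```python
import intake

# Open the catalog pinned to a specific commit for reproducibility.
cat = intake.open_catalog(
    "https://raw.githubusercontent.com/eurec4a/eurec4a-intake/"
    "b6efdf3c57df9cbea014989b51a1c956d27c136c/catalog.yml"
)
print(list(cat))  # top-level entries of the pinned catalog
```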

observingClouds commented 3 years ago

You might want to read through the discussion on the possibility of packaging as well.

d70-t commented 1 year ago

OK, I think I have to revive this thread. As @observingClouds and @fjansson mentioned, for some journals it's now mandatory to provide DOIs to reference data. Although I still doubt the technical usefulness of having a DOI on specific versions of the intake catalog (e.g. because the data gets moved away and thus old versions of the catalog will become broken), we might need them because of those requirements. On the upside, we might also use just the collection DOI if we only need one DOI, and this can then be updated to a newer version.

In order to get some DOIs (e.g. via zenodo) we need to make releases. And for making releases, it seems to be useful to have version numbers. These days, the rate of major changes (e.g. data version changed) seems to be much lower than when this issue was opened, so I'm now thinking that applying semver might become more useful. Would you agree?


I'd try to formalize some rules for changing version numbers; my attempt would be:

  • removing or changing endpoints ☝️ ❗ -> major update
  • adding new endpoints ☝️ -> minor update
  • changes to requirements.txt (and similar) -> patch update ❓

☝️ as an endpoint, I'd consider anything specifying some dataset in intake language, e.g. cat["foo"]["bar"](arg="baz") would be an endpoint. A notable consequence would be that adding arguments can be minor if the defaults result in the same dataset being retrieved when the new arguments are not specified by the user.
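To make that concrete, a hypothetical sketch (the entries foo/bar and the parameters arg and date are made up; only the pinned URL is real):

```python
import intake

# cat could be any opened catalog; here we reuse the pinned URL from above.
cat = intake.open_catalog(
    "https://raw.githubusercontent.com/eurec4a/eurec4a-intake/"
    "b6efdf3c57df9cbea014989b51a1c956d27c136c/catalog.yml"
)

# Both of these count as "endpoints" in the sense described above:
ds_default = cat["foo"]["bar"]().to_dask()            # defaults only
ds_explicit = cat["foo"]["bar"](arg="baz").to_dask()  # explicit argument

# If a release adds a new parameter `date` with a default value, and
# cat["foo"]["bar"]() still returns the same dataset as before, the
# addition is only a MINOR change; passing date=... opts in explicitly.
ds_subset = cat["foo"]["bar"](date="2020-02-05").to_dask()
```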

❗ I'm wondering if moving the source location should be a major update, if the dataset gets removed at the old location. But that's maybe undecidable, because the removal could happen at a later point in time... So probably this is just another variant of the general problem that (without CIDs) old versions of the catalog will become broken over time due to link rot.

❓ this probably depends on how we see the requirements file. If it's a mostly internal thing (to drive the CI), I'd definitely go for the same level as other changes to CI; if we see it as a user-facing API, this might be minor or even major...


Another question would be how to decide when to do a release. I'd probably go for on-demand first, because weekly/monthly is probably too often in most cases and monthly/yearly is likely too sparse if people want to get things out... We might want to think about requiring successful checks? 👈 I kind of like that, but it may lead to long-standing blocking situations if servers are offline or we can't find some data quickly but don't want to remove their endpoints.

fjansson commented 1 year ago

Sounds good to me, I like the thought of having the catalog citable with a collection DOI. The semver rules above sound like a reasonable way of achieving that. Link rot of old catalog versions seems somewhat unavoidable, unless the data itself is also properly, permanently archived and given a DOI.

d70-t commented 1 year ago

unless the data itself is also properly, permanently archived and given a DOI.

Having a DOI on the data itself unfortunately is not a solution here. According to the DataCite documentation, DOIs should resolve to a landing page (and not to the content), and furthermore, "Humans should be able to reach the item being described from the landing page"... So by design there's no way to reference the content through a DOI in a machine-readable way. Thus, to make the eurec4a-intake catalog work, we had to circumvent the DOI system and go back to plain old links, even for data which actually has a DOI.
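This is easy to observe — a minimal sketch (assuming the requests package, and using the collection DOI that is minted later in this thread):

```python
import requests

# Resolving a DOI follows redirects to an HTML landing page,
# not to a machine-readable artifact such as catalog.yml.
r = requests.get("https://doi.org/10.5281/zenodo.8422321")
print(r.url)                          # ends up at a Zenodo landing page
print(r.headers.get("Content-Type"))  # text/html..., not YAML
```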

In our setup, DOIs really are pretty useless things 🤷‍♂️ ...

observingClouds commented 1 year ago

The rules that @d70-t is suggesting seem reasonable to me as well. To keep track of the changes between versions and ultimately decide on the version increment, we should start using a CHANGELOG or whatsnew.md and make it mandatory for all PRs. Otherwise we easily lose track of the changes and have a hard time figuring out the version of the release candidate.

Comments on the nuances of the rules

if we see it as a user facing API, this might be minor or even major

I argue that changes to the requirements (or similar) should not trigger a minor/major increase, because this catalog is not installable and we should assume that users take care of the dependencies themselves. Practically, we will likely not even release a new version just because of a change to the requirements.

I'm wondering if moving the source location should be major update, if the dataset gets removed at the old location.

This depends on whether we trust the move. Past experience has shown that these moves are often not communicated, and the maintainers of this project only find out afterwards due to failing tests. It might be the easiest/safest option to increase the major version in these cases.

how to decide when to do a release? I'd probably go for on-demand first

Me too, and additionally in the case of major changes, to prevent citations of an outdated catalog.

Usage of collection DOI for citation

This is twofold:

  • My suggestion would be to use the collection DOI whenever the DOI is given directly, e.g. in data availability statements: "The data used in this paper and all its future versions can be accessed at doi.org/XX.XXX/XXXXXX."
  • When the data is cited, e.g. in the body of the manuscript, I would tend to use the exact DOI: "We use the XYZ-Dataset from the EUREC4A-Intake catalog (Kölling et al., 2022)."

leifdenby commented 1 year ago

semver might become more useful. Would you agree?

Great! YES! I think we need to introduce a CHANGELOG though once we start versioning. Version numbers in themselves aren't very useful without a changelog. Although the commit history contains the same info, it is much more convenient to have a text file. I tend to follow something like xarray (https://github.com/pydata/xarray/blob/main/doc/whats-new.rst), for example https://github.com/EUREC4A-UK/lagtraj/blob/master/CHANGELOG.md. This is also a useful reference: https://keepachangelog.com/en/1.0.0/

I'm wondering if moving the source location should be major update, if the dataset gets removed at the old location. But that's maybe undecidable, because removal could be at a later point in time... So probably this is just another variant of the general problem, that (without CIDs) old versions of the catalog will become broken over time due to link rot.

Another option would be to add alias links when moving things and then introduce a form of deprecation (removing aliases in the next major version). I would suggest that if endpoints are moved, we should view that in the same way as deletions: if something is no longer in the same place, it functions equivalently to not being there.

  • changes to requirements.txt (and similar) -> patch update ❓

I don't see any harm in these being in patch updates.

Another question would be how to decide when to do a release. I'd probably go for on-demand first, because weekly/monthly is probably too often in most cases and monthly/yearly is likely too sparse if people want to get things out... We might want to think about requiring successful checks? 👈 I kind of like that, but it may lead to long-standing blocking situations if servers are offline or we can't find some data quickly but don't want to remove their endpoints.

I think releasing based on demand sounds like a good idea. But really it will come down to who has time to do this maintenance work. Similarly for requiring that tests pass: I would opt for tests always needing to pass, otherwise the maintenance burden can become very big.

observingClouds commented 1 year ago

Another option would be add alias-links when moving things and then introduce a form of deprecation (removing aliases in the next major version). (@leifdenby)

This is a great idea if a dataset is moved within the catalog, e.g. from cat.X.Z to cat.X.Y. If the source location is moved, though, I don't think it will be possible. Most of the time it seems that datasets just disappear and we only notice afterwards when our tests fail. Because we do not have access to the hosts, we cannot influence the time of deletion or introduce some grace period.

I would opt for tests always needing to pass (@leifdenby)

Ideally, yes, but what do we do with data sources that are unstable (e.g. #131)? Shall we remove those datasets from the catalog after a grace period? I think users might still benefit from those entries.

observingClouds commented 1 year ago

Regarding the removal of catalog entries, I thought we could also write an additional intake driver that allows us to add messages to the catalog. It is very much a work in progress and I don't know if I'll have time to follow up on this much further, but I'm curious what you guys think about the idea: https://github.com/observingClouds/intake_warning
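The gist of the idea, as a minimal hypothetical sketch (not the actual intake_warning implementation — the class name and driver name are made up, following the intake <2 driver API):

```python
import warnings
from intake.source.base import DataSource, Schema

class WarningSource(DataSource):
    """A catalog entry that carries a message instead of data, e.g. for
    endpoints whose upstream source has moved or disappeared."""
    name = "warning"        # driver name to reference from catalog.yml
    version = "0.0.1"
    container = "other"
    partition_access = False

    def __init__(self, message, metadata=None):
        self.message = message
        super().__init__(metadata=metadata)

    def _get_schema(self):
        # Emitted whenever the entry is inspected or read.
        warnings.warn(self.message, UserWarning)
        return Schema(datashape=None, dtype=None, shape=None,
                      npartitions=0, extra_metadata={})
```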

d70-t commented 1 year ago

I like the idea 👍. We'd have to ensure, though, that people install the warning driver (but that should be possible, and if they don't have it, the worst thing that could happen is they'd get a less useful warning).

observingClouds commented 12 months ago

Hi everyone (@leifdenby @d70-t),

What is hindering us from moving this forward and publishing our first version? I have a paper in the last stages before it gets published and I could start as an example. As long as we only have http links in the catalog and don't have control over the linked datasets, the version might be less meaningful, but it might be a step forward?! I don't see us providing/linking all datasets in an immutable way in the near future. Those datasets in the catalog that do have a DOI I will also cite explicitly, something that we should probably encourage on e.g. the readme page and/or on howto.eurec4a.eu as well.

Any thoughts? Could we try to release a first version by the end of the week? This might help some other papers as well (e.g. publications of @fjansson, @PouriyaTUDelft).

Cheers, Hauke

fjansson commented 12 months ago

I'd like to have a DOI for the intake catalog. The Cloud Botany paper is near the proof stage now; I'd happily cite the DOI there if we can have it within a few days :)

d70-t commented 11 months ago

I'll try to give it a shot. I'm however not yet sure what to put, especially in fields like license and authors... (see #147)

observingClouds commented 11 months ago

@d70-t before doing the first release, we should probably clean up the current CHANGELOG.md in some way. Not sure what the best solution is, but the easiest might be to empty it completely, with only the section headers remaining.

d70-t commented 11 months ago

@d70-t before doing the first release, we should probably clean up the current CHANGELOG.md in some way. Not sure what the best solution is, but the easiest might be to empty it completely, with only the section headers remaining.

Probably that comment came in too late... I just did what's been written in RELEASING...

observingClouds commented 11 months ago

Yeah sorry, but I think it is fine the way you did it. Thanks @d70-t so much for this afternoon/morning hack-session. I think we got a lot of things done and moved this project forward by a good margin.

d70-t commented 11 months ago

Here's the collection DOI. I think if we reference any, we should use this one (as discussed above, this gives us a chance of keeping up with the movement of datasets).

https://doi.org/10.5281/zenodo.8422321

observingClouds commented 11 months ago

So next, I think we should

  • convert our discussion here into a HOW_TO_RELEASE.md including both our version semantics and the actual steps on how to do a release

d70-t commented 11 months ago
  • convert our discussion here into a HOW_TO_RELEASE.md including both our version semantics and the actual steps on how to do a release

There's RELEASING.md, which I guess covers most of the semantics and the actual steps (I followed the steps while doing the 1.0.0 release). Probably we'll have to do another pass over this thread and RELEASING.md to check if it actually reflects the outcome of this thread.

d70-t commented 11 months ago

Probably we'll also want to have a more complete description text on the zenodo page for upcoming releases (I guess at least a mention of howto.eurec4a.eu would be good).

[Screenshot of the Zenodo release page, 2023-10-10]
observingClouds commented 8 months ago

linking https://github.com/intake/intake/issues/775