eurec4a / eurec4a-intake

Intake catalogue for EUREC4A field campaign datasets

Suggestion of CF compliance checker #53

Open observingClouds opened 3 years ago

observingClouds commented 3 years ago

Hi, I'm often using the IOOS compliance checker and just recently tried it on the command line. Wouldn't it be valuable to use it here (and maybe add e.g. a EUREC4A metadata compliance test) in the CI, or maybe even as a GitHub bot? It could report its findings, i.e. errors and warnings, in the pull request discussion without necessarily blocking a merge, leaving the judgement to the reviewers. This would also make it easy to find issues with e.g. int64 variables or missing units.

The code for the checker is available on GitHub
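For reference, the checker can also be called from Python, roughly like this (a sketch following the compliance-checker README; the dataset path is a placeholder):

```python
from compliance_checker.runner import ComplianceChecker, CheckSuite

# discover the installed checker plugins (cf, acdd, ioos, ...)
check_suite = CheckSuite()
check_suite.load_all_available_checkers()

# run the CF check on a local file or OPeNDAP URL and write a JSON report
return_value, errors = ComplianceChecker.run_checker(
    "path/or/url/to/dataset.nc",  # placeholder
    ["cf"],                       # checker names
    0,                            # verbosity
    "normal",                     # criteria
    output_filename="report.json",
    output_format="json",
)
```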

d70-t commented 3 years ago

I think this could be a very valuable addition, in particular if it also includes further specific tests as you suggest. I did not check it thoroughly, but there seems to be a plugin system for the IOOS checker that we might want to jump on.

What we would need is a script which goes through all the items in the catalog and runs the check on each of them. This would probably be similar to the check for the availability of datasets that is currently in place. If the script generated its output as a set of HTML pages, one could also show the current state of the datasets in a more readable way.
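An untested sketch of such a script (the catalog filename, the walk depth and the report layout are assumptions):

```python
import intake
from compliance_checker.runner import ComplianceChecker, CheckSuite

check_suite = CheckSuite()
check_suite.load_all_available_checkers()

cat = intake.open_catalog("Catalog.yml")  # top-level catalog file (assumption)
for name, entry in cat.walk(depth=10).items():
    urlpath = entry.describe().get("args", {}).get("urlpath")
    if not isinstance(urlpath, str) or not urlpath.endswith(".nc"):
        continue  # skip zarr stores and parametrised entries for now
    # one HTML report per dataset, so the results can be published as static pages
    ComplianceChecker.run_checker(
        urlpath, ["cf"], 0, "normal",
        output_filename=f"reports/{name}.html",
        output_format="html",
    )
```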

I am wondering if there's an easy option to run zarr based datasets through the compliance checker as well? --- Apart from converting them to local netCDF files and then running them through the checker.

Another question relevant for netCDF based sources might be whether we want to run the checks through OPeNDAP only, or also over the original netCDF files.

Another thing to keep in mind might be that the catalog is often not created by the people creating the datasets. But if the compliance checker or any additional tests fail, it would most likely be the original dataset which needs to be fixed. Thus maybe this repository is not exactly the right place for this tool.

Maybe we could also run a script which periodically checks the usual places (i.e. the Aeris server, but maybe others as well) for new datasets and runs those checks on them, to create an overview of issues within the datasets even before the effort is made to include them in the catalog?

leifdenby commented 3 years ago

I am wondering if there's an easy option to run zarr based datasets through the compliance checker as well? --- Apart from converting them to local netCDF files and then running them through the checker.

It looks like we can simply construct a CheckSuite and pass the xr.Dataset directly to .run(...) (https://github.com/ioos/compliance-checker/blob/250f0b4a14ab8ffc034c342830b287758b660275/compliance_checker/runner.py#L76)

Maybe we could also run a script which periodically checks the usual places (i.e. the Aeris server, but maybe others as well) for new datasets and runs those checks on them, to create an overview of issues within the datasets even before the effort is made to include them in the catalog?

Yes, it is possible to set up periodic CI jobs: https://docs.github.com/en/actions/reference/events-that-trigger-workflows#scheduled-events; the syntax is basically a crontab. I can create a pull-request for that. How often should we run? Once daily at 3am say?
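Something along these lines for the trigger (the workflow name and the script path are placeholders; schedule times are in UTC):

```yaml
# .github/workflows/compliance.yml (hypothetical)
name: compliance-checks
on:
  schedule:
    - cron: "0 3 * * *"    # every day at 03:00 UTC
  workflow_dispatch: {}    # also allow manual runs
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: pip install compliance-checker intake intake-xarray
      - run: python scripts/run_compliance_checks.py  # hypothetical script
```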

d70-t commented 3 years ago

It looks like we can simply construct a CheckSuite and pass the xr.Dataset directly to .run(...)

That would be great! However, as far as I can tell, the ds in the referenced line is a netCDF4.Dataset (or derived type). Probably, we would have to do quite a bit of rewiring to convert all of it to xarray? -- Still, this would be awesome.
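Until then, the round-trip through a temporary netCDF file mentioned above could look roughly like this (untested sketch; the zarr URL is a placeholder):

```python
import tempfile

import xarray as xr
from compliance_checker.runner import ComplianceChecker, CheckSuite

check_suite = CheckSuite()
check_suite.load_all_available_checkers()

ds = xr.open_zarr("https://example.org/some-dataset.zarr")  # placeholder URL

with tempfile.TemporaryDirectory() as tmpdir:
    path = f"{tmpdir}/dataset.nc"
    ds.to_netcdf(path)  # materialise as netCDF so the checker can open it
    ComplianceChecker.run_checker(path, ["cf"], 0, "normal", output_format="text")
```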

I can create a pull-request for that. How often should we run? Once daily at 3am say?

That would also be nice! But are you currently thinking about checking the datasets from the intake catalog once per night, or about checking all datasets on Aeris and monitoring for changes?

leifdenby commented 3 years ago

That would also be nice! But are you currently thinking about checking the datasets from the intake catalog once per night, or about checking all datasets on Aeris and monitoring for changes?

I was just thinking of checking the intake catalog :) We don't store anything relating to the version/content of the data actually on AERIS in the catalog (I don't think?), so checking for a new version would mean implementing something for that. We could enforce that a version variable in the intake catalog matches the version attribute on the referenced and loaded dataset. That might be quite a simple check to add? Is this the kind of thing you were thinking of?
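A rough sketch of such a check (the "version" metadata key and the catalog path are assumptions, nothing like this exists in the catalog yet):

```python
import intake

cat = intake.open_catalog("Catalog.yml")  # top-level catalog file (assumption)
for name, entry in cat.walk(depth=10).items():
    declared = entry.describe().get("metadata", {}).get("version")  # hypothetical key
    if declared is None:
        continue  # no version recorded for this entry
    src = entry()        # instantiate the data source described by the entry
    ds = src.to_dask()   # lazily open the dataset (works for xarray-based sources)
    assert ds.attrs.get("version") == declared, f"version mismatch for {name}"
```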

Properly monitoring for changes on AERIS would require being able to "walk" the entire AERIS catalog to check for new/changed/deleted files, but I don't think we can do that?

d70-t commented 3 years ago

Hmm, I've lost quite a bit of trust in identifiers which are not cryptographic hashes of their contents. Things like

Oh, there has been an issue with the time variable, but don't worry, this has been fixed and the new dataset is now available at exactly the same DOI. The old dataset has been removed.

(paraphrased) do occur. I'd also expect that if someone is kind enough to change the version inside the dataset, the reference (i.e. the filename) to the dataset is changed as well (checking is of course better). And if that happens, the filename inside the intake catalog will either still point to the old dataset or point to nothing anymore.

That said, Aeris provides the ETag HTTP header, which could assist in checking whether a file has changed. I've also built some experimental code which walks all the files on Aeris (already about half a million, if I remember correctly). But fetching all of them (even with proper ETag handling) did not succeed in my previous tests, as the server was not happy about it 🤷 . So in principle, walking across entries on a server (be it Aeris or others) may work, but probably there should be fewer than 100 concurrent requests and probably there should be some form of additional rate limiting...
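A minimal sketch of an ETag-based change check (the helper, the stored mapping and the rate limit are all hypothetical):

```python
import time

import requests

def has_changed(url, known_etag, session):
    """Return True if the server's ETag differs from the one stored earlier."""
    etag = session.head(url, allow_redirects=True).headers.get("ETag")
    return etag is None or etag != known_etag

session = requests.Session()
known_etags = {}  # hypothetical mapping url -> ETag from a previous run
for url, etag in known_etags.items():
    if has_changed(url, etag, session):
        print(f"{url} changed, re-run the compliance checks")
    time.sleep(0.5)  # crude rate limiting to keep the server happy
```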

The nice thing about actively searching for datasets would be that one could inform authors earlier about potential issues regarding CF (or other) conventions and maybe increase the chance of getting them in the mood to still change some things :-) ... but probably that should also be part of the ingress checking of the data archive?

d70-t commented 3 years ago

One more note: if we can verify that a dataset did not change and it has already been checked for CF compliance (whether it passed or not), then this status will not change over time, so there is probably no need to check it over and over again.
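If we go that way, a small result cache keyed by the ETag would be enough to skip unchanged datasets (hypothetical layout):

```python
import json
from pathlib import Path

CACHE = Path("compliance_cache.json")  # hypothetical cache file

def load_cache():
    return json.loads(CACHE.read_text()) if CACHE.exists() else {}

def needs_check(url, etag, cache):
    # re-check only if this exact (url, ETag) pair has not been seen before
    return cache.get(url, {}).get("etag") != etag

def record_result(url, etag, passed, cache):
    cache[url] = {"etag": etag, "cf_compliant": passed}
    CACHE.write_text(json.dumps(cache, indent=2))
```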