Ouranosinc / miranda

A modern Python utility library for climate data collection and management
Apache License 2.0

Feature: Create CMIP, ISIMIP, CORDEX, and other metadata scraping operations #5

Closed: Zeitsperre closed this issue 2 years ago

Zeitsperre commented 4 years ago

Using either the netCDF4 or xarray libraries, create a means of scraping metadata from files based on the following approaches:
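
As one possible approach, a minimal per-file scraper using xarray might look like the sketch below (the facet handling and the example search path are assumptions, not a settled design):

```python
# Minimal sketch of per-file metadata scraping with xarray.
# Facet names and the example search path are illustrative assumptions.
from pathlib import Path

import xarray as xr


def scrape_attrs(path: Path) -> dict:
    """Collect global attributes and basic structure from a single NetCDF file."""
    with xr.open_dataset(path, decode_times=False) as ds:
        record = dict(ds.attrs)  # global attributes, e.g. CMIP/CORDEX facets
        record["variables"] = sorted(ds.data_vars)
        record["dimensions"] = {dim: int(size) for dim, size in ds.sizes.items()}
        record["path"] = str(path)
    return record


# Walk a directory tree and collect one record per file.
records = [scrape_attrs(p) for p in Path("/data/simulations").rglob("*.nc")]
```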

huard commented 4 years ago

This needs to be integrated with Mourad's plan to host everything on THREDDS. It also has to be compatible with the future search interface and ESGF storage. There is also a CEDA project looking at this, and it needs to be discussed with CRIM. This is high priority, but that should not mean rushing it.

Zeitsperre commented 4 years ago

The PAVICS crawler essentially does exactly this. It reads all facets and throws them into an Apache Spark (NoSQL) database. I agree that there are a lot of approaches for doing this and I like CEDA's approach. In this instance, I'm thinking of something for small scales (e.g. a few hundred/thousand files).

As this is simply a generic project, having it rely on an existing THREDDS server or a MySQL/PostgreSQL infrastructure is a bit overkill. SQLite3 would be good given the limited scope.
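
For example, a rough sketch using only the standard library; the table layout and facet columns are placeholders rather than a settled schema:

```python
# Sketch: persisting scraped facets in a local SQLite3 database (standard library only).
# The table layout and facet columns are placeholders, not a fixed schema.
import sqlite3

conn = sqlite3.connect("facets.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS files (
           path TEXT PRIMARY KEY,
           project TEXT,
           institution TEXT,
           source_id TEXT,
           experiment_id TEXT,
           variable TEXT,
           frequency TEXT
       )"""
)


def index_record(record: dict) -> None:
    """Insert or update one scraped-metadata record, keyed on file path."""
    conn.execute(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            record.get("path"),
            record.get("mip_era") or record.get("project_id"),
            record.get("institution_id"),
            record.get("source_id"),
            record.get("experiment_id"),
            record.get("variable_id"),
            record.get("frequency"),
        ),
    )
    conn.commit()
```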

huard commented 4 years ago

But then we'll have to maintain two crawlers, two databases, two clients, ... no ? Unless I misunderstand where this project falls in our operationalization plan?

huard commented 4 years ago

From a project management perspective, who's specifying the requirements for this? What are the deliverables exactly? Where are the use cases? Has anyone done a lit. review? For example: https://esgf.llnl.gov/esgf-media/2018-F2F/2018-12-06/ESGF-F2F-2018-PYESSV-Greenslade.pdf and https://github.com/ES-DOC/pyessv

Zeitsperre commented 4 years ago

I had done some looking around, spent a lot of time going through PyOuranos, and figured a reimplementation was needed. The requirements are written down in my notes, so there's nothing online yet. This project (from my perspective) loosely has the following goals (based on some of the design goals of PyOuranos):

  1. Simplify the workflow of common data handling, subsetting, and data transfers.
  2. Automate the creation of work logs recording actions performed and movements/modifications of files.
  3. Leverage OOP structures to create workable small-scale DB-like objects that do not depend on hard-coded server paths or a specific organisational scheme for data (i.e., it doesn't matter how we structure our data on our servers, but we should be able to cross-examine it based on filenames/metadata; see the sketch after this list).
  4. Share these tools with others to reduce the easy-to-automate work of subsetting data (either via PyPI or by porting some of them to bird processes).

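To illustrate goal 3, a small DB-like object could be built purely from filename facets, independent of server layout; the class, the CMIP6-like filename pattern, and the query helper below are assumptions for this sketch, not miranda's actual design.

```python
# Rough illustration of goal 3: a small DB-like object built from filename facets,
# independent of where the files live. The CMIP6-like filename pattern is an
# assumption for this sketch, not miranda's actual design.
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileFacets:
    variable: str
    table_id: str
    source_id: str
    experiment_id: str
    member_id: str
    grid_label: str
    path: Path

    @classmethod
    def from_path(cls, path: Path) -> "FileFacets":
        # e.g. tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
        parts = path.stem.split("_")
        return cls(*parts[:6], path=path)


def query(facets: list, **criteria: str) -> list:
    """Cross-examine a collection of files by facet values, regardless of layout."""
    return [f for f in facets if all(getattr(f, k) == v for k, v in criteria.items())]
```

A collection of such objects can then be filtered by facet, e.g. `query(facets, variable="tas", experiment_id="historical")`, without caring about how directories are laid out.
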
This project is something I could see members of SSC using and possibly others contributing to. I had done a bit of looking around but I never came across PyESSV. This might remove a lot of the code needed for handling ESGF metadata.

huard commented 4 years ago

We share the same goals, but this is meant to be a production service. It needs to be engineered, and that implies consultations with experts, a review of existing software, and an understanding of the roadmap of other interconnected projects; I don't think we have that.

Also, from my perspective, whatever we come up with has to be compatible with what the ESGF is cooking up at the moment since we want a unified catalog of local and remote files.

Zeitsperre commented 2 years ago

Following up on our conversation from a few weeks ago, there appear to be a few approaches to handling this issue in play right now:

Where I see Miranda fitting in is in performing the following:

The local needs (i.e. adding new data to our internal servers and providing a common, controlled-ish vocabulary for data outside the ESGF and ECMWF ecosystem) are what this library is focusing on for the moment. There's a very real possibility that these validation tools will be refactored elsewhere, possibly into xscen. To be determined.
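
As a hedged sketch of what such a controlled-ish vocabulary check could look like (the vocabulary contents and facet names are placeholders and do not reflect the actual cv module):

```python
# Hedged sketch of a controlled-vocabulary check for data outside the ESGF/ECMWF
# ecosystem; the vocabulary contents and facet names are placeholders and do not
# reflect miranda's actual cv module.
CONTROLLED_VOCABULARY = {
    "frequency": {"1hr", "day", "mon", "yr"},
    "realm": {"atmos", "land", "ocean", "seaIce"},
}


def validate_attrs(attrs: dict) -> list:
    """Return a list of problems found in a file's global attributes."""
    problems = []
    for facet, allowed in CONTROLLED_VOCABULARY.items():
        value = attrs.get(facet)
        if value is None:
            problems.append(f"missing facet: {facet}")
        elif value not in allowed:
            problems.append(f"invalid {facet!r}: {value!r}")
    return problems
```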

For now, the work towards this goal is present in #24, in the validators, decoders and cv modules. Once I can wrap my head around https://github.com/Ouranosinc/pavics-vdb/pull/46, I'll have a better idea of what to focus on moving forward.

Thanks again for your analysis!

Zeitsperre commented 2 years ago

The machinery needed for performing this in Miranda is now in the main branch. Closed with #24.