Technical discovery about CMR to help with ESDS transition

NASA-IMPACT / admg-backend

Apache License 2.0

Technical discovery about CMR to help with ESDS transition #539

Closed heidimok closed 1 year ago

heidimok commented 1 year ago

Context

ADMG transition/hand-off talks with ESDS have started, but there is still uncertainty around both the timeline and the future vision for CASEI after the hand-off, which makes it hard for us to prepare for that future. In an ideal world, it seems that ESDS would like CASEI's metadata to come from CMR instead of our backend so that they can maintain one system. However, the shape of CASEI data may or may not match that of CMR, so some details may not be clarified until time is spent really understanding both systems.

Problem

We are familiar with our system, but not CMR, which makes it hard to have productive conversations.

Acceptance Criteria

Light activity: DevSeed to build up our CMR knowledge to help support discussions. Be aware that CASEI is about airborne data, whereas CMR is geared towards satellite data.

[x] "light" initial exploration about CMR is conducted by an ADMG developer
[ ] Written down highlights and notes shared in a document focusing on: To what extent do we think CASEI and CMR can talk to each other?

These conclusions can be placed in a document for now, but will likely be included later in the shared wiki with ESDS. https://github.com/NASA-IMPACT/admg-backend/issues/538

naomatheus commented 1 year ago

The critical difference between CASEI and CMR data models is how they structure their metadata for search.

In CMR, the search is oriented around dataset characteristics, with the dataset's metadata serving as the primary entry point. Metadata includes spatial extent, temporal extents, science keywords, and DOIs. In contrast, CASEI's data model begins with higher-level features like focus area, geographical region, platform, instrument, etc., before drilling down into more specific dataset characteristics. In other words, CMR's search starts with the "what," while CASEI begins with the "how" or "where."

There is a sort of "inverse" relationship between the two data structures/systems. The challenge in integrating CASEI's data into CMR involves finding a way to flip this relationship while still maintaining the ability to search by the unique features of both databases.

Given that many of the same datasets can be found in both CMR and CASEI, we could adjust the earlier CMR Collection metadata object to include CASEI-specific features at the top level, similar to how CASEI structures its data:

{
    "CollectionMetadata": {
        "title": "titleDetails",
        "shortname": "shortNameDetails",
        "description": "descriptionDetails",
        "FocusAreas": "focusAreaDetails",
        "GeographicalRegions": "geographicalRegionDetails",
        "GeophysicalConcepts": "geophysicalConceptDetails",
        "MeasurementTypes": "measurementTypeDetails",
        "Platforms": "platformDetails",
        "Instruments": "instrumentDetails",
        "Campaigns": "campaignDetails"
    },
    "CollectionCitations": "citationDetails",
    "SpatialExtent": "spatialExtentDetails",
    // ... rest of the UMM metadata markers
}
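
As a rough illustration of the object above, the sketch below adds a top-level CollectionMetadata block to a UMM-like record while leaving the existing UMM keys untouched. The function name `extend_collection` and the sample values are hypothetical; only the field names are taken from the object above, and nothing here reflects an actual CMR API.

```python
from copy import deepcopy

# CASEI-style fields proposed for the top-level "CollectionMetadata" block.
CASEI_FIELDS = (
    "title", "shortname", "description", "FocusAreas", "GeographicalRegions",
    "GeophysicalConcepts", "MeasurementTypes", "Platforms", "Instruments",
    "Campaigns",
)


def extend_collection(umm_record: dict, casei_values: dict) -> dict:
    """Return a copy of a UMM-like record with a CollectionMetadata block added.

    Existing UMM keys are kept as-is. Unknown CASEI keys are rejected so that
    typos do not silently end up in the output.
    """
    unknown = set(casei_values) - set(CASEI_FIELDS)
    if unknown:
        raise KeyError(f"Unexpected CASEI fields: {sorted(unknown)}")
    extended = deepcopy(umm_record)  # never mutate the caller's record
    extended["CollectionMetadata"] = {
        field: casei_values[field]
        for field in CASEI_FIELDS
        if field in casei_values
    }
    return extended


if __name__ == "__main__":
    # Placeholder values only, mirroring the style of the object above.
    record = {"SpatialExtent": "spatialExtentDetails", "DOI": "doiDetails"}
    print(extend_collection(record, {"Campaigns": "campaignDetails"}))
```

Returning a copy rather than editing in place keeps the original record available for comparison, which seems useful while the two models are still being reconciled.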

naomatheus commented 1 year ago

Here's a more visual comparison between the two models also.

| UMM metadata field (CMR) | Equivalent CASEI GraphQL query |
| --- | --- |
| CollectionCitations | allCampaign (nodes -> long_name, short_name) |
| SpatialExtent | allGeographicalRegion (nodes -> id, shortname: short_name, example) |
| CollectionProgress | Not available |
| ScienceKeywords | allMeasurementType (nodes -> id, shortname: short_name, longname: long_name) |
| TemporalExtents | Not available |
| ProcessingLevel | Not available |
| DOI | Not available |
| ShortName | site (siteMetadata -> shortname) |
| EntryTitle | site (siteMetadata -> title) |
| DirectDistributionInformation | Not available |
| RelatedUrls | Not available |
| DataDates | Not available |
| Abstract | site (siteMetadata -> description) |
| LocationKeywords | allGeophysicalConcept (nodes -> id, longname: long_name) |
| MetadataDates | Not available |
| Version | Not available |
| Projects | Not available |
| UseConstraints | Not available |
| DataCenters | Not available |
| Platforms | allPlatform (nodes -> long_name, short_name) |
| MetadataSpecification | Not available |
| ArchiveAndDistributionInformation | Not available |

Note that this table, and the feature comparison in the next comment, simplify the many features of each model to the ones directly relevant to the question in this issue. The "Required Changes" column in that comparison suggests additions to the CMR data model to accommodate CASEI's unique structure and search functionality.

It is important to note that these changes would not remove or alter the existing structure of the CMR data model but instead extend it to accommodate CASEI-specific characteristics. However, these changes would require careful implementation to fit into the existing system without causing disruption or performance issues.
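
For anyone who wants to work with the UMM-to-CASEI comparison programmatically, here is a minimal sketch that transcribes that table into a dictionary and reports which UMM fields currently have no CASEI counterpart. The names `UMM_TO_CASEI` and `coverage` are made up for this example; the entries themselves are copied from the table.

```python
# UMM field -> CASEI GraphQL query, transcribed from the comparison table.
# None means the table lists the field as "Not available" in CASEI.
UMM_TO_CASEI = {
    "CollectionCitations": "allCampaign",
    "SpatialExtent": "allGeographicalRegion",
    "CollectionProgress": None,
    "ScienceKeywords": "allMeasurementType",
    "TemporalExtents": None,
    "ProcessingLevel": None,
    "DOI": None,
    "ShortName": "site",
    "EntryTitle": "site",
    "DirectDistributionInformation": None,
    "RelatedUrls": None,
    "DataDates": None,
    "Abstract": "site",
    "LocationKeywords": "allGeophysicalConcept",
    "MetadataDates": None,
    "Version": None,
    "Projects": None,
    "UseConstraints": None,
    "DataCenters": None,
    "Platforms": "allPlatform",
    "MetadataSpecification": None,
    "ArchiveAndDistributionInformation": None,
}


def coverage(mapping: dict) -> tuple:
    """Split UMM field names into (mapped, unmapped) lists, each sorted."""
    mapped = sorted(field for field, query in mapping.items() if query is not None)
    unmapped = sorted(field for field, query in mapping.items() if query is None)
    return mapped, unmapped


if __name__ == "__main__":
    mapped, unmapped = coverage(UMM_TO_CASEI)
    print(f"{len(mapped)} UMM fields have a CASEI query, {len(unmapped)} do not")
```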

naomatheus commented 1 year ago

| Feature | CMR | CASEI | Required Changes |
| --- | --- | --- | --- |
| Title | Part of CollectionCitations | Top-level siteMetadata.title | Add top-level 'title' field in CMR collection metadata |
| Short Name | Part of the dataset | Top-level siteMetadata.shortname | Add top-level 'shortname' field in CMR collection metadata |
| Description | Part of Abstract | Top-level siteMetadata.description | Add top-level 'description' field in CMR collection metadata |
| Focus Areas | Not included | Top-level allFocusArea | Add new top-level 'FocusAreas' field in CMR collection metadata |
| Geographical Regions | Included in SpatialExtent | Top-level allGeographicalRegion | Add new top-level 'GeographicalRegions' field in CMR collection metadata |
| Geophysical Concepts | Not included | Top-level allGeophysicalConcept | Add new top-level 'GeophysicalConcepts' field in CMR collection metadata |
| Measurement Types | Not included | Top-level allMeasurementType | Add new top-level 'MeasurementTypes' field in CMR collection metadata |
| Platforms | Included in Platforms | Top-level allPlatform | Maintain 'Platforms' field but accommodate for CASEI specific platform details |
| Instruments | May be included at Platforms.Instruments.ShortName/LongName* | Top-level allInstrument | Add new top-level 'Instruments' field in CMR collection metadata |
| Campaigns | Not included | Top-level allCampaign | Add new top-level 'Campaigns' field in CMR collection metadata |
| Spatial Extent | Included in SpatialExtent | Not Included* Calculated as Spatial Bounds | Maintain 'SpatialExtent' field in CMR |
| Temporal Extents | Included in TemporalExtents | Not Included* Calculated as Spatial Bounds | Maintain 'TemporalExtents' field in CMR |
| Science Keywords | Included in ScienceKeywords | Not included | Maintain 'ScienceKeywords' field in CMR |
| DOI | Included in DOI | Not included | Maintain 'DOI' field in CMR |

naomatheus commented 1 year ago

CASEI CMR Exploration Summary:

Repository: NASA-IMPACT/casei-cmr-explorer
Objective: To investigate CMR's metadata for relevance to CASEI.
Tools Used: Primarily the earthaccess Python library, chosen over the cmr library as the interface for querying CMR because of its better usability.
Main Observations:
- The cmr library wasn't very user-friendly.
- Discrepancies were observed between the short_names of campaigns in CASEI and those in CMR, indicating the need for a parity-establishing mechanism.
- Some of the CASEI short names did not correspond to any attributes within CMR.
- It isn't clear which attributes CMR's keyword search matches against.
Results:
- Successful matches for several campaign short names.
- Some did not return results, e.g., 'MIZOPEX', 'AMISA', and several others.
For a comprehensive look, the Jupyter notebook in the repository provides detailed code, results, and visualizations.
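
The matching exercise summarized above can be sketched roughly as follows. This is not the notebook's code: it uses the plain CMR HTTP search endpoint rather than earthaccess, and it assumes the public `collections.json` endpoint with its `keyword` and `page_size` parameters and the `CMR-Hits` response header. The helper names are hypothetical, and `EXAMPLE_CAMPAIGN` in the demo is a made-up short name.

```python
from typing import Callable, Iterable
from urllib.parse import urlencode
from urllib.request import urlopen

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_keyword_url(short_name: str, page_size: int = 10) -> str:
    """Build a CMR collection keyword search URL for one campaign short name."""
    query = urlencode({"keyword": short_name, "page_size": page_size})
    return f"{CMR_COLLECTIONS_URL}?{query}"


def count_hits_live(short_name: str) -> int:
    """Ask CMR how many collections match. Requires network access."""
    with urlopen(build_keyword_url(short_name, page_size=1), timeout=30) as response:
        # Assumption: CMR reports the total match count in the CMR-Hits header.
        return int(response.headers.get("CMR-Hits", 0))


def split_by_match(
    short_names: Iterable[str], count_hits: Callable[[str], int]
) -> tuple:
    """Return (matched, unmatched) short names according to count_hits."""
    matched, unmatched = [], []
    for name in short_names:
        (matched if count_hits(name) > 0 else unmatched).append(name)
    return matched, unmatched


if __name__ == "__main__":
    # Offline demo with canned counts; swap in count_hits_live to query CMR.
    canned = {"MIZOPEX": 0, "AMISA": 0, "EXAMPLE_CAMPAIGN": 3}
    print(split_by_match(canned, lambda name: canned[name]))
```

Passing the hit counter in as a function keeps the comparison logic testable without network access.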

edkeeble commented 1 year ago

@naomatheus this looks like an excellent initial investigation into the CMR and how its data models relate to the CASEI data models. The notebook is a great touch. I have minimal knowledge of the CMR and am still not clear on what a Collection actually represents. For example, you mentioned storing platforms, instruments and campaigns within the CollectionMetadata, but in that case would we be dealing with a single Collection representing all CASEI data or is a Collection more like a DOI and we would be querying multiple Collections in order to build the CASEI frontend?

I wouldn't worry too much about whether the data in CMR currently matches up with CASEI. If we did use CMR as a backend, finding a way to publish new CASEI data to CMR would be part of that effort. In terms of next steps, I would focus on what would be required to practically use CMR as a backend for building the CASEI frontend:

Is there a practical way to query CMR for all collections, instruments and platforms? For example, searching on short name or any property unique to each collection won't work, because we will need a separate list of collections. We would need to be able to execute a search like {type: "collection"}.
Is it possible to store campaign, instrument and platform data as first class objects within the CMR? It's fine to store them as metadata on other objects, but that would require significant data duplication and make the process of getting a list of unique campaigns expensive and maybe completely impractical.
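
To make the cost concern in the second question concrete, here is a minimal sketch of what deriving unique platforms, instruments, and campaigns could look like if they only exist as nested metadata on collections. It assumes UMM-style records with `Platforms[].ShortName`, `Platforms[].Instruments[].ShortName`, and `Projects[].ShortName`, and it treats CMR "Projects" as the closest analogue of CASEI campaigns, which is an assumption. The function name is hypothetical.

```python
from typing import Iterable


def collect_unique_entities(umm_records: Iterable[dict]) -> dict:
    """Gather unique platform, instrument, and project short names.

    Every record has to be visited, which is why the cost grows with the
    number of collections when these entities are not first-class objects.
    """
    found = {"platforms": set(), "instruments": set(), "projects": set()}
    for record in umm_records:
        for platform in record.get("Platforms", []):
            if "ShortName" in platform:
                found["platforms"].add(platform["ShortName"])
            for instrument in platform.get("Instruments", []):
                if "ShortName" in instrument:
                    found["instruments"].add(instrument["ShortName"])
        for project in record.get("Projects", []):
            if "ShortName" in project:
                found["projects"].add(project["ShortName"])
    return {kind: sorted(names) for kind, names in found.items()}


if __name__ == "__main__":
    # Made-up records: the same platform and instrument repeat across collections.
    records = [
        {
            "Platforms": [{"ShortName": "PLATFORM-A",
                           "Instruments": [{"ShortName": "INSTRUMENT-1"}]}],
            "Projects": [{"ShortName": "CAMPAIGN-X"}],
        },
        {
            "Platforms": [{"ShortName": "PLATFORM-A",
                           "Instruments": [{"ShortName": "INSTRUMENT-1"},
                                           {"ShortName": "INSTRUMENT-2"}]}],
        },
    ]
    print(collect_unique_entities(records))
```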

naomatheus commented 1 year ago

Regarding Q1

Query CMR for all collections, instruments and platforms

@edkeeble Yeah, ideally there is. I haven't seen that in the documentation yet and am still looking into it. So far I've had to use existing data in CASEI to create search parameters/keyword terms, which means we first need a list of "items being searched for."

Regarding

Query CMR for all collections, instruments and platforms

@praveenphatate made this document going over some of the CMR direct curl/http queries.

naomatheus commented 1 year ago

Notebook listed above has been updated

naomatheus commented 1 year ago

Made some further updates to this exploration @edkeeble and logged them here Repository: NASA-IMPACT/casei-cmr-explorer

Please reach out to @praveenphatate if I am OOO. @praveenphatate is a great resource who knows where we went with this exploration of CMR for CASEI.

Regarding your other questions @edkeeble: Question 1 does not seem feasible in our opinion, because it would require filtering all collections on many different attributes which may or may not be present in the JSON. As for Question 2, we weren't sure how to get an answer.

Short summary

The JSON version of the collection's location page does not contain the same parameters: in the example https://cmr.earthdata.nasa.gov:443/search/concepts/C1977826980-GHRC_DAAC.json, the Instruments tag is completely missing, as are other subtags.

That being said, XML formats returned from CMR are fully comprehensive.

The XML format can be retrieved from the Location XML tag of each collection. Each collection has N collection data Locations. These Locations then hold the XML metadata that CASEI needs, including nearly all of the information that can be found on CASEI's individual Campaign, Platform, and Instrument pages.
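
To compare the JSON and XML representations side by side, a small helper like the one below could generate the concept URLs to fetch. It is only a sketch: the list of format extensions is an assumption based on the `.json` example above and should be checked against the CMR search API documentation, and `concept_urls` is a hypothetical name.

```python
CMR_CONCEPTS_URL = "https://cmr.earthdata.nasa.gov/search/concepts"

# Assumption: these extensions are accepted by the concept endpoint.
# "json" is the one used in the example above; the others are XML or UMM
# variants to compare against it.
FORMATS = ("json", "umm_json", "echo10", "native")


def concept_urls(concept_id: str, formats: tuple = FORMATS) -> dict:
    """Map each format name to the URL of that representation of a concept."""
    return {fmt: f"{CMR_CONCEPTS_URL}/{concept_id}.{fmt}" for fmt in formats}


if __name__ == "__main__":
    for fmt, url in concept_urls("C1977826980-GHRC_DAAC").items():
        print(f"{fmt:>8}: {url}")
```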

We believe a "reference document" needs to be maintained for CASEI in order to query CMR efficiently. Filtering from all collections does not seem feasible. The "reference document" should essentially be a list of platforms, instruments, campaigns, and projects that will be viewable in CASEI. It can be maintained and updated in a very minimal form, where it is simply used to query CMR either at CASEI's build time in CI or periodically, if there is a static database of CMR-to-CASEI interfaces.

The solutions described here do not require storing of JSON or XML objects to maintain CASEI. CASEI's data requirements can be satisfied "on the fly" within a CI/CD build script.
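
A minimal sketch of how such a "reference document" could drive queries at build time is shown below. The document shape, the names `REFERENCE_DOCUMENT`, `PARAM_FOR_KIND`, and `build_query_plan`, and the example platform and instrument entries are all hypothetical. It assumes CMR's collection search accepts `project`, `platform`, and `instrument` parameters and that CASEI campaigns map to CMR projects; the sketch only builds the URLs and does not fetch anything.

```python
from urllib.parse import urlencode

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.umm_json"

# Hypothetical minimal reference document: just the names CASEI wants to show.
REFERENCE_DOCUMENT = {
    "campaigns": ["MIZOPEX", "AMISA"],
    "platforms": ["EXAMPLE-PLATFORM"],      # made-up entry
    "instruments": ["EXAMPLE-INSTRUMENT"],  # made-up entry
}

# Assumption: which CMR search parameter corresponds to each CASEI entity kind.
PARAM_FOR_KIND = {
    "campaigns": "project",
    "platforms": "platform",
    "instruments": "instrument",
}


def build_query_plan(reference: dict) -> list:
    """Return (kind, name, url) tuples, one CMR search per reference entry."""
    plan = []
    for kind, names in reference.items():
        param = PARAM_FOR_KIND[kind]
        for name in names:
            url = f"{CMR_COLLECTIONS_URL}?{urlencode({param: name})}"
            plan.append((kind, name, url))
    return plan


if __name__ == "__main__":
    for kind, name, url in build_query_plan(REFERENCE_DOCUMENT):
        print(f"{kind:<12} {name:<20} {url}")
```

A CI build script could iterate over this plan and fetch each URL, so nothing beyond the reference document itself would need to be stored.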

heidimok commented 1 year ago

I'm just adding a note to close out this technical exploration.

Can we actually build CASEI using CMR as a backend?

Seems like the answer was yes, but in a roundabout way.
Outputs include the casei-cmr-explorer repository and the document going over some of the CMR direct curl/http queries.

Follow ups

I think this issue fits into a broader technical discovery that I'll create an EPIC for, which would ideally help us answer some additional questions such as:

- Given what we've learned about what's possible/not today with using CMR as the backend...
  - Would we need to build something new on top of CMR to support unique metadata for CASEI? Should we prototype anything? https://github.com/NASA-IMPACT/admg-backend/issues/543
  - What might this mean for the curation experience - how could metadata be added to CMR - is it still through our MI or something new?
  - What might future maintenance look like?
- What does this mean for the existing system - what needs to change? Or does it become irrelevant?
  - Should we invest development time into refactoring any parts of the existing system? https://github.com/NASA-IMPACT/admg-backend/issues/529

Generally, how can we document all these loose thoughts and recommendations from the current technical team in a helpful and simple way that makes it easy for both the ADMG leadership and ESDS developers to make decisions about what the future of CASEI is going to look like?