NASA-IMPACT / admg-backend

Apache License 2.0

Technical discovery about CMR to help with ESDS transition #539

Closed heidimok closed 1 year ago

heidimok commented 1 year ago

Context

ADMG transition/hand-off talks with ESDS have started, but there is still uncertainty around both the timeline and the future vision for CASEI after handing off to ESDS, which makes it hard for us to prepare for that future. Ideally, ESDS would like CASEI's metadata to come from CMR instead of our backend so that they can maintain one system. However, the shape of CASEI data may or may not match that of CMR, so some details will not become clear until time is spent really understanding both systems.

Problem

We are familiar with our system, but not CMR, which makes it hard to have productive conversations.

Acceptance Criteria

Light activity: DevSeed to build up our CMR knowledge to help support discussions. Be aware that CASEI is about airborne data, whereas CMR is geared towards satellites.

These conclusions can be placed in a document for now, but are likely to be included later in the shared wiki with ESDS. https://github.com/NASA-IMPACT/admg-backend/issues/538

naomatheus commented 1 year ago

The critical difference between CASEI and CMR data models is how they structure their metadata for search.

In CMR, the search is oriented around dataset characteristics, with the dataset's metadata serving as the primary entry point. Metadata includes spatial extent, temporal extents, science keywords, and DOIs. In contrast, CASEI's data model begins with higher-level features like focus area, geographical region, platform, instrument, etc., before drilling down into more specific dataset characteristics. In other words, CMR's search starts with the "what," while CASEI begins with the "how" or "where."

There is a sort of "inverse" relationship between the two data structures/systems. The challenge in integrating CASEI's data into CMR involves finding a way to flip this relationship while still maintaining the ability to search by the unique features of both databases.

Given that many of the same datasets can be found in both CMR and CASEI, we could adjust the earlier CMR Collection metadata object to include CASEI-specific features at the top level, similar to how CASEI structures its data:

{
    "CollectionMetadata": {
        "title": "titleDetails",
        "shortname": "shortNameDetails",
        "description": "descriptionDetails",
        "FocusAreas": "focusAreaDetails",
        "GeographicalRegions": "geographicalRegionDetails",
        "GeophysicalConcepts": "geophysicalConceptDetails",
        "MeasurementTypes": "measurementTypeDetails",
        "Platforms": "platformDetails",
        "Instruments": "instrumentDetails",
        "Campaigns": "campaignDetails"
    },
    "CollectionCitations": "citationDetails",
    "SpatialExtent": "spatialExtentDetails",
    // ... rest of the UMM metadata markers
}
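
As a rough illustration of the additive change proposed above, here is a minimal Python sketch. The helper `extend_collection`, the `CASEI_KEYS` tuple, and the sample values are hypothetical; only `EntryTitle`, `ShortName`, and `Abstract` are taken from the UMM markers listed in this thread. It is not an implementation of any CMR API.

```python
import copy

# CASEI-oriented keys from the proposal above. Treat them as hypothetical
# additions; this thread does not establish that CMR would accept them.
CASEI_KEYS = (
    "FocusAreas", "GeographicalRegions", "GeophysicalConcepts",
    "MeasurementTypes", "Platforms", "Instruments", "Campaigns",
)

def extend_collection(umm_record, casei_fields):
    """Return a copy of umm_record with a CollectionMetadata block added.

    Existing keys are left untouched, so the change is purely additive.
    """
    extended = copy.deepcopy(umm_record)
    block = {
        "title": umm_record.get("EntryTitle"),
        "shortname": umm_record.get("ShortName"),
        "description": umm_record.get("Abstract"),
    }
    # Copy only the CASEI fields that were actually supplied.
    for key in CASEI_KEYS:
        if key in casei_fields:
            block[key] = casei_fields[key]
    extended["CollectionMetadata"] = block
    return extended

# Made-up sample data, for illustration only.
umm = {"ShortName": "EXAMPLE", "EntryTitle": "Example collection", "Abstract": "Demo record"}
casei = {"FocusAreas": ["Weather"], "Campaigns": ["EXAMPLE-CAMPAIGN"]}
extended = extend_collection(umm, casei)
```

Because the original record is deep-copied and no existing key is overwritten, the sketch mirrors the intent of extending, not altering, the CMR structure.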
naomatheus commented 1 year ago

Here's a more visual comparison between the two models also.

| UMM Unique Metadata Markers (CMR) | Equivalent CASEI GraphQL Query |
| --- | --- |
| CollectionCitations | allCampaign (nodes -> long_name, short_name) |
| SpatialExtent | allGeographicalRegion (nodes -> id, shortname: short_name, example) |
| CollectionProgress | Not available |
| ScienceKeywords | allMeasurementType (nodes -> id, shortname: short_name, longname: long_name) |
| TemporalExtents | Not available |
| ProcessingLevel | Not available |
| DOI | Not available |
| ShortName | site (siteMetadata -> shortname) |
| EntryTitle | site (siteMetadata -> title) |
| DirectDistributionInformation | Not available |
| RelatedUrls | Not available |
| DataDates | Not available |
| Abstract | site (siteMetadata -> description) |
| LocationKeywords | allGeophysicalConcept (nodes -> id, longname: long_name) |
| MetadataDates | Not available |
| Version | Not available |
| Projects | Not available |
| UseConstraints | Not available |
| DataCenters | Not available |
| Platforms | allPlatform (nodes -> long_name, short_name) |
| MetadataSpecification | Not available |
| ArchiveAndDistributionInformation | Not available |

Note that these comparisons reduce the many features of each model to the ones directly relevant to this question. The "Required Changes" column in the feature table in the next comment suggests additions to the CMR data model to accommodate CASEI's unique structure and search functionality.

It is important to note that these changes would not remove or alter the existing structure of CMR data model but instead extend it to accommodate CASEI-specific characteristics. However, these changes would require careful implementation to fit into the existing system without causing disruption or performance issues.

naomatheus commented 1 year ago

| Feature | CMR | CASEI | Required Changes |
| --- | --- | --- | --- |
| Title | Part of CollectionCitations | Top-level siteMetadata.title | Add top-level 'title' field in CMR collection metadata |
| Short Name | Part of the dataset | Top-level siteMetadata.shortname | Add top-level 'shortname' field in CMR collection metadata |
| Description | Part of Abstract | Top-level siteMetadata.description | Add top-level 'description' field in CMR collection metadata |
| Focus Areas | Not included | Top-level allFocusArea | Add new top-level 'FocusAreas' field in CMR collection metadata |
| Geographical Regions | Included in SpatialExtent | Top-level allGeographicalRegion | Add new top-level 'GeographicalRegions' field in CMR collection metadata |
| Geophysical Concepts | Not included | Top-level allGeophysicalConcept | Add new top-level 'GeophysicalConcepts' field in CMR collection metadata |
| Measurement Types | Not included | Top-level allMeasurementType | Add new top-level 'MeasurementTypes' field in CMR collection metadata |
| Platforms | Included in Platforms | Top-level allPlatform | Maintain 'Platforms' field but accommodate for CASEI specific platform details |
| Instruments | May be included at Platforms.Instruments.ShortName/LongName* | Top-level allInstrument | Add new top-level 'Instruments' field in CMR collection metadata |
| Campaigns | Not included | Top-level allCampaign | Add new top-level 'Campaigns' field in CMR collection metadata |
| Spatial Extent | Included in SpatialExtent | Not Included* Calculated as Spatial Bounds | Maintain 'SpatialExtent' field in CMR |
| Temporal Extents | Included in TemporalExtents | Not Included* Calculated as Spatial Bounds | Maintain 'TemporalExtents' field in CMR |
| Science Keywords | Included in ScienceKeywords | Not included | Maintain 'ScienceKeywords' field in CMR |
| DOI | Included in DOI | Not included | Maintain 'DOI' field in CMR |

naomatheus commented 1 year ago

CASEI CMR Exploration Summary:


edkeeble commented 1 year ago

@naomatheus this looks like an excellent initial investigation into the CMR and how its data models relate to the CASEI data models. The notebook is a great touch. I have minimal knowledge of the CMR and am still not clear on what a Collection actually represents. For example, you mentioned storing platforms, instruments and campaigns within the CollectionMetadata, but in that case would we be dealing with a single Collection representing all CASEI data or is a Collection more like a DOI and we would be querying multiple Collections in order to build the CASEI frontend?

I wouldn't worry too much about whether the data in CMR currently matches up with CASEI. If we did use CMR as a backend, finding a way to publish new CASEI data to CMR would be part of that effort. In terms of next steps, I would focus on what would be required to practically use CMR as a backend for building the CASEI frontend:

  1. Is there a practical way to query CMR for all collections, instruments and platforms? For example, searching on short name or any property unique to each collection won't work, because we would first need a separate list of collections. We would need to be able to execute a search like {type: "collection"}.
  2. Is it possible to store campaign, instrument and platform data as first class objects within the CMR? It's fine to store them as metadata on other objects, but that would require significant data duplication and make the process of getting a list of unique campaigns expensive and maybe completely impractical.
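
To make question 1 concrete, one thing to try is paging through CMR's collection search without collection-specific filters. This is a minimal Python sketch, assuming the public `collections.json` search endpoint and its `page_size` and `page_num` parameters. The helper names, the injectable `fetch` argument, and the `platforms` key in the usage note are illustrative assumptions. Whether the JSON entries carry the fields CASEI needs would still have to be checked.

```python
import json
import urllib.parse
import urllib.request

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"

def http_get_json(url):
    """Default fetcher: plain HTTP GET returning parsed JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def iter_collections(params=None, page_size=100, max_pages=5, fetch=http_get_json):
    """Yield collection entries page by page until a short page is returned.

    `fetch` is injectable so the paging logic can be exercised offline.
    """
    for page_num in range(1, max_pages + 1):
        query = dict(params or {}, page_size=page_size, page_num=page_num)
        url = CMR_COLLECTIONS_URL + "?" + urllib.parse.urlencode(query)
        entries = fetch(url).get("feed", {}).get("entry", [])
        yield from entries
        if len(entries) < page_size:
            break  # last page reached

def unique_values(entries, key):
    """Collect distinct values of a list-valued field, where present."""
    values = set()
    for entry in entries:
        values.update(entry.get(key, []))
    return sorted(values)
```

Usage would look like `unique_values(iter_collections({"provider": "GHRC_DAAC"}), "platforms")`. Deriving unique campaigns, platforms, or instruments this way means scanning every page, which is the cost question 2 is concerned about.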
naomatheus commented 1 year ago

Regarding Q1

Query CMR for all collections, instruments and platforms

@edkeeble Yes, ideally there is, but I haven't found it in the documentation yet and am still looking into it. So far I've had to use existing CASEI data to build search parameters and keyword terms, which means we first need a list of the items being searched for.

Regarding

Query CMR for all collections, instruments and platforms

@praveenphatate made this document going over some of the CMR direct curl/http queries.

naomatheus commented 1 year ago

Notebook listed above has been updated

naomatheus commented 1 year ago

Made some further updates to this exploration @edkeeble and logged them here Repository: NASA-IMPACT/casei-cmr-explorer

Please reach out to @praveenphatate if I am OOO. @praveenphatate is a great resource who knows where we went with this exploration of CMR for CASEI.

Regarding your other questions, @edkeeble: Question 1 does not seem feasible in our opinion, because it would require filtering all collections on many different attributes that may or may not be present in the JSON. As for Question 2, we weren't sure how to get an answer.

Short summary

JSON version of the collection's location page does not contain the same parameters - in the example https://cmr.earthdata.nasa.gov:443/search/concepts/C1977826980-GHRC_DAAC.json, the Instruments tag is completely missing, as are other subtags.

That being said, XML formats returned from CMR are fully comprehensive.

The XML format can be retrieved via a collection's Location XML tag. Each collection has N data Locations. These Locations contain the XML metadata that CASEI needs, including nearly all of the information found on CASEI's individual Campaign, Platform, and Instrument pages.
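
As an illustration of pulling CASEI-style fields out of such XML, here is a minimal Python sketch. The sample document is made up and only loosely shaped like an ECHO10-style collection record. The element names and the `extract_casei_fields` helper are assumptions that would need checking against the XML CMR actually returns.

```python
import xml.etree.ElementTree as ET

# Illustrative XML only: element names are assumed, values are made up.
SAMPLE_XML = """
<Collection>
  <ShortName>example_collection</ShortName>
  <Campaigns>
    <Campaign><ShortName>EXAMPLE-CAMPAIGN</ShortName></Campaign>
  </Campaigns>
  <Platforms>
    <Platform>
      <ShortName>ER-2</ShortName>
      <Instruments>
        <Instrument><ShortName>EXAMPLE-RADAR</ShortName></Instrument>
        <Instrument><ShortName>EXAMPLE-LIDAR</ShortName></Instrument>
      </Instruments>
    </Platform>
  </Platforms>
</Collection>
"""

def extract_casei_fields(xml_text):
    """Return campaigns and a platform -> instruments mapping from one record."""
    root = ET.fromstring(xml_text.strip())
    platforms = {}
    for platform in root.iter("Platform"):
        name = platform.findtext("ShortName")
        # Instruments are nested under their platform in this assumed layout.
        platforms[name] = [i.findtext("ShortName") for i in platform.iter("Instrument")]
    return {
        "short_name": root.findtext("ShortName"),
        "campaigns": [c.findtext("ShortName") for c in root.iter("Campaign")],
        "platforms": platforms,
    }

fields = extract_casei_fields(SAMPLE_XML)
```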

We believe a "reference document" must be maintained for CASEI in order to query CMR efficiently. Filtering from all collections does not seem feasible. The "reference document" should essentially be a list of the platforms, instruments, campaigns, and projects that will be viewable in CASEI. It can be maintained and updated in a very minimal form, where it is simply used to query CMR either at CASEI's build time in CI or periodically, if there is a static database of CMR-to-CASEI interfaces.

The solutions described here do not require storing of JSON or XML objects to maintain CASEI. CASEI's data requirements can be satisfied "on the fly" within a CI/CD build script.
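
The "reference document" idea could be sketched as follows in Python. The `REFERENCE` contents, the `SECTION_TO_PARAM` mapping, and the `build_queries` helper are hypothetical. The sketch assumes the CMR collection search accepts `project`, `platform`, and `instrument` as query parameters, and it only builds the query URLs that a build script would then fetch.

```python
import urllib.parse

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"

# Hypothetical reference document: the items CASEI wants to display.
REFERENCE = {
    "campaigns": ["EXAMPLE-CAMPAIGN"],
    "platforms": ["ER-2"],
    "instruments": ["EXAMPLE-RADAR"],
}

# Assumed mapping from reference sections to CMR search parameters.
SECTION_TO_PARAM = {
    "campaigns": "project",
    "platforms": "platform",
    "instruments": "instrument",
}

def build_queries(reference, page_size=50):
    """Return (section, name, url) tuples, one CMR query per referenced item."""
    queries = []
    for section, names in reference.items():
        param = SECTION_TO_PARAM[section]
        for name in names:
            query_string = urllib.parse.urlencode({param: name, "page_size": page_size})
            queries.append((section, name, f"{CMR_COLLECTIONS_URL}?{query_string}"))
    return queries

queries = build_queries(REFERENCE)
```

A CI step could iterate over `queries`, fetch each URL, and feed the responses into the site build, so nothing needs to be stored between builds.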

heidimok commented 1 year ago

I'm just adding a note to close out this technical exploration.

Can we actually build CASEI using CMR as a backend?

Follow ups

I think this issue fits into a broader technical discovery that I'll create an EPIC for, which would ideally help us answer some additional questions such as:

Generally, how can we document all these loose thoughts and recommendations from the current technical team in a helpful and simple way that makes it easy for both the ADMG leadership and ESDS developers to make decisions about what the future of CASEI is going to look like?