Closed heidimok closed 1 year ago
The critical difference between CASEI and CMR data models is how they structure their metadata for search.
In CMR, the search is oriented around dataset characteristics, with the dataset's metadata serving as the primary entry point. Metadata includes spatial extent, temporal extents, science keywords, and DOIs. In contrast, CASEI's data model begins with higher-level features like focus area, geographical region, platform, instrument, etc., before drilling down into more specific dataset characteristics. In other words, CMR's search starts with the "what," while CASEI begins with the "how" or "where."
There is a sort of "inverse" relationship between the two data structures/systems. The challenge in integrating CASEI's data into CMR involves finding a way to flip this relationship while still maintaining the ability to search by the unique features of both databases.
Given that many of the same datasets can be found in both CMR and CASEI, we could adjust the earlier CMR Collection metadata object to include CASEI-specific features at the top level, similar to how CASEI structures its data:
{
  "CollectionMetadata": {
    "title": "titleDetails",
    "shortname": "shortNameDetails",
    "description": "descriptionDetails",
    "FocusAreas": "focusAreaDetails",
    "GeographicalRegions": "geographicalRegionDetails",
    "GeophysicalConcepts": "geophysicalConceptDetails",
    "MeasurementTypes": "measurementTypeDetails",
    "Platforms": "platformDetails",
    "Instruments": "instrumentDetails",
    "Campaigns": "campaignDetails"
  },
  "CollectionCitations": "citationDetails",
  "SpatialExtent": "spatialExtentDetails",
  // ... rest of the UMM metadata markers
}
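As a rough illustration of the object above, here is a minimal Python sketch that adds a `CollectionMetadata` block to an existing UMM-style record without touching its other fields. The function name `extend_collection`, the `CASEI_KEYS` tuple, and the sample values are hypothetical; mapping `title`, `shortname`, and `description` from `EntryTitle`, `ShortName`, and `Abstract` follows the comparison table in this comment and is an assumption, not an existing CMR feature.

```python
from copy import deepcopy

# Hypothetical top-level CASEI keys proposed above (not part of the UMM today).
CASEI_KEYS = (
    "FocusAreas",
    "GeographicalRegions",
    "GeophysicalConcepts",
    "MeasurementTypes",
    "Platforms",
    "Instruments",
    "Campaigns",
)


def extend_collection(umm_record: dict, casei_features: dict) -> dict:
    """Return a copy of umm_record with a CASEI-style CollectionMetadata block."""
    extended = deepcopy(umm_record)  # leave the original record unchanged
    extended["CollectionMetadata"] = {
        "title": umm_record.get("EntryTitle"),
        "shortname": umm_record.get("ShortName"),
        "description": umm_record.get("Abstract"),
        # Missing CASEI features default to an empty list.
        **{key: casei_features.get(key, []) for key in CASEI_KEYS},
    }
    return extended


# Placeholder values for illustration only.
umm_record = {
    "EntryTitle": "Example collection",
    "ShortName": "example",
    "Abstract": "Placeholder abstract.",
    "SpatialExtent": {},
}
casei_features = {"Campaigns": ["EXAMPLE_CAMPAIGN"], "Platforms": ["EXAMPLE_PLATFORM"]}
extended = extend_collection(umm_record, casei_features)
```

Because the CASEI fields are only added, existing UMM fields such as `SpatialExtent` pass through unchanged.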
Here is a side-by-side comparison of the two models.
UMM Unique Metadata Markers (CMR) | Equivalent CASEI GraphQL Query |
---|---|
CollectionCitations | allCampaign (nodes -> long_name, short_name) |
SpatialExtent | allGeographicalRegion (nodes -> id, shortname: short_name, example) |
CollectionProgress | Not available |
ScienceKeywords | allMeasurementType (nodes -> id, shortname: short_name, longname: long_name) |
TemporalExtents | Not available |
ProcessingLevel | Not available |
DOI | Not available |
ShortName | site (siteMetadata -> shortname) |
EntryTitle | site (siteMetadata -> title) |
DirectDistributionInformation | Not available |
RelatedUrls | Not available |
DataDates | Not available |
Abstract | site (siteMetadata -> description) |
LocationKeywords | allGeophysicalConcept (nodes -> id, longname: long_name) |
MetadataDates | Not available |
Version | Not available |
Projects | Not available |
UseConstraints | Not available |
DataCenters | Not available |
Platforms | allPlatform (nodes -> long_name, short_name) |
MetadataSpecification | Not available |
ArchiveAndDistributionInformation | Not available |
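The table above can also be held as a plain lookup, which makes it easy to see how much of the UMM has no CASEI counterpart. This is a sketch only: the dictionary simply transcribes the table (with `None` standing in for "Not available"), and the helper name `split_by_coverage` is hypothetical.

```python
# UMM field -> CASEI GraphQL query, transcribed from the table above.
# None stands in for "Not available".
UMM_TO_CASEI = {
    "CollectionCitations": "allCampaign",
    "SpatialExtent": "allGeographicalRegion",
    "CollectionProgress": None,
    "ScienceKeywords": "allMeasurementType",
    "TemporalExtents": None,
    "ProcessingLevel": None,
    "DOI": None,
    "ShortName": "site",
    "EntryTitle": "site",
    "DirectDistributionInformation": None,
    "RelatedUrls": None,
    "DataDates": None,
    "Abstract": "site",
    "LocationKeywords": "allGeophysicalConcept",
    "MetadataDates": None,
    "Version": None,
    "Projects": None,
    "UseConstraints": None,
    "DataCenters": None,
    "Platforms": "allPlatform",
    "MetadataSpecification": None,
    "ArchiveAndDistributionInformation": None,
}


def split_by_coverage(mapping: dict) -> tuple[list, list]:
    """Split UMM fields into those with and without a CASEI equivalent."""
    mapped = [field for field, query in mapping.items() if query is not None]
    unmapped = [field for field, query in mapping.items() if query is None]
    return mapped, unmapped


mapped, unmapped = split_by_coverage(UMM_TO_CASEI)
```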
Note that the table below reduces the many features of each model to the ones directly relevant to this comparison. The "Required Changes" column suggests additions to the CMR data model to accommodate CASEI's unique structure and search functionality.
It is important to note that these changes would not remove or alter the existing structure of the CMR data model but would instead extend it to accommodate CASEI-specific characteristics. However, these changes would require careful implementation to fit into the existing system without causing disruption or performance issues.
Feature | CMR | CASEI | Required Changes |
---|---|---|---|
Title | Part of CollectionCitations | Top-level siteMetadata.title | Add top-level 'title' field in CMR collection metadata |
Short Name | Part of the dataset | Top-level siteMetadata.shortname | Add top-level 'shortname' field in CMR collection metadata |
Description | Part of Abstract | Top-level siteMetadata.description | Add top-level 'description' field in CMR collection metadata |
Focus Areas | Not included | Top-level allFocusArea | Add new top-level 'FocusAreas' field in CMR collection metadata |
Geographical Regions | Included in SpatialExtent | Top-level allGeographicalRegion | Add new top-level 'GeographicalRegions' field in CMR collection metadata |
Geophysical Concepts | Not included | Top-level allGeophysicalConcept | Add new top-level 'GeophysicalConcepts' field in CMR collection metadata |
Measurement Types | Not included | Top-level allMeasurementType | Add new top-level 'MeasurementTypes' field in CMR collection metadata |
Platforms | Included in Platforms | Top-level allPlatform | Maintain 'Platforms' field but accommodate CASEI-specific platform details |
Instruments | May be included at Platforms.Instruments.ShortName/LongName* | Top-level allInstrument | Add new top-level 'Instruments' field in CMR collection metadata |
Campaigns | Not included | Top-level allCampaign | Add new top-level 'Campaigns' field in CMR collection metadata |
Spatial Extent | Included in SpatialExtent | Not Included* Calculated as Spatial Bounds | Maintain 'SpatialExtent' field in CMR |
Temporal Extents | Included in TemporalExtents | Not Included* Calculated as Spatial Bounds | Maintain 'TemporalExtents' field in CMR |
Science Keywords | Included in ScienceKeywords | Not included | Maintain 'ScienceKeywords' field in CMR |
DOI | Included in DOI | Not included | Maintain 'DOI' field in CMR |
CASEI CMR Exploration Summary:
- `earthaccess` Python library as an interface to query CMR over `cmr` due to better usability.
- [...] `short_name`s of campaigns in CASEI and those in the CMR, indicating the need for a parity-establishing mechanism.

@naomatheus this looks like an excellent initial investigation into the CMR and how its data models relate to the CASEI data models. The notebook is a great touch. I have minimal knowledge of the CMR and am still not clear on what a Collection actually represents. For example, you mentioned storing platforms, instruments and campaigns within the CollectionMetadata, but in that case would we be dealing with a single Collection representing all CASEI data, or is a Collection more like a DOI and we would be querying multiple Collections in order to build the CASEI frontend?
I wouldn't worry too much about whether the data in CMR currently matches up with CASEI. If we did use CMR as a backend, finding a way to publish new CASEI data to CMR would be part of that effort. In terms of next steps, I would focus on what would be required to practically use CMR as a backend for building the CASEI frontend:
{type: "collection"}.

Regarding Q1:
> Query CMR for all collections, instruments and platforms
@edkeeble Yea. Ideally there is, but I haven't seen that yet in the documentation and am still looking into this. So far I've had to use existing data in CASEI to create search parameters/keyword terms. For that we first need a list of "items being searched for."
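A minimal sketch of that approach, assuming the `earthaccess` library is installed: known CASEI short names are turned into keyword searches against CMR. The helper names and the `limit` default are hypothetical; `earthaccess.search_datasets` is called only with `keyword` and `count`, and whether keyword search is the best match for CASEI names is an open question.

```python
def build_keyword_searches(casei_short_names, limit=5):
    """Turn CASEI short names into keyword-search arguments, skipping blanks."""
    return [
        {"keyword": name.strip(), "count": limit}
        for name in casei_short_names
        if name and name.strip()
    ]


def search_cmr(searches):
    """Run each search against CMR; needs network access and earthaccess."""
    import earthaccess  # third-party; imported here so the helper above works without it

    return {
        search["keyword"]: earthaccess.search_datasets(**search)
        for search in searches
    }


# Placeholder names; in practice these would come from CASEI's own data.
searches = build_keyword_searches(["EXAMPLE_CAMPAIGN", "", "EXAMPLE_PLATFORM"], limit=2)
# results = search_cmr(searches)  # uncomment to query CMR
```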
Regarding
> Query CMR for all collections, instruments and platforms

@praveenphatate made this document going over some of the direct CMR curl/HTTP queries.
Notebook listed above has been updated
Made some further updates to this exploration @edkeeble and logged them here: NASA-IMPACT/casei-cmr-explorer
Please reach out to @praveenphatate if I am OOO. @praveenphatate is a great resource who knows where we went with this exploration of CMR for CASEI.
Regarding your other questions @edkeeble: in our opinion, Question 1 does not seem feasible, because it would require filtering all collections on many different attributes which may or may not be present in the JSON. As for Question 2, we weren't sure how to get an answer to that.
The JSON version of the collection's location page does not contain the same parameters. In the example https://cmr.earthdata.nasa.gov:443/search/concepts/C1977826980-GHRC_DAAC.json, the Instruments tag is completely missing, as are other subtags.
That being said, XML formats returned from CMR are fully comprehensive.
The XML format can be returned from a collection's Location XML tag. Each collection has N collection data Locations. These Locations then have the XML metadata that CASEI needs, including nearly all of the information that can be found on CASEI's individual Campaign, Platform, and Instrument pages.
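To make the XML route concrete, here is a small sketch that pulls campaign, platform, and instrument short names out of a collection's XML. It assumes an ECHO10-style layout (`Campaign`, `Platform`, and `Instrument` elements, each with a `ShortName` child, and no XML namespace) and assumes the `.echo10` extension on the concepts URL; the sample document and helper names are made up for illustration, and real records may differ.

```python
import xml.etree.ElementTree as ET

# Assumed URL pattern, based on the .json example above.
CMR_CONCEPT_URL = "https://cmr.earthdata.nasa.gov/search/concepts/{concept_id}.{fmt}"


def concept_url(concept_id: str, fmt: str = "echo10") -> str:
    """Build the URL for one collection in the requested format."""
    return CMR_CONCEPT_URL.format(concept_id=concept_id, fmt=fmt)


def short_names(xml_text: str, tag: str) -> list:
    """Collect the ShortName of every <tag> element, sorted and de-duplicated."""
    root = ET.fromstring(xml_text)
    names = {element.findtext("ShortName") for element in root.iter(tag)}
    return sorted(name for name in names if name)


# Made-up sample in the assumed layout.
SAMPLE_XML = """
<Collection>
  <Campaigns><Campaign><ShortName>CAMPAIGN_A</ShortName></Campaign></Campaigns>
  <Platforms>
    <Platform>
      <ShortName>PLATFORM_A</ShortName>
      <Instruments>
        <Instrument><ShortName>INSTRUMENT_A</ShortName></Instrument>
        <Instrument><ShortName>INSTRUMENT_B</ShortName></Instrument>
      </Instruments>
    </Platform>
  </Platforms>
</Collection>
"""

instruments = short_names(SAMPLE_XML, "Instrument")
platforms = short_names(SAMPLE_XML, "Platform")
campaigns = short_names(SAMPLE_XML, "Campaign")
```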
We believe that a "reference document" needs to be maintained for CASEI so that CMR can be queried efficiently. Filtering from all collections does not seem feasible. The "reference document" should essentially be a list of the platforms, instruments, campaigns, and projects that will be viewable in CASEI. It can be maintained and updated in a very minimal form, where it is simply used to query CMR either at CASEI's build time in CI or periodically if there is a static database of CMR to CASEI interfaces.
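One possible shape for such a reference document, sketched in Python: a small dictionary of names that is expanded into CMR collection-search URLs at build time. The `REFERENCE` contents are placeholders, the helper name is hypothetical, and treating CASEI campaigns as CMR `project` values is an assumption that would need to be checked against real records.

```python
from urllib.parse import urlencode

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"

# Hypothetical reference document: only the names CASEI wants to display.
REFERENCE = {
    "platforms": ["EXAMPLE_PLATFORM"],
    "instruments": ["EXAMPLE_INSTRUMENT"],
    "campaigns": ["EXAMPLE_CAMPAIGN"],
}

# Which CMR search parameter each list maps to.
# Assumption: CASEI campaigns correspond to CMR projects.
PARAM_FOR = {
    "platforms": "platform",
    "instruments": "instrument",
    "campaigns": "project",
    "projects": "project",
}


def build_queries(reference: dict, page_size: int = 50) -> list:
    """Expand the reference document into one CMR search URL per name."""
    urls = []
    for kind, names in reference.items():
        param = PARAM_FOR[kind]
        for name in names:
            query = urlencode({param: name, "page_size": page_size})
            urls.append(f"{CMR_COLLECTIONS_URL}?{query}")
    return urls


queries = build_queries(REFERENCE, page_size=10)
```

A CI build script could then fetch each URL and pass the results to the frontend build, so nothing needs to be stored between builds.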
The solutions described here do not require storing of JSON or XML objects to maintain CASEI. CASEI's data requirements can be satisfied "on the fly" within a CI/CD build script.
I'm just adding a note to close out this technical exploration.
Can we actually build CASEI using CMR as a backend?
Follow ups: I think this issue fits into a broader technical discovery, which I'll create an epic for, that would ideally help us answer some additional questions such as:
Generally, how can we document all these loose thoughts and recommendations from the current technical team in a helpful and simple way that makes it easy for both the ADMG leadership and ESDS developers to make decisions about what the future of CASEI is going to look like?
Context
ADMG transition/hand-off talks with ESDS have started. But there is still uncertainty around both the timeline and future vision for CASEI after handing off to ESDS, which then makes it hard for us to prepare for that future. In an ideal world it seems that ESDS would like CASEI's metadata to come from CMR instead of our backend so that they can maintain one system. However the shape of CASEI data may/may not be the same as that of CMR so there are details that may not be clarified until time is given to really understanding both systems.
Problem
We are familiar with our system, but not CMR, which makes it hard to have productive conversations.
Acceptance Criteria
Light activity: DevSeed to tool up on our CMR knowledge to help support discussions. Be aware that CASEI is about airborne data, whereas CMR is geared towards satellites.
These conclusions can be placed in a document for now, but will likely be included later in the wiki shared with ESDS. https://github.com/NASA-IMPACT/admg-backend/issues/538