Compare harvested ISO19115 and transformed DCATUS in catalog-dev

rshewitt commented 2 months ago

User Story

In order to identify changes between documents, datagov wants to harvest an ISO19115 document and its transformed counterpart (DCATUS) on catalog-dev.

identify changes in content between an ISO19115 document and its transformed counterpart (DCATUS)

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

[ ] GIVEN an ISO19115-3 document that has been successfully transformed into DCATUS using MDTranslator
WHEN these documents are successfully harvested in catalog-dev
THEN changes between them will be documented

Background

Resources

valid ISO19139 doc

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

rshewitt commented 2 months ago

DCATUS transformed data on catalog-dev
working on getting the original ISO19115-3 xml file harvested into catalog-dev. Attempted to harvest the source but ran into a transformation issue. Catalog validates single geospatial xml files against the iso19139ngdc schema. I isolated the validation logic locally and the iso19139ngdc schema can't seem to find the MD_Metadata root element. looking more into this.
ISO19139 is the xml implementation of ISO19115
ISO19115-3 is the xml implementation of ISO19115-1. It appears this specification will replace ISO19139 (source)

rshewitt commented 2 months ago

xml validation can't find where MD_Metadata is declared in the mdb namespace which explains the issue mentioned in my last comment. mdb is declared via

<mdb:MD_Metadata xmlns:mdb="http://standards.iso.org/iso/19115/-3/mdb/2.0">
<!-- other content --> 
</mdb:MD_Metadata>

^ all the ISO19115-3 fixtures in mdtranslator do this. The root element declaration Chris MacDermaid gave me imports the namespace like that too.

I use an xml parsing extension in VSC. It processes the document according to its schemas and tells me when something is wrong. Importing the mdb namespace as shown above causes an error. However, when I remove that declaration and add http://standards.iso.org/iso/19115/-3/mdb/2.0 https://standards.iso.org/iso/19115/-3/mdb/2.0/metadataBase.xsd to xsi:schemaLocation, my xml processor doesn't complain anymore. Like so...

<mdb:MD_Metadata xsi:schemaLocation="http://standards.iso.org/iso/19115/-3/mdb/2.0 https://standards.iso.org/iso/19115/-3/mdb/2.0/metadataBase.xsd">
<!-- other content --> 
</mdb:MD_Metadata>

^ this solution doesn't resolve the underlying issue caused by xml validation in python
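To see which namespace a document's root element actually resolves to, here is a minimal sketch using only Python's standard library. It is not the catalog validation logic (that isn't shown in this thread). The helper name `root_identity` is hypothetical, and the `gmd` URI is the namespace ISO19139 documents use for `MD_Metadata`.

```python
import xml.etree.ElementTree as ET

# mdb URI is taken from the snippet above; gmd is the ISO19139 namespace.
MDB_NS = "http://standards.iso.org/iso/19115/-3/mdb/2.0"
GMD_NS = "http://www.isotc211.org/2005/gmd"


def root_identity(xml_text):
    """Return (namespace_uri, local_name) of the document's root element."""
    root = ET.fromstring(xml_text)
    # ElementTree renders qualified names as "{namespace}local".
    if root.tag.startswith("{"):
        ns, local = root.tag[1:].split("}", 1)
        return ns, local
    return "", root.tag


# Tiny stand-in documents (hypothetical, root element only).
iso19115_3 = f'<mdb:MD_Metadata xmlns:mdb="{MDB_NS}"/>'
iso19139 = f'<gmd:MD_Metadata xmlns:gmd="{GMD_NS}"/>'

print(root_identity(iso19115_3))
print(root_identity(iso19139))
```

Both roots share the local name MD_Metadata but live in different namespaces. A schema written for the gmd namespace has no declaration for the mdb one, which would be consistent with the error described above.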

rshewitt commented 2 months ago

okay so a breakdown of some relevant ISO standards (source)

Content Standard (these aren't implementation-specific, meaning the metadata can be stored as xml files, in geodatabases, or in other GIS formats)

ISO19115 (2003)
- this is a content standard not specific to any implementation
ISO19115-1 (2014)
- the replacement of ISO19115
ISO19115-2 (2009)
- extension to cover imagery/gridded data

XML Implementation Standard (these are implementation-specific)

ISO19139 (2007)
- this is the ISO19115 xml implementation
ISO19115-3 (2023)
- this is the ISO19115-1 xml implementation

catalog uses the NGDC-specific implementation of ISO19139, labelled as iso19139ngdc, for ISO19115 validation (source)

rshewitt commented 2 months ago

pausing on this ticket. need group discussion on the metadata we manage and where we wanna go. huddled with @btylerburton & @FuhuXia on getting a distribution count of ISO standards we currently manage ( e.g. ISO19115, ISO19115-1, ISO19115-2, ISO19115-3 )

FuhuXia commented 2 months ago

Have a script ready to get all WAF/WAF-collection harvest sources, their dataset counts, and a sample xml file from each source to analyze which standard it uses.

Here is the result. result.txt

rshewitt commented 2 months ago

Supported spatial document schemas/standards

ISO 19115 Metadata (ISO 19139 NGDC XSD)
FGDC minimal validation
FGDC CSDGM Version 2.0, 1998 (FGDC-STD-001-1998)
FGDC CSDGM Biological Data Profile (FGDC-STD-001.1-1999)
FGDC CSDGM Metadata Profile for Shoreline Data (FGDC-STD-001.2-2001)
FGDC Extension for Remote Sensing (FGDC-STD-012-2002)

rshewitt commented 2 months ago

Out of the 470 WAF/WAF-collection harvest sources, this is the breakdown of documents per schema. A sample xml was taken from each collection, and the schema of that xml was assumed to apply to the entire collection. So we need to be able to transform the 3 schemas below into DCATUS for harvester 2.0.

291017 are ISO19115-2 ---------\
5262 are FGDC-STD-001-1998 ---[ ISO19139 NGDC we're here currently on catalog ] --- ISO19115-3 --- DCATUS
2219 are ISO19139 -------------/

calculated via script
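The script itself isn't shown in this thread. As a rough sketch of the tallying approach described above (one sample xml per harvest source, with its schema assumed to apply to the whole collection), something like the following could work. Classifying by root element is an assumption, as are the names `ROOT_TO_SCHEMA`, `classify`, and `tally`, the "FGDC" label, and the made-up inputs.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# (namespace, root element) -> schema label. Assumed mapping for illustration.
ROOT_TO_SCHEMA = {
    ("http://www.isotc211.org/2005/gmi", "MI_Metadata"): "ISO19115-2",
    ("http://www.isotc211.org/2005/gmd", "MD_Metadata"): "ISO19139",
    ("", "metadata"): "FGDC",
}


def classify(sample_xml):
    """Guess the schema of a document from its root element."""
    root = ET.fromstring(sample_xml)
    if root.tag.startswith("{"):
        ns, local = root.tag[1:].split("}", 1)
    else:
        ns, local = "", root.tag
    return ROOT_TO_SCHEMA.get((ns, local), "unknown")


def tally(sources):
    """sources: (sample_xml, dataset_count) pairs, one per harvest source."""
    counts = Counter()
    for sample_xml, dataset_count in sources:
        # The sample's schema is assumed to hold for every dataset in the source.
        counts[classify(sample_xml)] += dataset_count
    return counts


# Made-up inputs, not figures from the real harvest sources.
sources = [
    ('<gmi:MI_Metadata xmlns:gmi="http://www.isotc211.org/2005/gmi"/>', 10),
    ('<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"/>', 3),
    ("<metadata><idinfo/></metadata>", 4),
    ('<gmi:MI_Metadata xmlns:gmi="http://www.isotc211.org/2005/gmi"/>', 5),
]
print(tally(sources))
```

Because only one sample is inspected per source, a collection that mixes schemas is counted entirely under the sample's schema. The root element alone also can't distinguish the different FGDC profiles.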

rshewitt commented 2 months ago

@FuhuXia is getting me a list of all the data provider xml urls we currently harvest. from that i'll count how many schemas we process.

FuhuXia commented 2 months ago

script updated to show ISO vs FGDC.

New list attached. result.2.txt

Bagesary commented 2 months ago

Moving this to H2.0 backlog because of reprioritization.
