GSA / data.gov

Main repository for the data.gov service
https://data.gov
Other

Compare harvested ISO19115 and transformed DCATUS in catalog-dev #4850

Open rshewitt opened 2 months ago

rshewitt commented 2 months ago

User Story

In order to identify changes between documents, datagov wants to harvest an ISO19115 document and its transformed counterpart (DCATUS) on catalog-dev.

identify changes in content between an ISO19115 document and its transformed counterpart (DCATUS)

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

Resources

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

rshewitt commented 2 months ago
rshewitt commented 2 months ago

xml validation can't find where MD_Metadata is declared in the mdb namespace, which explains the issue mentioned in my last comment. mdb is declared via

<mdb:MD_Metadata xmlns:mdb="http://standards.iso.org/iso/19115/-3/mdb/2.0">
<!-- other content --> 
</mdb:MD_Metadata>

^ all the ISO19115-3 fixtures in mdtranslator do this. The root element declaration Chris MacDermaid gave me imports the namespace like that too.

I use an xml parsing extension in VS Code. It processes the document according to its schemas and tells me when something is wrong. Importing the mdb namespace as shown above causes an error. However, when I remove that declaration and add http://standards.iso.org/iso/19115/-3/mdb/2.0 https://standards.iso.org/iso/19115/-3/mdb/2.0/metadataBase.xsd to xsi:schemaLocation, my xml processor doesn't complain anymore. Like so...

<mdb:MD_Metadata xsi:schemaLocation="http://standards.iso.org/iso/19115/-3/mdb/2.0 https://standards.iso.org/iso/19115/-3/mdb/2.0/metadataBase.xsd">
<!-- other content --> 
</mdb:MD_Metadata>

^ this solution doesn't resolve the underlying issue caused by xml validation in python
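To make the namespace vs. schemaLocation distinction above easier to inspect from python, here is a minimal sketch using only the standard library. It does not validate anything against an XSD. It only reports which namespace the root element lives in and which XSD hints xsi:schemaLocation carries. The helper names (split_tag, schema_locations) and the SAMPLE document are hypothetical. The sample declares both xmlns:mdb and xmlns:xsi, because ElementTree is namespace-aware and rejects unbound prefixes.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
MDB_NS = "http://standards.iso.org/iso/19115/-3/mdb/2.0"

# Hypothetical sample: the namespace is declared AND a schema hint is given.
SAMPLE = f"""<mdb:MD_Metadata
    xmlns:mdb="{MDB_NS}"
    xmlns:xsi="{XSI_NS}"
    xsi:schemaLocation="{MDB_NS} https://standards.iso.org/iso/19115/-3/mdb/2.0/metadataBase.xsd">
</mdb:MD_Metadata>"""


def split_tag(tag):
    """Split ElementTree's '{namespace}local' tag form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return "", tag


def schema_locations(root):
    """Return {namespace: xsd_url} parsed from the root's xsi:schemaLocation."""
    raw = root.get(f"{{{XSI_NS}}}schemaLocation", "")
    tokens = raw.split()
    # schemaLocation is a whitespace-separated list of namespace/URL pairs.
    return dict(zip(tokens[0::2], tokens[1::2]))


root = ET.fromstring(SAMPLE)
root_ns, root_name = split_tag(root.tag)
locations = schema_locations(root)

# True when the root element's namespace has a schema hint a validator could load.
has_hint = root_ns in locations
print(root_name, root_ns, locations.get(root_ns), has_hint)
```

A check like this can help separate "the prefix is not bound" from "the validator has no XSD for that namespace", which are two different failures.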

rshewitt commented 2 months ago

okay so a breakdown of some relevant ISO standards (source)

Content Standard (these aren't implementation-specific, meaning they can be xml files, stored in geodatabases, or in other GIS formats)

catalog uses the NGDC-specific implementation of ISO19139, labelled as iso19139ngdc, for ISO19115 validation (source)

rshewitt commented 2 months ago

pausing on this ticket. need group discussion on the metadata we manage and where we wanna go. huddled with @btylerburton & @FuhuXia on getting a distribution count of ISO standards we currently manage (e.g. ISO19115, ISO19115-1, ISO19115-2, ISO19115-3)

FuhuXia commented 2 months ago

Have a script ready to get all WAF/WAF-collection harvest sources, their dataset counts, and a sample xml file from each for standards analysis.

Here is the result. result.txt

rshewitt commented 2 months ago

Supported spatial document schemas/standards

rshewitt commented 2 months ago

Out of the 470 WAF/WAF-collection harvest sources, this is the breakdown of documents per schema. So we need to be able to transform those 3 schemas into DCATUS for harvester 2.0. A sample xml was taken from each collection, and the schema of that xml was assumed to apply to the entire collection.

calculated via script
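The actual script isn't shown in this thread, so the following is only a sketch of how the sampling approach described above could work: classify one sample xml per collection by its root element, then credit that collection's whole dataset count to the detected standard. The function names, the labels, and the namespace-to-standard mapping are assumptions for illustration, not the three schemas from the result. The mapping relies on the usual roots gmd:MD_Metadata, gmi:MI_Metadata, mdb:MD_Metadata, and an un-namespaced metadata element for FGDC CSDGM.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed mapping from root-element namespace to a human-readable label.
ROOT_NAMESPACE_TO_STANDARD = {
    "http://www.isotc211.org/2005/gmd": "ISO19115 via ISO19139 (gmd:MD_Metadata)",
    "http://www.isotc211.org/2005/gmi": "ISO19115-2 (gmi:MI_Metadata)",
    "http://standards.iso.org/iso/19115/-3/mdb/2.0": "ISO19115-3 (mdb:MD_Metadata)",
}


def detect_standard(xml_text):
    """Guess the metadata standard of a document from its root element."""
    root = ET.fromstring(xml_text)
    if root.tag.startswith("{"):
        namespace, _, local = root.tag[1:].partition("}")
    else:
        namespace, local = "", root.tag
    if namespace in ROOT_NAMESPACE_TO_STANDARD:
        return ROOT_NAMESPACE_TO_STANDARD[namespace]
    # FGDC CSDGM records use an un-namespaced <metadata> root.
    if not namespace and local == "metadata":
        return "FGDC CSDGM"
    return "unknown"


def tally_by_standard(samples):
    """samples: iterable of (sample_xml_text, dataset_count), one per collection.

    The sample's standard is assumed to apply to the whole collection.
    """
    counts = Counter()
    for xml_text, dataset_count in samples:
        counts[detect_standard(xml_text)] += dataset_count
    return counts


# Hypothetical collections, made up for illustration only.
samples = [
    ("<gmd:MD_Metadata xmlns:gmd='http://www.isotc211.org/2005/gmd'/>", 10),
    ("<metadata><idinfo/></metadata>", 5),
    ("<gmi:MI_Metadata xmlns:gmi='http://www.isotc211.org/2005/gmi'/>", 7),
    ("<gmd:MD_Metadata xmlns:gmd='http://www.isotc211.org/2005/gmd'/>", 3),
]
breakdown = tally_by_standard(samples)
for standard, count in breakdown.most_common():
    print(f"{count:>6}  {standard}")
```

Because only one file per collection is inspected, a collection that mixes standards would be miscounted. Sampling several files per source would reduce that risk.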

rshewitt commented 2 months ago

@FuhuXia is getting me a list of all the data provider xml urls we currently harvest. from that i'll count how many schemas we process.

FuhuXia commented 2 months ago

script updated to show ISO vs FGDC.

New list attached. result.2.txt

Bagesary commented 2 months ago

Moving this to H2.0 backlog because of reprioritization.