GSA / data.gov

Main repository for the data.gov service
https://data.gov

Create ISO19115-3 reader for MDTranslator #4639

Closed btylerburton closed 2 months ago

btylerburton commented 9 months ago

User Story

In order to transform an ISO-19115 source to DCAT-US, datagovteam needs to work with the MDTranslator team to implement an ISO-19115 reader for the MDTranslator.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

ISO to DCATUS Progress

(source)

program code and bureau code aren't required in the non-federal version of DCATUS. there's also no mapping from iso to these, so we're skipping them.

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

MDTranslator already has an ISO writer, so the work of implementing a reader will involve porting that transformer from writer to reader, applying adequate testing, and then confirming that it works by testing it on a live source.

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch ( deployment )

Sketch ( feature merge with upstream )

Sketch ( feature high-level )

Sketch ( feature low-level )

Resources

What mdjson/internal object properties need to be populated?

anything less and we risk being confused by data not being transformed simply because it's not accounted for

rshewitt commented 9 months ago

i'd like to do this work if it's cool with the team

rshewitt commented 5 months ago

dcatus writer branch remains as a draft pr with 185 commits and 248 file changes going to develop. the latest release branch (2.19.0) was merged into the dcatus writer branch 3 weeks ago.

rshewitt commented 5 months ago

iso_19115_3 writer

rshewitt commented 5 months ago

the mdtranslator transformation process involves 3 schemas.

since there's an existing iso 19115-3 writer that means schema 2 -> 3 exists in some fashion. what we're trying to do is get from schema 1 -> 2. i wouldn't be surprised if it's not that simple though and more work may need to be done on 2 -> 3

rshewitt commented 5 months ago

iso reader first commit. I successfully ran a transformation using

bundle exec mdtranslator translate [FILE] -r=iso19115_3 -w=dcat_us

where [FILE] = xml. linking on gdrive because github doesn't support .xml attachments in comments apparently.

there's no information actually being transformed so that needs to be done. the only thing that it does is read the xml and prepare the internal metadata object ( mentioned in my previous comment as schema 2 ) with enough information for the process to complete.

rshewitt commented 5 months ago

existing work on iso 19115-3 reader

rshewitt commented 5 months ago

1410 files inspected, 15857 offenses detected, 2618 offenses autocorrectable

there are too many things to fix when linting the entire repo. i'm not great at bulk text editing either, so i'm just going to apply the linting to our addition(s)

rshewitt commented 5 months ago

schema resources

rshewitt commented 5 months ago

mdtranslator has processing flags in the output. each of these has a pass flag and a messages flag ( e.g. readerStructurePass & readerStructureMessages )
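To make the pass/messages pairing concrete, here is a minimal sketch of how a caller might summarize those flags. The `response` hash is a hypothetical sample: only the reader flag names appear in this thread, the values are invented for illustration, and `failed_stages` is a helper name made up for this example, not part of mdtranslator.

```ruby
# Hypothetical sample of the response hash; only the reader flag names
# come from this thread, the values are made up for illustration.
response = {
  readerStructurePass: true,
  readerStructureMessages: [],
  readerExecutionPass: false,
  readerExecutionMessages: [
    "WARNING: ISO19115-3 reader: element 'cit:title' is missing in 'cit:CI_Citation'"
  ]
}

# Pair every "...Pass" flag with its "...Messages" list and return the
# stages whose pass flag is explicitly false, keyed by stage name.
def failed_stages(response)
  response.each_with_object({}) do |(key, value), failures|
    name = key.to_s
    next unless name.end_with?('Pass')
    next unless value == false

    stage = name.delete_suffix('Pass')
    failures[stage] = response.fetch(:"#{stage}Messages", [])
  end
end

failed_stages(response).each do |stage, messages|
  puts "#{stage} failed:"
  messages.each { |m| puts "  #{m}" }
end
```

Keying on the `Pass` suffix means the helper picks up any other stage that follows the same naming pattern without listing each one by hand.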

rshewitt commented 4 months ago

discussed previous prs ( 07/03/24 )

rshewitt commented 4 months ago

discussed previous prs ( 07/17/24 )

Bagesary commented 4 months ago

Refactoring is in progress.

rshewitt commented 4 months ago

a rule i'm implementing in my next minor refactor is explicitly indicating whether something is optional or required and providing the schema element as evidence. for example,

# :title (required)
# <element name="title" type="gco:CharacterString_PropertyType">
title = xCitation.xpath(@@titleXPath)[0]
if title.nil?
   msg = 'WARNING: ISO19115-3 reader: element \'cit:title\' '\
      'is missing in \'cit:CI_Citation\''
   hResponseObj[:readerExecutionMessages] << msg
   hResponseObj[:readerExecutionPass] = false
   return nil
end

cit:title is a required element within cit:CI_Citation, as indicated by the schema element provided as a comment. when an xsd element declaration doesn't specify a minimum number of occurrences (e.g. minOccurs="0" for an optional element), the default is 1 (source), meaning it's required. including the schema element as a comment gives evidence for why something is optional/required rather than just assuming it's correct.
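The minOccurs rule can be checked mechanically. The sketch below assumes the schema declaration is available as a well-formed xml string (so the declarations are self-closed here) and uses REXML, not whatever parser the reader uses. `required_element?` is a hypothetical helper name, and the second declaration is invented to show the optional case.

```ruby
require 'rexml/document'

# Returns true when an element declaration is required, i.e. its
# minOccurs is absent (the XSD default is 1) or greater than zero.
def required_element?(declaration_xml)
  element = REXML::Document.new(declaration_xml).root
  min_occurs = element.attributes['minOccurs']
  min_occurs.nil? || min_occurs.to_i > 0
end

title = '<element name="title" type="gco:CharacterString_PropertyType"/>'
# Made-up declaration, only here to illustrate an optional element.
optional = '<element name="example" type="gco:CharacterString_PropertyType" minOccurs="0"/>'

puts required_element?(title)    # true
puts required_element?(optional) # false
```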

rshewitt commented 4 months ago

we're prioritizing the information dcat needs instead of making the iso reader feature complete. this means the iso19115-3 reader will store information in the mdJson according to how the dcatus writer wants it. it's not considering how other writers want the data, so this feature is rigid but fits our needs. @Bagesary

rshewitt commented 4 months ago

here's the exhaustive list of "sections" (i.e. modules) for the dcatus writer. there's 29. this doesn't reflect anything within the files themselves.

Bagesary commented 4 months ago

@rshewitt to bring Chris and Jonathan up to speed on the roadblocks

rshewitt commented 3 months ago

some refactoring that might be needed as we further test the iso reader includes:

all of these are straightforward and not time consuming to resolve

rshewitt commented 3 months ago

LandingPage, SystemOfRecords, DescribedByType, AccrualPeriodicity, & PrimaryITInvestmentUII are done in my recent work (haven't pushed to remote) which puts us at 80% complete. the remaining properties require further investigation into how contacts are managed.

:contacts is a root level property in the internal object, but contact information in the iso document occurs as either an individual or an organization, which falls within a responsible party, which is what the mdb:contact element is. so when processing individual/organization elements, the contact information within them needs to bubble up to the root level. there are helper functions which deal with contacts.
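A rough sketch of the "bubble up" idea, using plain hashes. Everything here is a simplified assumption for illustration: the hash shapes, the key names, and the helpers `add_contact` and `unpack_responsible_party` are hypothetical and are not the reader's real helper functions or the real internal object layout.

```ruby
require 'securerandom'

# Add a contact to the root-level list (or reuse an existing entry with
# the same name) and return its id. Hypothetical, simplified shape.
def add_contact(internal_obj, contact)
  existing = internal_obj[:contacts].find { |c| c[:name] == contact[:name] }
  return existing[:contactId] if existing

  contact_id = SecureRandom.uuid
  internal_obj[:contacts] << contact.merge(contactId: contact_id)
  contact_id
end

# Unpack a responsible party: each individual/organization goes to the
# root :contacts list, and the party keeps only its role plus id references.
def unpack_responsible_party(internal_obj, party)
  ids = party[:parties].map { |contact| add_contact(internal_obj, contact) }
  { roleName: party[:role], parties: ids.map { |id| { contactId: id } } }
end

internal_obj = { contacts: [] }
party = {
  role: 'pointOfContact',
  parties: [
    { name: 'Example Org', isOrganization: true },
    { name: 'Jane Doe', isOrganization: false }
  ]
}

responsibility = unpack_responsible_party(internal_obj, party)
puts internal_obj[:contacts].length  # 2
puts responsibility[:parties].length # 2
```

Storing only ids inside the party means the same person or organization can be referenced from several places without duplicating their details.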

rshewitt commented 3 months ago

there's no evaluation of required fields in the dcatus writer. since every other writer has that kind of functionality, i would assume that work still needs to be done.
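As a starting point, a required-field check could look something like the sketch below. It assumes the writer's output can be inspected as a hash with string keys. The field list is my reading of the always-required DCAT-US v1.1 dataset fields, with bureauCode and programCode left out as in the non-federal case discussed earlier in this issue. `REQUIRED_FIELDS` and `missing_required` are hypothetical names, not existing mdtranslator code.

```ruby
# Assumed list of always-required DCAT-US dataset fields (non-federal),
# so bureauCode and programCode are intentionally left out.
REQUIRED_FIELDS = %w[
  title description keyword modified
  publisher contactPoint identifier accessLevel
].freeze

# Return the required fields that are absent, nil, or empty.
def missing_required(dataset)
  REQUIRED_FIELDS.select do |field|
    value = dataset[field]
    value.nil? || (value.respond_to?(:empty?) && value.empty?)
  end
end

# Made-up dataset for illustration.
dataset = {
  'title' => 'Example dataset',
  'description' => 'An example record',
  'keyword' => [],
  'identifier' => 'example-001'
}

puts missing_required(dataset).join(', ')
# keyword, modified, publisher, contactPoint, accessLevel
```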

rshewitt commented 3 months ago

test of full translation of iso19115-3 to dcatus!

Bagesary commented 3 months ago

Reid will review this ticket to check for outstanding work and create appropriate new tickets. This ticket must be linked to all the child tickets.

Bagesary commented 3 months ago

@rshewitt will check with @jbrown-xentity if all the outstanding tickets have been created and included in the Harvestor document. Then this ticket is good to close.

Bagesary commented 3 months ago

Harvestor document : https://docs.google.com/document/d/1RMHSXSTm2ndr_jAEDHsoT1aRV6Ig4Oq2vzT4Uj6NOxk/edit#heading=h.78eh6qtk0voh