Closed btylerburton closed 2 months ago
i'd like to do this work if it's cool with the team
dcatus writer branch remains as a draft pr with 185 commits and 248 file changes going to develop. the latest release branch (2.19.0) was merged into the dcatus writer branch 3 weeks ago.
the mdtranslator transformation process involves 3 schemas.
since there's an existing iso 19115-3 writer that means schema 2 -> 3 exists in some fashion. what we're trying to do is get from schema 1 -> 2. i wouldn't be surprised if it's not that simple though and more work may need to be done on 2 -> 3
iso reader first commit. I successfully ran a transformation using
bundle exec mdtranslator translate [FILE] -r=iso19115_3 -w=dcat_us
where [FILE] = xml. linking on gdrive because github doesn't support .xml attachments in comments apparently.
there's no information actually being transformed so that needs to be done. the only thing that it does is read the xml and prepare the internal metadata object ( mentioned in my previous comment as schema 2 ) with enough information for the process to complete.
existing work on iso 19115-3 reader
1410 files inspected, 15857 offenses detected, 2618 offenses autocorrectable
there are too many things to fix when linting the entire repo. i'm not great at bulk text editing either so i'm just going to apply the linting to our addition(s)
mdtranslator has processing flags in the output. each of these has a pass and message flag ( e.g. readerStructurePass & readerStructureMessages )
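a minimal sketch of how those flags could be checked after a run, assuming the response is a plain ruby hash. `failed_stages` is a hypothetical helper (not part of mdtranslator), and the sample hash only lists the two reader flag pairs that show up in this thread.

```ruby
# hypothetical helper: collect the messages of every stage whose *Pass flag is false.
# assumes each stage stores a boolean under :<stage>Pass and an array under :<stage>Messages.
def failed_stages(response)
  failures = {}
  response.each do |key, value|
    name = key.to_s
    next unless name.end_with?('Pass') && value == false

    stage = name.delete_suffix('Pass')
    failures[stage] = response.fetch(:"#{stage}Messages", [])
  end
  failures
end

# sample response shaped like the flags mentioned above (values are made up)
response = {
  readerStructurePass: true,
  readerStructureMessages: [],
  readerExecutionPass: false,
  readerExecutionMessages: ["WARNING: ISO19115-3 reader: element 'cit:title' is missing in 'cit:CI_Citation'"]
}

failed_stages(response).each do |stage, messages|
  puts "#{stage} failed:"
  messages.each { |m| puts "  #{m}" }
end
```

pairing each pass flag with its message list keeps the check generic, so a new stage would not need its own branch.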
discussed previous prs ( 07/03/24 )
discussed previous prs ( 07/17/24 )
Refactoring is in progress.
a rule i'm implementing in my next minor refactor is explicitly indicating whether something is optional or required and providing the schema element as evidence. for example,
# :title (required)
# <element name="title" type="gco:CharacterString_PropertyType">
title = xCitation.xpath(@@titleXPath)[0]
if title.nil?
  msg = 'WARNING: ISO19115-3 reader: element \'cit:title\' '\
        'is missing in \'cit:CI_Citation\''
  hResponseObj[:readerExecutionMessages] << msg
  hResponseObj[:readerExecutionPass] = false
  return nil
end
cit:title is a required element within cit:CI_Citation, as indicated by the schema element provided as a comment. when an xml schema element doesn't specify a minimum number of occurrences (e.g. minOccurs="0") the default is 1 (source), meaning it's required. providing the schema element as a comment gives evidence for why something is optional/required rather than just assuming it's correct.
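the minOccurs rule can be sketched in a few lines. this assumes the schema element's attributes have already been read into a hash of strings; `required_element?` and the optional element below are hypothetical and only illustrate the default.

```ruby
# hypothetical helper: decide whether a schema element is required.
# attrs is the element's attributes as a hash of strings, e.g. { 'name' => 'title' }.
def required_element?(attrs)
  # xml schema defaults minOccurs to 1 when the attribute is absent
  min_occurs = Integer(attrs.fetch('minOccurs', '1'))
  min_occurs >= 1
end

title = { 'name' => 'title', 'type' => 'gco:CharacterString_PropertyType' }
optional_example = { 'name' => 'example', 'minOccurs' => '0' } # made-up element

puts required_element?(title)            # true
puts required_element?(optional_example) # false
```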
we're prioritizing the information dcat needs instead of making the iso reader feature complete. this means the iso19115-3 reader will store information in the mbJson according to how the dcatus writer wants it. it's not considering how other writers want the data to be so this feature is rigid but fits our needs. @Bagesary
here's the exhaustive list of "sections" (i.e. modules) for the dcatus writer. there's 29. this doesn't reflect anything within the files themselves.
[ ] dcat_us_access_level.rb
mbJSON[:metadata][:resourceInfo][:constraints]
mbJSON[:metadata][:metadataInfo][:metadataConstraints]. it's not in resourceInfo but metadataInfo. this suggests a duplication of data in different places to ensure writes work correctly.
[ ] dcat_us_access_url.rb
[ ] dcat_us_accrualPeriodicity.rb
[ ] dcat_us_bureau_code.rb
[ ] dcat_us_contact_point.rb
[ ] dcat_us_dcat_us.rb
[ ] dcat_us_described_by.rb
[ ] dcat_us_described_by_type.rb
[ ] dcat_us_description.rb
[ ] dcat_us_distribution.rb
[ ] dcat_us_download_url.rb
[ ] dcat_us_identifier.rb
[ ] dcat_us_is_part_of.rb
[ ] dcat_us_issued.rb
[ ] dcat_us_keyword.rb
[ ] dcat_us_landing_page.rb
[ ] dcat_us_language.rb
[ ] dcat_us_license.rb
[ ] dcat_us_media_type.rb
[ ] dcat_us_modified.rb
[ ] dcat_us_primaryITInvestmentUII.rb
[ ] dcat_us_program_code.rb
[ ] dcat_us_publisher.rb
[ ] dcat_us_references.rb
[ ] dcat_us_rights.rb
[ ] dcat_us_spatial.rb
[ ] dcat_us_system_of_records.rb
[ ] dcat_us_temporal.rb
[ ] dcat_us_theme.rb
@rshewitt to bring Chris and Jonathan up to speed on the roadblocks
some refactoring that might be needed as we further test the iso reader includes:
defaulting the return of unpack functions to the associated internal metadata object rather than nil. this is in response to my conversation with johnathan.
removing the initial check for the metadata block the unpack functions are responsible for parsing. here's an example of what that looks like. this is a pattern followed in many of the modules.
# ...
def self.unpack(args)
  # MD Medium (required)
  xMedium = xParent.xpath(@@mediumXPath)[0]
  if xMedium.nil?
    # both halves use double quotes so the two literals join cleanly
    # and #{xParent.name} is interpolated
    msg = "WARNING: ISO19115-3 reader: element 'mrd:MD_Medium' " \
          "is missing in #{xParent.name}"
    hResponseObj[:readerExecutionMessages] << msg
    hResponseObj[:readerExecutionPass] = false
    return nil
  end
  # ...
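one way the first refactor (returning the internal object rather than nil) could look, sketched under assumptions: the xml node is replaced by a plain hash so the example runs without nokogiri, and `new_medium`, its keys, and `unpack_medium` are hypothetical stand-ins rather than the reader's actual code.

```ruby
# hypothetical stand-in for an internal metadata object constructor
def new_medium
  { mediumSpecification: {}, density: nil, note: nil }
end

# sketch of an unpack that records the warning but returns the empty
# internal object instead of nil. x_parent is a hash standing in for an xml node.
def unpack_medium(x_parent, response)
  medium = new_medium
  x_medium = x_parent[:medium]

  if x_medium.nil?
    response[:readerExecutionMessages] <<
      "WARNING: ISO19115-3 reader: element 'mrd:MD_Medium' is missing in #{x_parent[:name]}"
    response[:readerExecutionPass] = false
    return medium
  end

  medium[:note] = x_medium[:note]
  medium
end

response = { readerExecutionPass: true, readerExecutionMessages: [] }
result = unpack_medium({ name: 'mrd:MD_DigitalTransferOptions' }, response)
puts result.inspect
puts response[:readerExecutionMessages]
```

with this shape a caller can read keys off the result without a nil guard, while the warning still lands in the response object.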
all of these are straightforward and not time consuming to resolve
LandingPage, SystemOfRecords, DescribedByType, AccrualPeriodicity, & PrimaryITInvestmentUII are done in my recent work (haven't pushed to remote) which puts us at 80% complete. the remaining properties require further investigation into how contacts are managed.
:contacts is a root level property in the internal object, but contact information in the iso document occurs as either an individual or an organization inside a responsible party, which is what the mdb:contact element is. so when processing individual/organization elements the contact information within them needs to bubble up to the root level. there are helper functions which deal with contacts.
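a rough sketch of the "bubble up" idea, assuming the internal object is a nested hash. `add_contact` is a hypothetical helper written for this comment, not one of the existing contact helpers, and the contact keys and values are made up for illustration.

```ruby
# hypothetical helper: register a contact at the root of the internal object
# and return its id so the responsible party can reference it.
def add_contact(int_obj, contact)
  int_obj[:contacts] ||= []
  existing = int_obj[:contacts].find { |c| c[:contactId] == contact[:contactId] }
  int_obj[:contacts] << contact if existing.nil?
  contact[:contactId]
end

int_obj = { contacts: [], metadata: { metadataInfo: { metadataContacts: [] } } }

# an organization found while reading a responsible party (values are made up)
organization = { contactId: 'org-1', isOrganization: true, name: 'example agency' }

contact_id = add_contact(int_obj, organization)
responsibility = { roleName: 'pointOfContact', parties: [{ contactId: contact_id }] }
int_obj[:metadata][:metadataInfo][:metadataContacts] << responsibility

# adding the same contact again does not duplicate it
add_contact(int_obj, organization)
puts int_obj[:contacts].length # 1
```

the responsible party only keeps the id, so the full contact lives once at the root no matter how many parties reference it.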
there's no evaluation of required fields in the dcatus writer. since every writer has that kind of functionality i would assume that work still needs to be done.
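a minimal sketch of what a required-field check could look like, assuming the writer output is a hash keyed by dcat-us property names. the field lists follow my reading of the dcat-us v1.1 dataset fields and should be verified against the schema; `missing_required` and the `federal:` switch are hypothetical.

```ruby
# assumed required dataset fields; bureauCode and programCode only apply to federal sources
REQUIRED_FIELDS = %w[title description keyword modified publisher contactPoint identifier accessLevel].freeze
FEDERAL_ONLY_FIELDS = %w[bureauCode programCode].freeze

# hypothetical helper: return the names of required fields that are nil or empty
def missing_required(dataset, federal: false)
  required = REQUIRED_FIELDS + (federal ? FEDERAL_ONLY_FIELDS : [])
  required.select do |field|
    value = dataset[field]
    value.nil? || (value.respond_to?(:empty?) && value.empty?)
  end
end

# made-up writer output with only a few properties populated
dataset = { 'title' => 'example', 'description' => '', 'keyword' => ['water'] }
puts missing_required(dataset).inspect
```

returning the list of missing names (rather than a boolean) would let the writer put them straight into its messages.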
Reid will review this ticket to check for outstanding work and create appropriate new tickets. This ticket must be linked to all the child tickets.
@rshewitt will check with @jbrown-xentity if all the outstanding tickets have been created and included in the Harvestor document. Then this ticket is good to close.
User Story
In order to transform an ISO-19115 source to DCAT-US, datagovteam needs to work with the MDTranslator team to implement an ISO-19115 reader for the MDTranslator.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
ISO to DCATUS Progress
(source)
program code and bureau code aren't required in the non-federal version of DCATUS. there's also no mapping from iso to this so we're skipping them.
Background
[Any helpful contextual notes or links to artifacts/evidence, if needed]
MDTranslator already has an ISO writer, so the work of implementing a reader will involve porting that transformer from writer to reader, applying adequate testing, and then confirming that it works by testing it on a live source.
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]
Sketch ( deployment )
Sketch ( feature merge with upstream )
Sketch ( feature high-level )
Sketch ( feature low-level )
Resources
What mdjson/internal object properties need to be populated?
:schema
:contacts
:metadata
  :metadataInfo
    :metadataIdentifier
    :parentMetadata
    :defaultMetadataLocale
    :otherMetadataLocales
    :metadataContacts
      :roleName
      :roleExtents
        :description
        :geographicExtents
          :description
          :containsData
          :identifier
          :boundingBox
          :geographicElements
          :nativeGeoJson
          :computedBbox
        :temporalExtents
        :verticalExtents
      :parties
    :metadataDates
    :metadataLinkages
    :metadataConstraints
    :metadataMaintenance
    :alternateMetadataReferences
    :metadataStatus
    :extensions
  :resourceInfo
  :lineageInfo
  :distributorInfo
  :associatedResources
  :additionalDocuments
  :funding
  :dataQuality
:dataDictionaries
:metadataRepositories
anything less and we risk being confused by data not being transformed simply because it's not accounted for
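to make that risk visible, a small sketch that lists which properties are still empty after a read. it assumes the internal object is a nested hash; `unpopulated_paths` is a hypothetical helper and the sample object is trimmed down and made up.

```ruby
# hypothetical helper: list the paths of every property that is still nil or empty
def unpopulated_paths(obj, prefix = [])
  obj.flat_map do |key, value|
    path = prefix + [key]
    if value.is_a?(Hash) && !value.empty?
      # descend into populated hashes to find empty leaves
      unpopulated_paths(value, path)
    elsif value.nil? || (value.respond_to?(:empty?) && value.empty?)
      [path.join('.')]
    else
      []
    end
  end
end

# trimmed-down, made-up internal object
int_obj = {
  schema: { name: 'iso19115_3' },
  contacts: [],
  metadata: {
    metadataInfo: { metadataIdentifier: {}, metadataContacts: [{ roleName: 'author' }] },
    resourceInfo: {}
  },
  dataDictionaries: [],
  metadataRepositories: []
}

puts unpopulated_paths(int_obj)
```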