GSA / data.gov

Main repository for the data.gov service
https://data.gov

[Spike] mdTranslator exploration #4200

Closed jbrown-xentity closed 1 year ago

jbrown-xentity commented 1 year ago

Purpose

We want to see if mdTranslator is a useful tool to build off of.

Given this uncertainty, we need hands-on testing to inform the next steps.

Three days of effort have been allocated. Once complete, the findings will be demonstrated and specific follow-up actions will be decided.

Acceptance Criteria

[ACs should be clearly demo-able/verifiable whenever possible. Try specifying them using BDD.]

Background

https://docs.google.com/document/d/1XzfTrPxu-asJ_55GoeZ2UOJsie9CuCegStS28BAL_40/edit#heading=h.pallknm1j7lu
https://github.com/adiwg/mdTranslator
https://mdtools.adiwg.org/

Sketch

The developer should be able to fully set up the local dev environment for https://github.com/adiwg/mdTranslator. Ideally, as time permits, a new output format should be created that can export a DCAT-US (JSON) object with a title and description. If there is still time, we could explore deploying it to a cloud.gov environment as a new app, using the cloud.gov Ruby Hello World examples.

Jin-Sun-tts commented 1 year ago

mdTranslator translates input metadata (such as mdJson) into one or more established metadata standards:

reader(from): fgdc, mdJson, sbJson
writer(to): fgdc, html, iso19110, iso19115_1, iso19115_2

We need to add the following modules for our source data formats and DCAT-US output:

reader: iso19115, arcgis
writer: dcat-us
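
The reader/writer split can be pictured as two lookup tables around a shared intermediate object. The Python sketch below is not mdTranslator's code (which is Ruby); the registries, `translate`, and the toy reader and writer are hypothetical names that only illustrate where a new reader or a `dcat-us` writer would plug in.

```python
import json
from typing import Callable, Dict

# Hypothetical registries: format name -> function. Not mdTranslator's API.
READERS: Dict[str, Callable[[str], dict]] = {}
WRITERS: Dict[str, Callable[[dict], str]] = {}


def read_toy_json(text: str) -> dict:
    """Toy reader: parse flat JSON into a minimal intermediate object."""
    doc = json.loads(text)
    return {"title": doc.get("title"), "abstract": doc.get("abstract")}


def write_dcatus(internal: dict) -> str:
    """Toy writer: emit only the two fields covered by this spike."""
    return json.dumps(
        {"title": internal["title"], "description": internal["abstract"]}
    )


# Registering a module is all that "adding a reader/writer" means here.
READERS["toy-json"] = read_toy_json
WRITERS["dcat-us"] = write_dcatus


def translate(text: str, reader: str, writer: str) -> str:
    """Run reader -> intermediate object -> writer."""
    if reader not in READERS:
        raise ValueError(f"no reader registered for {reader!r}")
    if writer not in WRITERS:
        raise ValueError(f"no writer registered for {writer!r}")
    return WRITERS[writer](READERS[reader](text))
```

Under this layout, supporting iso19115 or arcgis input would mean registering one more reader; the writer side would not change.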

Here is an example of adding a writer to translate from FGDC XML to DCAT-US (JSON), covering title and description only:

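require 'jbuilder'
require_relative 'dcatusJson_dataset'

module ADIWG
   module Mdtranslator
      module Writers
         module DcatusJson

            module DcatusJson

               # Build the top-level DCAT-US catalog object from the
               # internal object produced by a reader.
               def self.build(intObj, hResponseObj)

                  Jbuilder.new do |json|

                     json.conformsTo 'https://project-open-data.cio.gov/v1.1/schema'
                     # DCAT-US names this key "@type"; set! is needed because
                     # "@type" cannot be written as a method call
                     json.set!('@type', 'dcat:Catalog')

                     # title and description only, for this spike
                     json.dataset Dataset.build(intObj[:metadata])

                  end
               end # build
            end # DcatusJson

         end
      end
   end
end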

dcatusJson_dataset.rb:

require 'jbuilder'
require_relative 'dcatusJson_resourceInfo'

module ADIWG
   module Mdtranslator
      module Writers
         module DcatusJson

            module Dataset

               @Namespace = ADIWG::Mdtranslator::Writers::DcatusJson

               def self.build(hMetadata)
                  resourceInfo = hMetadata[:resourceInfo]
                  hCitation = resourceInfo[:citation]

                  Jbuilder.new do |json|
                     json.title hCitation[:title]
                     json.description resourceInfo[:abstract]
                  end
               end # build
            end # Dataset

         end
      end
   end
end

Below is sample Python code that could be used to translate FGDC XML files into DCAT-US (JSON) format.

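import json

import xmltodict  # third-party: pip install xmltodict

# Read the FGDC XML record
with open('input.xml', encoding='utf-8') as xml_file:
    data = xml_file.read()

# Parse the XML into nested dictionaries keyed by element name
fgdc_dict = xmltodict.parse(data)

# Map the two FGDC fields of interest onto DCAT-US keys.
# A missing element raises KeyError here.
idinfo = fgdc_dict['metadata']['idinfo']
dcat_us_dict = {
    'conformsTo': 'https://project-open-data.cio.gov/v1.1/schema',
    '@type': 'dcat:Catalog',
    'dataset': {
        'title': idinfo['citation']['citeinfo']['title'],
        'description': idinfo['descript']['abstract'],
    }
}

# Write the result as indented JSON
with open('dcatus_output.json', 'w', encoding='utf-8') as json_file:
    json.dump(dcat_us_dict, json_file, indent=4)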
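
The same mapping can be sketched with only the standard library, reporting missing elements explicitly instead of failing on a key lookup. This is a minimal sketch, not part of mdTranslator or the GSA fork: the function name `fgdc_to_dcatus` is hypothetical, and the output keeps the same single-dataset shape as the sample above.

```python
import xml.etree.ElementTree as ET


def fgdc_to_dcatus(xml_text: str) -> dict:
    """Map an FGDC record's title and abstract onto DCAT-US keys."""
    root = ET.fromstring(xml_text)  # expected root element: <metadata>

    # findtext returns None when the element is absent
    fields = {
        "title": root.findtext("idinfo/citation/citeinfo/title"),
        "description": root.findtext("idinfo/descript/abstract"),
    }

    # Treat absent or whitespace-only elements as missing
    missing = [name for name, text in fields.items()
               if text is None or not text.strip()]
    if missing:
        raise ValueError("FGDC record is missing: " + ", ".join(missing))

    return {
        "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
        "@type": "dcat:Catalog",
        "dataset": {name: text.strip() for name, text in fields.items()},
    }
```

Calling `json.dumps(fgdc_to_dcatus(xml_text), indent=4)` would then produce the JSON document.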

mdTranslator includes several features that we do not need for our translation. Our goal is only to extract the relevant information from the input files and generate DCAT-US output.

Translation is also only one step in the overall ETL process. Other steps, such as validation and remote file handling, need to be considered too, so implementing a single process that covers all of the necessary parts would be more efficient and easier to maintain.
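
As an illustration of treating translation as one stage among several, the sketch below chains a transform step with a validation step and keeps rejected records with their reasons. All names here are hypothetical, and the validation checks only the two fields covered in this spike, not the full DCAT-US schema. Remote file handling would come before `run_pipeline` and is left out.

```python
from typing import Callable, Iterable, List, Tuple

# Assumption: only the two fields from this spike are checked.
REQUIRED_FIELDS = ("title", "description")


def validate_dataset(dataset: dict) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = dataset.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return problems


def run_pipeline(
    sources: Iterable[str],
    transform: Callable[[str], dict],
) -> Tuple[List[dict], List[Tuple[str, List[str]]]]:
    """Transform each source, validate it, and split valid from rejected."""
    valid, rejected = [], []
    for source in sources:
        try:
            dataset = transform(source)
        except Exception as exc:  # keep going; record why this one failed
            rejected.append((source, [f"transform failed: {exc}"]))
            continue
        problems = validate_dataset(dataset)
        if problems:
            rejected.append((source, problems))
        else:
            valid.append(dataset)
    return valid, rejected
```

Keeping the rejected records alongside their reasons makes it possible to report harvest failures without stopping the whole run.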

Jin-Sun-tts commented 1 year ago

Following the example from cf-hello-worlds, I deployed a simple Ruby app to a cloud.gov sandbox.

Jin-Sun-tts commented 1 year ago

Created a GSA fork to save the testing code: https://github.com/GSA/mdTranslator

Also pushed a Ruby app to the cloud.gov sandbox: https://metadata-translator-test-unexpected-squirrel-lo.app.cloud.gov/

jbrown-xentity commented 1 year ago

Looks great, thanks @Jin-Sun-tts !!