geopython / pycsw

pycsw is an OGC CSW server implementation written in Python. pycsw fully implements the OpenGIS Catalogue Service Implementation Specification [Catalogue Service for the Web]. Initial development started in 2010 (more formally announced in 2011). The project is certified OGC Compliant, and is an OGC Reference Implementation. pycsw allows for the publishing and discovery of geospatial metadata via numerous APIs (CSW 2/CSW 3, OpenSearch, OAI-PMH, SRU). Existing repositories of geospatial metadata can also be exposed, providing a standards-based metadata and catalogue component of spatial data infrastructures. pycsw is Open Source, released under an MIT license, and runs on all major platforms (Windows, Linux, Mac OS X). Please read the docs at https://pycsw.org/docs for more information.
https://pycsw.org
MIT License

Issues with pycsw mapping ISO-DIF #657

Open epifanio opened 3 years ago

epifanio commented 3 years ago

Description

Problem: mapping of ISO records to DIF (using GCMD DIF type/subtype vocabulary).

Given an ISO-compliant metadata record, the mapping to DIF goes wrong at several levels. Two examples are listed below.

Environment

Steps to Reproduce

Indexing the following ISO Record:

Results in the following DIF profile

The DIF output doesn't match the information available in the original ISO source.

Data Access

Currently the protocol values are copied unchanged from the ISO record instead of being mapped to the GCMD type/subtype vocabulary.

Current DIF output

<dif:Related_URL>
  <dif:URL_Content_Type>
    <dif:Type>OPENDAP:OPENDAP</dif:Type>
  </dif:URL_Content_Type>
  <dif:URL>opendap url</dif:URL> 
  <dif:Description>None</dif:Description>
</dif:Related_URL>

<dif:Related_URL>
  <dif:URL_Content_Type>
    <dif:Type>download</dif:Type>
  </dif:URL_Content_Type>
  <dif:URL>http download url</dif:URL>
  <dif:Description>None</dif:Description>
</dif:Related_URL>

Expected DIF9.7 output

<Related_URL>
  <URL_Content_Type>
    <Type>GET DATA</Type>
    <Subtype>OPENDAP DATA (DODS)</Subtype>
  </URL_Content_Type>
  <URL>opendapurl</URL>
</Related_URL>

<Related_URL>
  <URL_Content_Type>
    <Type>GET SERVICE</Type>
    <Subtype>GET WEB MAP SERVICE (WMS)</Subtype>
  </URL_Content_Type>
  <URL>wmsurl</URL>
</Related_URL>

<Related_URL>
  <URL_Content_Type>
    <Type>GET DATA</Type>
  </URL_Content_Type>
  <URL>Http download url</URL>
</Related_URL>
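The mapping implied by the two outputs above could be sketched as a simple lookup. This is only an illustration, not pycsw code: the names `GCMD_URL_CONTENT_TYPES` and `to_gcmd_url_content_type` are hypothetical, and `OGC:WMS` is an assumed ISO protocol string for the WMS link (the issue does not show the one actually stored).

```python
# Hypothetical lookup from ISO protocol strings to GCMD (Type, Subtype) pairs.
# 'OPENDAP:OPENDAP' and 'download' are taken from the current DIF output above;
# 'OGC:WMS' is an assumption about how the WMS link is labelled in ISO.
GCMD_URL_CONTENT_TYPES = {
    'OPENDAP:OPENDAP': ('GET DATA', 'OPENDAP DATA (DODS)'),
    'OGC:WMS': ('GET SERVICE', 'GET WEB MAP SERVICE (WMS)'),
    'download': ('GET DATA', None),
}


def to_gcmd_url_content_type(protocol):
    """Return a (Type, Subtype) tuple for an ISO protocol string.

    Matching is case-insensitive. Unknown protocols are passed through
    unchanged as the Type, with no Subtype.
    """
    if protocol is None:
        return (None, None)
    key = protocol.strip()
    for known, mapped in GCMD_URL_CONTENT_TYPES.items():
        if known.lower() == key.lower():
            return mapped
    return (key, None)
```

A Subtype of `None` would mean that no `Subtype` element is written, as in the plain HTTP download example.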

Dataset landing page

Current ISO output

<gmd:dataSetURI>
   <gco:CharacterString>Dataset landing page</gco:CharacterString>
</gmd:dataSetURI>

This should be mapped to a Related_URL with type DATASET LANDING PAGE.

Expected DIF output

<Related_URL>
  <URL_Content_Type>
    <Type>DATASET LANDING PAGE</Type>
  </URL_Content_Type>
  <URL>dataset landing page url</URL>
</Related_URL>
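As a rough sketch of that mapping, the landing page could be read from `gmd:dataSetURI` and written out as a `Related_URL`. The example uses the standard library's `xml.etree.ElementTree` rather than pycsw's own helpers, and the function name `landing_page_related_url` is made up for illustration; the output elements are left without a namespace, as in the expected output above.

```python
import xml.etree.ElementTree as ET

GMD = 'http://www.isotc211.org/2005/gmd'
GCO = 'http://www.isotc211.org/2005/gco'


def landing_page_related_url(iso_root):
    """Build a DIF Related_URL element from gmd:dataSetURI, or return None."""
    # './/' searches descendants only, so iso_root should be the record root.
    uri = iso_root.findtext(f'.//{{{GMD}}}dataSetURI/{{{GCO}}}CharacterString')
    if not uri or not uri.strip():
        return None
    related = ET.Element('Related_URL')
    content_type = ET.SubElement(related, 'URL_Content_Type')
    ET.SubElement(content_type, 'Type').text = 'DATASET LANDING PAGE'
    ET.SubElement(related, 'URL').text = uri.strip()
    return related
```

Records without a `gmd:dataSetURI` yield `None`, so the caller can simply skip the element.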

Current DIF output:

<dif:Data_Set_Citation>
   <dif:Dataset_Creator/>
   <dif:Dataset_Release_Date/>
   <dif:Dataset_Publisher/>
   <dif:Data_Presentation_Form/>
</dif:Data_Set_Citation>

Expected DIF output

<Data_Set_Citation>
   <Dataset_Creator>xx</Dataset_Creator>
   <Dataset_Title>xx</Dataset_Title>
   <Dataset_Release_Date>2017-02-23T00:00:00Z</Dataset_Release_Date>
   <Dataset_Publisher>xx</Dataset_Publisher>
...
   <Online_Resource>Dataset landing page URI</Online_Resource>
</Data_Set_Citation>

Additional Information

There are other issues related to how the ISO keywords are mapped to DIF, in particular the GCMD Science Keywords.

In ISO we have:

<?xml version="1.0"?>
<gmd:descriptiveKeywords>
  <gmd:MD_Keywords>
    <gmd:keyword>
      <gco:CharacterString>
EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Temperature &gt; Surface Temperature &gt; Air Temperature
</gco:CharacterString>
    </gmd:keyword>
    <gmd:keyword>
      <gco:CharacterString>
EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Winds &gt; Surface Winds
</gco:CharacterString>
    </gmd:keyword>
    <gmd:keyword>
      <gco:CharacterString>
EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Water Vapor
</gco:CharacterString>
    </gmd:keyword>
    <gmd:thesaurusName>
      <gmd:CI_Citation>
        <gmd:title>
          <gco:CharacterString>gcmd</gco:CharacterString>
        </gmd:title>
      </gmd:CI_Citation>
    </gmd:thesaurusName>
  </gmd:MD_Keywords>
</gmd:descriptiveKeywords>

see reference ISO

Since handling every thesaurus is too complicated, I would only handle the GCMD thesaurus, which means mapping each of its ISO keywords to a Parameters element with this structure:

<Parameters>
  <Category>EARTH SCIENCE</Category>
  <Topic>SPECTRAL/ENGINEERING</Topic>
  <Term>RADAR</Term>
  <Variable_Level_1>RADAR BACKSCATTER</Variable_Level_1>
</Parameters>

See http://metadata.nersc.no/oai?verb=ListRecords&metadataPrefix=dif for example

epifanio commented 3 years ago

Regarding the last part of the issue, the keywords: to distinguish between ISO keywords with and without a thesaurus_name, would it make sense to have a column (which can be empty) that specifies the 'dialect'/'flavour' of the ISO record (GCMD in my case)? The core code could then use some logic to distinguish between keywords with and without a thesaurus_name, which would affect the transformation into a specific output profile.
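To make the idea concrete, here is a minimal sketch of how keywords could be grouped by thesaurus title when reading the ISO record. It is not how pycsw currently stores keywords; it uses the standard library's `xml.etree.ElementTree`, and the function name `keywords_by_thesaurus` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    'gmd': 'http://www.isotc211.org/2005/gmd',
    'gco': 'http://www.isotc211.org/2005/gco',
}


def keywords_by_thesaurus(iso_root):
    """Map each thesaurus title (lower-cased, or None if absent) to its keywords."""
    grouped = {}
    for block in iso_root.iter('{%s}MD_Keywords' % NS['gmd']):
        title = block.findtext(
            'gmd:thesaurusName/gmd:CI_Citation/gmd:title/gco:CharacterString',
            namespaces=NS)
        key = title.strip().lower() if title and title.strip() else None
        for kw in block.findall('gmd:keyword/gco:CharacterString', NS):
            if kw.text and kw.text.strip():
                grouped.setdefault(key, []).append(kw.text.strip())
    return grouped
```

With the ISO snippet from the issue description, the three science keywords would end up under the key `'gcmd'`, while keywords from a block without `gmd:thesaurusName` would be collected under `None`.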

epifanio commented 3 years ago

I may have found a little hack to tune the output the way I needed, by modifying 'dif.py':

    # keywords
    val = util.getqattr(result, context.md_core_model['mappings']['pycsw:Keywords'])

    if val:
        for kw in val.split(','):
            # GCMD science keywords are hierarchical, with levels separated by '>'
            values = [v.strip() for v in kw.split('>')]
            if len(values) >= 3:
                # Category, Topic and Term are all needed to build dif:Parameters
                parameters = etree.SubElement(node, util.nspath_eval('dif:Parameters', NAMESPACES))
                etree.SubElement(parameters, util.nspath_eval('dif:Category', NAMESPACES)).text = values[0]
                etree.SubElement(parameters, util.nspath_eval('dif:Topic', NAMESPACES)).text = values[1]
                etree.SubElement(parameters, util.nspath_eval('dif:Term', NAMESPACES)).text = values[2]
                # any remaining levels become Variable_Level_1, Variable_Level_2, ...
                for i, v in enumerate(values[3:], start=1):
                    etree.SubElement(parameters, util.nspath_eval(f'dif:Variable_Level_{i}', NAMESPACES)).text = v
            else:
                # everything else is kept as a plain keyword
                etree.SubElement(node, util.nspath_eval('dif:Keywords', NAMESPACES)).text = kw

Note that this only works for my specific case, where I am sure that every GCMD keyword I need to parse uses the > symbol as the level separator.

The code above produces output such as:

<dif:Parameters>
    <dif:Category>Earth Science</dif:Category>
    <dif:Topic>Atmosphere</dif:Topic>
    <dif:Term>Atmospheric radiation</dif:Term>
    <dif:Variable_Level_1>Reflectance</dif:Variable_Level_1>
</dif:Parameters>

From an ISO keyword like:

<gmd:keyword>
    <gco:CharacterString>
        EARTH SCIENCE > Atmosphere > Atmospheric Winds > Surface Winds
    </gco:CharacterString>
</gmd:keyword>
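The splitting logic of the hack can also be tried outside pycsw as a small standalone helper. This is a sketch only: the name `gcmd_keyword_to_parameters` is hypothetical, and it assumes, as noted above, that `>` is always the level separator and that at least Category, Topic and Term are present.

```python
def gcmd_keyword_to_parameters(keyword):
    """Split an 'A > B > C > ...' keyword into DIF Parameters fields.

    Returns None when fewer than three levels are present, so the caller
    can fall back to a plain dif:Keywords element.
    """
    levels = [part.strip() for part in keyword.split('>') if part.strip()]
    if len(levels) < 3:
        return None
    fields = {'Category': levels[0], 'Topic': levels[1], 'Term': levels[2]}
    # Remaining levels become Variable_Level_1, Variable_Level_2, ...
    for i, level in enumerate(levels[3:], start=1):
        fields[f'Variable_Level_{i}'] = level
    return fields
```

For the keyword shown just above, this gives Category `EARTH SCIENCE`, Topic `Atmosphere`, Term `Atmospheric Winds` and Variable_Level_1 `Surface Winds`.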