Defra-Data-Science-Centre-of-Excellence / sds-data-model

A common data model for the Spatial Data Science unit
https://defra-data-science-centre-of-excellence.github.io/sds-data-model/
MIT License

Parse Gemini 2.3 metadata: "Mandatory", multiple-value elements: dictionaries #54

Open TimAshelford opened 1 year ago

TimAshelford commented 1 year ago

The following have a single example but need to return a dictionary:

Originally posted by @EFT-Defra in https://github.com/Defra-Data-Science-Centre-of-Excellence/sds-data-model/issues/35#issuecomment-1181543910

TimAshelford commented 1 year ago

See also the related task, #53.

JordanPinder-Defra commented 1 year ago

A bit of context for this issue.

This pull request is a good start for background.

Relates to issues 35 and 53.

Previous work took a given XPath and returned either a single text value or a tuple of text values. This issue aims to handle an XPath that has several different child tags, returning their values in a dictionary-like structure where each tag is a key and its text is the value.

Below is an update on the approach I've taken: I iterate over a list of XML tags, extract the values for each one and add them to a dictionary.

First, import the libraries and load the XML.

from typing import Dict, List, Union, Any
from lxml.etree import Element, parse
from metadata import _get_values, _get_xpath

xml = parse("tests/test_metadata/ramsar.xml")
root_element = xml.getroot()
namespaces = root_element.nsmap

Define the XPath steps and the tags to extract from.

GEOGRAPHIC_EXTENT_XPATH = [
    "gmd:identificationInfo",
    "gmd:MD_DataIdentification",
    "gmd:extent",
    "gmd:EX_Extent",
    "gmd:geographicElement",
    "gmd:EX_GeographicBoundingBox",
]

bbox_tags = [
    "gmd:westBoundLongitude",
    "gmd:eastBoundLongitude",
    "gmd:southBoundLatitude",
    "gmd:northBoundLatitude",
]

Here, I'm focusing on the geographic extent where the tags to extract values from are in bbox_tags.

Next, define _get_nested_value.

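def _get_nested_value(
    xpath: Union[str, List[str]],
    key_name: str,
    xpath_child: Union[str, None] = None,
) -> List[str]:
    # Accept either a single XPath string or a list of steps.
    parent_steps = [xpath] if isinstance(xpath, str) else list(xpath)

    # Build the path for the tag of interest and, if given, the path below it.
    child_steps = [key_name] if xpath_child is None else [key_name, xpath_child]
    child_xpath = _get_xpath(child_steps)

    # Append the child path to the parent steps.
    full_xpath = _get_xpath(parent_steps + [child_xpath])

    # root_element and namespaces are the module-level values defined above.
    values = _get_values(
        root_element=root_element,
        xpath=full_xpath,
        namespaces=namespaces,
    )

    # Drop empty strings from the returned tuple.
    return [value for value in values if value]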

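The inputs here are xpath (the parent XPath, as a string or a list of steps), key_name (the tag to extract from, e.g. "gmd:westBoundLongitude") and xpath_child (an optional path below that tag, e.g. "gco:Decimal/text()").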

Now define _get_dict:

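def _get_dict(
    key_name: List[str],
    xpath: Union[str, List[str]],
    xpath_child: Union[str, None] = None,
) -> Dict[str, List[str]]:
    d = {}

    for tag in key_name:
        # Use the tag name without its namespace prefix as the key,
        # e.g. "gmd:westBoundLongitude" -> "westBoundLongitude".
        key = tag.split(":", 1)[-1]
        d[key] = _get_nested_value(
            key_name=tag,
            xpath=xpath,
            xpath_child=xpath_child,
        )

    return d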

This takes the same arguments as _get_nested_value, except that key_name is a list of tags rather than a single tag.

Get the bbox stored in a dictionary:

_get_dict(
    key_name=bbox_tags,
    xpath=GEOGRAPHIC_EXTENT_XPATH,
    xpath_child="gco:Decimal/text()"
    )

Which should return:

{'westBoundLongitude': ['-6.41736'],
 'eastBoundLongitude': ['2.05827'],
 'southBoundLatitude': ['49.8625'],
 'northBoundLatitude': ['55.7447']}
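
For anyone who wants to try the idea without the repository's helpers, here is a minimal self-contained sketch. It uses the standard library's xml.etree.ElementTree in place of lxml and the private _get_xpath/_get_values functions, so it is an approximation of the approach above rather than the project's implementation. SAMPLE_XML, NAMESPACES and get_tag_dict are hypothetical names, and the XML is a trimmed stand-in, not the real ramsar.xml.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical, trimmed stand-in for a Gemini 2.3 record (not the real ramsar.xml).
SAMPLE_XML = """\
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo><gmd:MD_DataIdentification><gmd:extent><gmd:EX_Extent>
    <gmd:geographicElement><gmd:EX_GeographicBoundingBox>
      <gmd:westBoundLongitude><gco:Decimal>-6.41736</gco:Decimal></gmd:westBoundLongitude>
      <gmd:eastBoundLongitude><gco:Decimal>2.05827</gco:Decimal></gmd:eastBoundLongitude>
      <gmd:southBoundLatitude><gco:Decimal>49.8625</gco:Decimal></gmd:southBoundLatitude>
      <gmd:northBoundLatitude><gco:Decimal>55.7447</gco:Decimal></gmd:northBoundLatitude>
    </gmd:EX_GeographicBoundingBox></gmd:geographicElement>
  </gmd:EX_Extent></gmd:extent></gmd:MD_DataIdentification></gmd:identificationInfo>
</gmd:MD_Metadata>
"""

NAMESPACES = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

PARENT_STEPS = [
    "gmd:identificationInfo",
    "gmd:MD_DataIdentification",
    "gmd:extent",
    "gmd:EX_Extent",
    "gmd:geographicElement",
    "gmd:EX_GeographicBoundingBox",
]

TAGS = [
    "gmd:westBoundLongitude",
    "gmd:eastBoundLongitude",
    "gmd:southBoundLatitude",
    "gmd:northBoundLatitude",
]


def get_tag_dict(
    root: ET.Element,
    parent_steps: List[str],
    tags: List[str],
    child: str,
    namespaces: Dict[str, str],
) -> Dict[str, List[str]]:
    """Map each tag (without its prefix) to the non-empty text values found under it."""
    result: Dict[str, List[str]] = {}
    for tag in tags:
        path = "/".join([*parent_steps, tag, child])
        key = tag.split(":", 1)[-1]
        result[key] = [
            element.text.strip()
            for element in root.findall(path, namespaces)
            if element.text and element.text.strip()
        ]
    return result


root = ET.fromstring(SAMPLE_XML)
bbox = get_tag_dict(root, PARENT_STEPS, TAGS, "gco:Decimal", NAMESPACES)
print(bbox)
```

Note that ElementTree selects the gco:Decimal element and reads its .text, whereas the lxml version above selects the text node directly with text().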

One problem I currently have is with multiple tags of the same name below a given XPath. For example, gmd:distributionInfo/gmd:MD_Distribution/gmd:transferOptions/gmd:MD_DigitalTransferOptions/gmd:onLine can hold multiple gmd:CI_OnlineResource links, each with its own Linkage/URL, protocol/CharacterString and name/CharacterString underneath. With a single dictionary keyed on tag name, these repeated entries would overwrite or merge into each other, so how can we differentiate between tags with the same names?
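
One possible way to handle repeated elements, offered as a sketch rather than a settled design, is to select each repeated parent first and build one dictionary per parent, returning a list of dictionaries. The example below uses the standard library's xml.etree.ElementTree instead of lxml so that it is self-contained. SAMPLE_XML, FIELD_PATHS, get_online_resources and the example.org URLs are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical fragment with two online resources that share the same tag names.
SAMPLE_XML = """\
<gmd:MD_DigitalTransferOptions xmlns:gmd="http://www.isotc211.org/2005/gmd"
                               xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:onLine>
    <gmd:CI_OnlineResource>
      <gmd:linkage><gmd:URL>https://example.org/download</gmd:URL></gmd:linkage>
      <gmd:protocol><gco:CharacterString>HTTP</gco:CharacterString></gmd:protocol>
      <gmd:name><gco:CharacterString>Download</gco:CharacterString></gmd:name>
    </gmd:CI_OnlineResource>
  </gmd:onLine>
  <gmd:onLine>
    <gmd:CI_OnlineResource>
      <gmd:linkage><gmd:URL>https://example.org/wms</gmd:URL></gmd:linkage>
      <gmd:protocol><gco:CharacterString>OGC:WMS</gco:CharacterString></gmd:protocol>
      <gmd:name><gco:CharacterString>Map service</gco:CharacterString></gmd:name>
    </gmd:CI_OnlineResource>
  </gmd:onLine>
</gmd:MD_DigitalTransferOptions>
"""

NAMESPACES = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Output key -> path relative to each gmd:CI_OnlineResource element.
FIELD_PATHS = {
    "linkage": "gmd:linkage/gmd:URL",
    "protocol": "gmd:protocol/gco:CharacterString",
    "name": "gmd:name/gco:CharacterString",
}


def get_online_resources(
    root: ET.Element,
    namespaces: Dict[str, str],
) -> List[Dict[str, str]]:
    """Return one dictionary per gmd:CI_OnlineResource, in document order."""
    resources = []
    for resource in root.findall(".//gmd:CI_OnlineResource", namespaces):
        # Paths are evaluated relative to this resource, so values from
        # sibling resources cannot leak into the same dictionary.
        resources.append(
            {
                key: resource.findtext(path, default="", namespaces=namespaces).strip()
                for key, path in FIELD_PATHS.items()
            }
        )
    return resources


root = ET.fromstring(SAMPLE_XML)
online_resources = get_online_resources(root, NAMESPACES)
for entry in online_resources:
    print(entry)
```

Keeping each resource in its own dictionary preserves which URL belongs to which protocol and name, at the cost of the return type becoming a list rather than a single dictionary.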

Another issue is extracting values under a namespace prefix that is missing from namespaces (which here is just root_element.nsmap), such as trying to extract values from the gml:TimePeriod tag. For example, the code below will run...

xml = parse("tests/test_metadata/ramsar.xml")
root_element = xml.getroot()
namespaces = root_element.nsmap

TEMPORAL_EXTENT_XPATH = [
    "gmd:identificationInfo",
    "gmd:MD_DataIdentification",
    "gmd:extent",
    "gmd:EX_Extent",
    "gmd:temporalElement",
    "gmd:EX_TemporalExtent",
    "gmd:extent",
    "/text()"
    ]

_get_values(
    root_element=root_element,
    xpath=TEMPORAL_EXTENT_XPATH,
    namespaces=namespaces,
    )

...and return...

('', '', '1970-01-01', '', '2099-12-31', '', '')

But as soon as you include the gml namespace...


TEMPORAL_EXTENT_XPATH = [
    "gmd:identificationInfo",
    "gmd:MD_DataIdentification",
    "gmd:extent",
    "gmd:EX_Extent",
    "gmd:temporalElement",
    "gmd:EX_TemporalExtent",
    "gmd:extent",
    "gml:TimePeriod",
    "/text()"
    ]

_get_values(
    root_element=root_element,
    xpath=TEMPORAL_EXTENT_XPATH,
    namespaces=namespaces,
    )

...it returns the error...

XPathEvalError: Undefined namespace prefix
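
One possible explanation, stated as an assumption rather than a confirmed diagnosis of ramsar.xml: root_element.nsmap only holds the prefixes in scope at the root, so a prefix declared further down the tree (for instance on gml:TimePeriod itself) never reaches the namespaces argument. The sketch below collects every prefix declared anywhere in the document before querying. It uses the standard library's xml.etree.ElementTree rather than lxml so that it is self-contained; SAMPLE_XML and collect_namespaces are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical fragment: gmd is declared on the root, gml only on gml:TimePeriod.
SAMPLE_XML = """\
<gmd:EX_TemporalExtent xmlns:gmd="http://www.isotc211.org/2005/gmd">
  <gmd:extent>
    <gml:TimePeriod xmlns:gml="http://www.opengis.net/gml/3.2" gml:id="t1">
      <gml:beginPosition>1970-01-01</gml:beginPosition>
      <gml:endPosition>2099-12-31</gml:endPosition>
    </gml:TimePeriod>
  </gmd:extent>
</gmd:EX_TemporalExtent>
"""


def collect_namespaces(xml_text: str) -> Dict[str, str]:
    """Collect every prefix -> URI declaration in the document, not just the root's."""
    namespaces: Dict[str, str] = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    # "start-ns" events yield a (prefix, uri) pair for each xmlns declaration.
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        if prefix:  # skip the default namespace, which has no usable prefix
            namespaces.setdefault(prefix, uri)  # keep the first declaration seen
    return namespaces


all_namespaces = collect_namespaces(SAMPLE_XML)
root = ET.fromstring(SAMPLE_XML)

period = root.find("gmd:extent/gml:TimePeriod", all_namespaces)
temporal_extent = {
    "beginPosition": period.findtext("gml:beginPosition", namespaces=all_namespaces),
    "endPosition": period.findtext("gml:endPosition", namespaces=all_namespaces),
}
print(all_namespaces)
print(temporal_extent)
```

With lxml, the analogous step would be to merge element.nsmap across root_element.iter() and drop any None key (the default namespace), since XPath cannot use an empty prefix. Whether that resolves the error here depends on where the gml prefix is actually declared in the file.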