Open TimAshelford opened 1 year ago
See also #53 related task.
Bit of context to the issue.
This pull request is a good start for background.
Previous work has taken a given xpath to return a single text value or a tuple of text values. This issue aims to return values from a given xpath with multiple tags & values to a dictionary-like structure, where the tag is the key and the string is the value.
Below is an update on the approach I've taken where I iteratively work through a list of xml tags to extract values from and append to a dictionary.
Firstly, import the libraries and parse the xml.
from typing import Dict, List, Union, Any
from lxml.etree import Element, parse
from metadata import _get_values, _get_xpath
xml = parse("tests/test_metadata/ramsar.xml")
root_element = xml.getroot()
namespaces = root_element.nsmap
Set the xpath elements to extract from.
GEOGRAPHIC_EXTENT_XPATH = [
    "gmd:identificationInfo",
    "gmd:MD_DataIdentification",
    "gmd:extent",
    "gmd:EX_Extent",
    "gmd:geographicElement",
    "gmd:EX_GeographicBoundingBox",
]
bbox_tags = [
    "gmd:westBoundLongitude",
    "gmd:eastBoundLongitude",
    "gmd:southBoundLatitude",
    "gmd:northBoundLatitude",
]
Here, I'm focusing on the geographic extent, where the tags to extract values from are in bbox_tags.
Next, define _get_nested_value.
def _get_nested_value(
    xpath: Union[str, List[str]],
    key_name: str,
    xpath_child=None,
):
    # Build the path from the tag down to its value
    xpath_tmp = _get_xpath([key_name, xpath_child])
    # Append it to the parent path (this assumes `xpath` is a list)
    xpath = _get_xpath(xpath + [xpath_tmp])
    values = _get_values(
        root_element=root_element,
        xpath=xpath,
        namespaces=namespaces,
    )
    # Remove empty responses from the returned tuple
    return [value for value in values if value]
The inputs here are:
xpath: a List of values to create an xpath above the tags (e.g. GEOGRAPHIC_EXTENT_XPATH);
key_name: string value of the tag to extract from (e.g. an entry of bbox_tags);
xpath_child: any subsequent path below the key_name argument to extract the value; in this example it would be gco:Decimal/text().
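To make the three inputs concrete, here is a minimal sketch of how they might combine into one xpath string. The helper join_xpath is hypothetical and only assumes that the steps are joined with "/"; the real _get_xpath in metadata may behave differently.

```python
from typing import List, Optional

GEOGRAPHIC_EXTENT_XPATH = [
    "gmd:identificationInfo",
    "gmd:MD_DataIdentification",
    "gmd:extent",
    "gmd:EX_Extent",
    "gmd:geographicElement",
    "gmd:EX_GeographicBoundingBox",
]


def join_xpath(
    parent: List[str],
    key_name: str,
    xpath_child: Optional[str] = None,
) -> str:
    """Hypothetical stand-in for _get_xpath: join the steps with '/'."""
    steps = [*parent, key_name]
    # Only add the child path when one is given
    if xpath_child:
        steps.append(xpath_child)
    return "/".join(steps)


full_path = join_xpath(
    GEOGRAPHIC_EXTENT_XPATH,
    key_name="gmd:westBoundLongitude",
    xpath_child="gco:Decimal/text()",
)
print(full_path)
```

Under that assumption, the path for the first bbox tag ends in gmd:EX_GeographicBoundingBox/gmd:westBoundLongitude/gco:Decimal/text().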
Now, define _get_dict:
def _get_dict(
    key_name: List[str],
    xpath: Union[str, List[str]],
    xpath_child=None,
):
    d = {}
    for x in key_name:
        # Use the tag name without its namespace prefix as the key
        key = x.split(":", 1)[1]
        value = _get_nested_value(
            key_name=x,
            xpath=xpath,
            xpath_child=xpath_child,
        )
        d[key] = value
    return d
This takes the same arguments as _get_nested_value, except that key_name is a list of tags rather than a single tag.
Get the bbox stored in a dictionary:
_get_dict(
    key_name=bbox_tags,
    xpath=GEOGRAPHIC_EXTENT_XPATH,
    xpath_child="gco:Decimal/text()",
)
Which should return:
{'westBoundLongitude': ['-6.41736'],
 'eastBoundLongitude': ['2.05827'],
 'southBoundLatitude': ['49.8625'],
 'northBoundLatitude': ['55.7447']}
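For anyone who wants to try the idea without the metadata module or ramsar.xml, here is a self-contained sketch of the same tag-to-dictionary loop. It uses the standard library's xml.etree.ElementTree instead of lxml; the inline XML is a trimmed, made-up stand-in for a real record, and tags_to_dict is a hypothetical name rather than the function in this repo.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for the bounding-box part of a record
SAMPLE = """\
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:EX_GeographicBoundingBox>
    <gmd:westBoundLongitude><gco:Decimal>-6.41736</gco:Decimal></gmd:westBoundLongitude>
    <gmd:eastBoundLongitude><gco:Decimal>2.05827</gco:Decimal></gmd:eastBoundLongitude>
  </gmd:EX_GeographicBoundingBox>
</gmd:MD_Metadata>
"""

NAMESPACES = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def tags_to_dict(root, parent_path, tags, child_path, namespaces):
    """Map each tag's local name to the list of text values found under it."""
    result = {}
    for tag in tags:
        # "gmd:westBoundLongitude" -> "westBoundLongitude"
        local_name = tag.split(":", 1)[1]
        path = "/".join([parent_path, tag, child_path])
        # ElementTree has no text() step, so read .text from each match
        result[local_name] = [
            el.text for el in root.findall(path, namespaces) if el.text
        ]
    return result


root = ET.fromstring(SAMPLE)
bbox = tags_to_dict(
    root,
    parent_path="gmd:EX_GeographicBoundingBox",
    tags=["gmd:westBoundLongitude", "gmd:eastBoundLongitude"],
    child_path="gco:Decimal",
    namespaces=NAMESPACES,
)
print(bbox)
```

Each value stays a list, matching the shape of the output above, so a tag that appears more than once would not be silently overwritten.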
One problem I'm having currently is when there are multiple tags below a given xpath which all have the same name; for example, gmd:distributionInfo/gmd:MD_Distribution/gmd:transferOptions/gmd:MD_DigitalTransferOptions/gmd:onLine can have multiple gmd:CI_OnlineResource links, each of which includes a Linkage/URL, protocol/CharacterString and name/CharacterString response underneath it. So how can we differentiate between tags with the same names?
Another issue is extracting from a namespace which is not defined in the XML file, such as trying to extract values from the TimePeriod tag. For example, the below will run...
xml = parse("tests/test_metadata/ramsar.xml")
root_element = xml.getroot()
namespaces = root_element.nsmap
TEMPORAL_EXTENT_XPATH = [
    "gmd:identificationInfo",
    "gmd:MD_DataIdentification",
    "gmd:extent",
    "gmd:EX_Extent",
    "gmd:temporalElement",
    "gmd:EX_TemporalExtent",
    "gmd:extent",
    "/text()",
]
_get_values(
    root_element=root_element,
    xpath=TEMPORAL_EXTENT_XPATH,
    namespaces=namespaces,
)
...and return...
('', '', '1970-01-01', '', '2099-12-31', '', '')
But as soon as you include the gml namespace...
TEMPORAL_EXTENT_XPATH = [
    "gmd:identificationInfo",
    "gmd:MD_DataIdentification",
    "gmd:extent",
    "gmd:EX_Extent",
    "gmd:temporalElement",
    "gmd:EX_TemporalExtent",
    "gmd:extent",
    "gml:TimePeriod",
    "/text()",
]
_get_values(
    root_element=root_element,
    xpath=TEMPORAL_EXTENT_XPATH,
    namespaces=namespaces,
)
...it returns the error...
XPathEvalError: Undefined namespace prefix
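One plausible cause, assuming ramsar.xml declares the gml prefix on a nested element rather than on the root: an element's nsmap in lxml only covers prefixes declared on that element or its ancestors, so root_element.nsmap would not contain gml. A possible workaround is to collect every prefix declared anywhere in the file before running the query. The sketch below shows the idea with the standard library's iterparse and "start-ns" events; the sample XML is made up and collect_namespaces is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: gml is declared on a nested element, not on the root
SAMPLE = """\
<gmd:extent xmlns:gmd="http://www.isotc211.org/2005/gmd">
  <gml:TimePeriod xmlns:gml="http://www.opengis.net/gml/3.2">
    <gml:beginPosition>1970-01-01</gml:beginPosition>
    <gml:endPosition>2099-12-31</gml:endPosition>
  </gml:TimePeriod>
</gmd:extent>
"""


def collect_namespaces(source):
    """Gather every prefix -> URI declaration found anywhere in the document."""
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        # Skip the default namespace (empty prefix); keep the first URI per prefix
        if prefix:
            namespaces.setdefault(prefix, uri)
    return namespaces


namespaces = collect_namespaces(io.BytesIO(SAMPLE.encode("utf-8")))
print(namespaces)

root = ET.fromstring(SAMPLE)
period = root.find("gml:TimePeriod", namespaces)
dates = {
    "beginPosition": period.findtext("gml:beginPosition", namespaces=namespaces),
    "endPosition": period.findtext("gml:endPosition", namespaces=namespaces),
}
print(dates)
```

If the prefix is genuinely absent from the file, the alternative would be to add the gml URI to the namespaces mapping by hand before calling _get_values.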
The following have a single example but need to return a dictionary:
Originally posted by @EFT-Defra in https://github.com/Defra-Data-Science-Centre-of-Excellence/sds-data-model/issues/35#issuecomment-1181543910