MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

xml2kvp: XPath to mapped field #217

Open ghukill opened 6 years ago

ghukill commented 6 years ago

The configurations allow for a great amount of control and shaping of fields, but I anticipate instances where specific fields are desired and the rules needed to capture them would impede or create other, undesirable fields.

Say, for instance, that the XML is littered with attributes that are not of interest. But there are two fields, foo type="primary" and foo type="preview", that would be helpful to map separately. By ignoring all attributes, we would lose the distinction between the two and would get a single multivalued foo field only.

It would be nice to offer an argument for precise mappings, applied before or after all other parameters. Some possible options...

XPath

Provide a dictionary of XPath expression : target field name pairs, e.g.:

`//foo[@type='primary']` : `foo_primary`,
`//foo[@type='preview']` : `foo_preview`

If these were run before normal mapping, all elements that match the XPath expressions could be removed before normal parsing, on the grounds that they had already been captured (or this could key off a flag like remove_copied_key, which does something similar for copy_to and copy_to_regex).

This would dramatically slow down the process, as the XML would have to be parsed by means other than xmltodict, which currently powers xml2kvp. But it might be worth the precision, and it would be optional.
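A minimal sketch of how the XPath pass might work, not an actual xml2kvp implementation. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so a leading `//` is rewritten to `.//`. The function name `xpath_to_fields`, the `remove_matched` flag (standing in for something like remove_copied_key), and the sample record are all hypothetical.

```python
import xml.etree.ElementTree as ET


def xpath_to_fields(xml_string, xpath_mappings, remove_matched=False):
    """Hypothetical helper: map XPath matches to named fields.

    Returns (fields, root), where root is the parsed tree, minus the
    matched elements when remove_matched is True.
    """
    root = ET.fromstring(xml_string)
    # ElementTree elements have no parent pointer, so build a lookup
    parents = {child: parent for parent in root.iter() for child in parent}

    fields = {}
    for xpath, field_name in xpath_mappings.items():
        # ElementTree cannot evaluate absolute paths on an element
        path = "." + xpath if xpath.startswith("/") else xpath
        for el in root.findall(path):
            value = (el.text or "").strip()
            if value:
                fields.setdefault(field_name, []).append(value)
            if remove_matched and el in parents:
                # drop the element so the normal parse does not see it again
                parents[el].remove(el)

    # single values are stored as strings, repeated values as lists
    fields = {k: v[0] if len(v) == 1 else v for k, v in fields.items()}
    return fields, root


# Assumed sample record, for illustration only
xml = (
    "<record>"
    "<title>Example</title>"
    '<foo type="primary">http://example.org/full.jpg</foo>'
    '<foo type="preview">http://example.org/thumb.jpg</foo>'
    "</record>"
)
mappings = {
    "//foo[@type='primary']": "foo_primary",
    "//foo[@type='preview']": "foo_preview",
}
fields, remaining = xpath_to_fields(xml, mappings, remove_matched=True)
print(fields)
```

The returned `remaining` tree could then be serialized and handed to the normal xmltodict-based parse, so the captured elements are not mapped a second time.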

Cherry-pick fields pre-processing

Run a preliminary xml_to_kvp() with defaults, then allow the user to cherry-pick from the resulting keys before normal processing.

To avoid an explosion of arguments, we could conceivably create a new argument group along the lines of pre_processing that could house select arguments like copy_to or copy_to_regex. It could look something like:

"pre_processing": {
    "copy_to":{
        "foo___@type=primary":"foo_primary",
        "foo___@type=preview":"foo_preview"
    }
}

If pre_processing is not None, proceed with a preliminary, default parse, which should expose all combinations of elements and attributes and so allow this kind of cherry-picking.
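A rough sketch of the cherry-picking step, assuming the preliminary parse yields a flat dictionary keyed the way the example above suggests. The `prelim` dictionary is a hand-written stand-in for that output, and `apply_pre_processing` is a hypothetical name, not an existing xml2kvp function.

```python
def apply_pre_processing(prelim_kvp, pre_processing):
    """Hypothetical helper: copy selected keys from a preliminary parse."""
    if pre_processing is None:
        # no pre-processing requested, nothing is cherry-picked
        return {}
    picked = {}
    for source_key, target_key in pre_processing.get("copy_to", {}).items():
        # only copy keys that the preliminary parse actually produced
        if source_key in prelim_kvp:
            picked[target_key] = prelim_kvp[source_key]
    return picked


# Stand-in for the output of a default, attribute-preserving parse (assumed shape)
prelim = {
    "title": "Example",
    "foo___@type=primary": "http://example.org/full.jpg",
    "foo___@type=preview": "http://example.org/thumb.jpg",
}
pre_processing = {
    "copy_to": {
        "foo___@type=primary": "foo_primary",
        "foo___@type=preview": "foo_preview",
    }
}
picked = apply_pre_processing(prelim, pre_processing)
print(picked)
```

The picked fields would then be merged into the result of the normal, configured parse.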

ghukill commented 6 years ago

With the newly created boolean include_all_attributes and the list of attributes to include, include_attributes, the second approach of "cherry-picking" doesn't look quite as appealing anymore. And some of that functionality is already covered by copy_to and copy_to_regex.

However, the XPath approach still seems like it might be helpful. Keeping the issue open and renaming it.