MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

xml2kvp: split multivalued values to multiple fields #222

Open ghukill opened 6 years ago

ghukill commented 6 years ago

Concatenating multivalued fields into a single value has been addressed; this proposes instead to create multiple fields from a multivalued value, one field per value, each suffixed with a numeric index.

e.g. mods_subject_topic : ['horse','goober','tronic'], would convert to three fields:

'mods_subject_topic_0' : 'horse',
'mods_subject_topic_1' : 'goober',
'mods_subject_topic_2' : 'tronic',

The same could perhaps be done for a single string value, by splitting it on a delimiter.
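A minimal sketch of the proposed split, operating on an already-flattened dictionary rather than inside XML2kvp itself. The function name `split_multivalued`, its `fields` and `sep` parameters, and the `mods_title` key are hypothetical and only for illustration:

```python
def split_multivalued(kvp, fields=None, sep="_"):
    """Return a new dict where list/tuple values become numbered fields.

    `fields` limits splitting to the named keys; None means every
    multivalued key is split. All names here are hypothetical.
    """
    out = {}
    for key, value in kvp.items():
        is_multi = isinstance(value, (list, tuple))
        if is_multi and (fields is None or key in fields):
            # one output field per value, suffixed with its position
            for i, item in enumerate(value):
                out[f"{key}{sep}{i}"] = item
        else:
            # single values (and fields not selected) pass through unchanged
            out[key] = value
    return out


record = {
    "mods_subject_topic": ["horse", "goober", "tronic"],
    "mods_title": "Example",  # hypothetical single-valued field
}
split = split_multivalued(record)
```

Passing `fields=["mods_subject_topic"]` would restrict the split to that one key, which matches the "boolean for all, or an array of fields" idea.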

ghukill commented 6 years ago

This touches on a closed issue (as mentioned there as well): https://github.com/WSULib/combine/issues/44.

The problem there was "blocks" of elements that were related based on their nesting, e.g.:

<mods:subject>
    <mods:topic>Labor unions</mods:topic>
    <mods:geographic>Michigan</mods:geographic>
    <mods:geographic>Saginaw</mods:geographic>
    <mods:temporal>1800-1810</mods:temporal>
</mods:subject>
<mods:subject>
    <mods:topic>Strikes</mods:topic>
    <mods:geographic>Michigan</mods:geographic>
    <mods:geographic>Hillsdale</mods:geographic>
    <mods:temporal>1930-1940</mods:temporal>
</mods:subject>

In this example, it is the relationship of all siblings under a <mods:subject> that would be helpful to maintain.

A default XML2kvp parse, removing namespace prefixes, would result in:

In [2]: XML2kvp.xml_to_kvp(test_xml, remove_ns_prefix=True)
Out[2]: 
{'root_subject_geographic': ('Michigan', 'Saginaw', 'Hillsdale'),
 'root_subject_temporal': ('1800-1810', '1930-1940'),
 'root_subject_topic': ('Labor unions', 'Strikes')}

The problem here is that the elements grouped under each <mods:subject> are cherry-picked into separate fields, with little ability to relate them at a glance. Ideally, we could generate a new field, concatenating values from the others, that would look something like:

{'root_subject':['Labor unions--Michigan--Saginaw--1800-1810', 'Strikes--Michigan--Hillsdale--1930-1940']} 

We lose the knowledge that Saginaw is geographic, or that 1800-1810 is temporal, but that particular string has value in other contexts, and we could keep those other, further parsed fields as well.
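As a rough sketch of that concatenation, done directly against the XML with the standard library rather than through XML2kvp: the `<root>` wrapper, the `concat_subject_blocks` name, and the `delimiter` parameter are assumptions for illustration, since the original `test_xml` is not shown here.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Assumed test document: the two subject blocks above inside a <root> wrapper.
test_xml = f"""
<root xmlns:mods="{MODS_NS}">
  <mods:subject>
    <mods:topic>Labor unions</mods:topic>
    <mods:geographic>Michigan</mods:geographic>
    <mods:geographic>Saginaw</mods:geographic>
    <mods:temporal>1800-1810</mods:temporal>
  </mods:subject>
  <mods:subject>
    <mods:topic>Strikes</mods:topic>
    <mods:geographic>Michigan</mods:geographic>
    <mods:geographic>Hillsdale</mods:geographic>
    <mods:temporal>1930-1940</mods:temporal>
  </mods:subject>
</root>
""".strip()


def concat_subject_blocks(xml_string, delimiter="--"):
    """Join the child values of each <mods:subject> in document order."""
    root = ET.fromstring(xml_string)
    blocks = []
    for subject in root.iter(f"{{{MODS_NS}}}subject"):
        # keep only children that carry non-empty text
        parts = [
            child.text.strip()
            for child in subject
            if child.text and child.text.strip()
        ]
        blocks.append(delimiter.join(parts))
    return blocks


result = {"root_subject": concat_subject_blocks(test_xml)}
```

Because the join happens per `<mods:subject>` element, sibling order is preserved and values from different blocks never mix.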

One thought has been to offer splitting of a field when it is multivalued (https://github.com/WSULib/combine/issues/222). If this were a boolean applied to all fields, or an array of fields to split, you might get something like:

{'root_subject0_topic0': ('Labor unions'),
 'root_subject1_topic1': ('Strikes'),
 'root_subject0_geographic0': ('Michigan'),
 'root_subject1_geographic1': ('Michigan'),
 'root_subject0_geographic2': ('Saginaw'),
 'root_subject1_geographic3': ('Hillsdale'),
 'root_subject0_temporal0': ('1800-1810'),
 'root_subject1_temporal1': ('1930-1940')}

The trick would be to "collapse" these field names with indexes into something useful...
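One possible reading of "collapse", sketched under assumptions: treat the index after the parent element as a block id, group fields by it, and drop the trailing index on the leaf name. The regex, the `collapse_indexed_fields` name, and the output shape are all hypothetical; they only fit field names of the form shown above.

```python
import re

# <parent><block index>_<leaf><position>, e.g. root_subject0_geographic2
FIELD_RE = re.compile(
    r"^(?P<parent>.+?)(?P<block>\d+)_(?P<leaf>[^_\d]+)(?P<pos>\d+)$"
)


def collapse_indexed_fields(kvp):
    """Group indexed fields back into one dict per parent block."""
    parsed = []
    for key, value in kvp.items():
        m = FIELD_RE.match(key)
        if m is None:
            continue  # non-indexed fields are ignored in this sketch
        parsed.append(
            (m["parent"], int(m["block"]), int(m["pos"]), m["leaf"], value)
        )
    # order by parent, block, then position so values keep their sequence
    parsed.sort(key=lambda t: t[:3])
    blocks = {}
    for parent, block, _pos, leaf, value in parsed:
        blocks.setdefault((parent, block), {}).setdefault(leaf, []).append(value)
    return blocks


indexed = {
    "root_subject0_topic0": "Labor unions",
    "root_subject1_topic1": "Strikes",
    "root_subject0_geographic0": "Michigan",
    "root_subject1_geographic1": "Michigan",
    "root_subject0_geographic2": "Saginaw",
    "root_subject1_geographic3": "Hillsdale",
    "root_subject0_temporal0": "1800-1810",
    "root_subject1_temporal1": "1930-1940",
}
collapsed = collapse_indexed_fields(indexed)
```

Each entry of `collapsed` keeps the element type (topic, geographic, temporal) alongside the block it came from, so a concatenated string per block could still be derived from it later.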