MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

xml2kvp: split multivalued on delimiter #361

Closed ghukill closed 5 years ago

ghukill commented 5 years ago

When parsing kvp --> XML, build in functionality to split values that are intended to be multivalued.

Relevant line: https://github.com/WSULib/combine/blob/spreadsheetharvest/core/xml2kvp.py#L777

And/or consider handling multi-row records, similar to OpenRefine?

ghukill commented 5 years ago

More information on the problem.

Target XML:

<mods:subject>
      <mods:geographic>Florida</mods:geographic>
      <mods:topic>History</mods:topic>
      <mods:temporal>Huguenot colony, 1562-1565</mods:temporal>
      <mods:topic>Fiction</mods:topic>
   </mods:subject>

Results from a CSV harvest to XML; note the comma-joined value in the mods:topic element:

<mods:subject>
    <mods:geographic>Florida</mods:geographic>
    <mods:temporal>Huguenot colony, 1562-1565</mods:temporal>
    <mods:topic>History,Fiction</mods:topic>
  </mods:subject>
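
The gap between the target and this result comes down to emitting one element per value instead of one element holding the joined string. A minimal sketch of that step with the standard library follows; `build_subject` is a hypothetical helper for illustration, not code from xml2kvp. It groups values by tag, so it does not reproduce the interleaved sibling order of the target.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"  # MODS v3 namespace
ET.register_namespace("mods", MODS_NS)


def build_subject(values_by_tag):
    """Hypothetical helper: build mods:subject with one child per value."""
    subject = ET.Element(f"{{{MODS_NS}}}subject")
    for tag, values in values_by_tag.items():
        for value in values:
            # One element per value, never a joined string.
            child = ET.SubElement(subject, f"{{{MODS_NS}}}{tag}")
            child.text = value
    return subject


subject = build_subject({
    "geographic": ["Florida"],
    "temporal": ["Huguenot colony, 1562-1565"],
    "topic": ["History", "Fiction"],
})
xml_out = ET.tostring(subject, encoding="unicode")
print(xml_out)
```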

Similarly, even for JSON harvests, note the value still in list form:

<mods:subject>
    <mods:temporal>Huguenot colony, 1562-1565</mods:temporal>
    <mods:geographic>Florida</mods:geographic>
    <mods:topic>["History","Fiction"]</mods:topic>
  </mods:subject>

ghukill commented 5 years ago

This also plainly reveals that while sibling elements are retained, their order is not preserved, and the same may be true for other elements. This will either remain a limitation of the tabular --> XML harvest, or be handled in this newly created issue: https://github.com/WSULib/combine/issues/364

ghukill commented 5 years ago

The offending kvp dictionary looks like this, still in tuple form:

'mods|mods(e92b01)___mods|subject(97c301)___mods|topic(cea101)': ('History',
  'Fiction'),

The parsed JSON looks just about as good, with the value as a JSON array:

 "mods|mods(575801)___mods|subject(97c301)___mods|topic(123301)": [
        "History",
        "Fiction"
    ],

In CSV/Excel, the value for mods|mods(575801)___mods|subject(97c301)___mods|topic(123301) is:

History,Fiction

The heart of the problem is that when CSV or JSON is read in Spark, the values are stored as strings:

In [24]: r['mods|mods(575801)___mods|subject(97c301)___mods|topic(123301)']
Out[24]: '["History","Fiction"]'

ghukill commented 5 years ago

The solution has been to use ast.literal_eval when incoming values might come from JSON Lines and contain lists or tuples, and to add a multivalue_delim arg to xml2kvp that allows an alternative to the comma (,) when splitting multivalued cells in CSV.
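
As a rough illustration of that approach (a sketch only, not the actual xml2kvp code), a value can be normalized to a list by first trying `ast.literal_eval` on strings that look like a list or tuple, and otherwise splitting on the delimiter. The function name `split_multivalued` is an assumption made for this example; only `ast.literal_eval` and the `multivalue_delim` name come from the discussion above.

```python
import ast


def split_multivalued(value, multivalue_delim=","):
    """Hypothetical helper: return a list of values for one kvp cell."""
    # Already a real list or tuple (e.g. the in-memory kvp dictionary).
    if isinstance(value, (list, tuple)):
        return list(value)

    text = value.strip()

    # Strings such as '["History","Fiction"]' or "('History', 'Fiction')"
    # are stringified lists/tuples; try to evaluate them safely.
    if text[:1] in ("[", "("):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            parsed = None
        if isinstance(parsed, (list, tuple)):
            return [str(v) for v in parsed]

    # Otherwise treat the cell as delimiter-separated text.
    return [part.strip() for part in text.split(multivalue_delim)]


from_json = split_multivalued('["History","Fiction"]')
from_csv = split_multivalued("History,Fiction")
from_pipe = split_multivalued("History|Fiction", multivalue_delim="|")
```

All three calls yield the same two values, which can then be written out as separate elements.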

Adding multivalue_delim required forking and modifying the es2csv library; the fork will now be what is installed during server provisioning. This is also noted in the upgrade notes. The modification allows a kibana-delimiter argument that, when matched with the multivalue_delim from xml2kvp, provides fairly reliable roundtripping of data in JSON or CSV.
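
To show why the delimiter has to match on both sides and why a comma is a fragile choice, here is a small sketch. The two helpers are hypothetical and stand in for the export and import steps; they do not call es2csv or xml2kvp, the pipe character is just an assumed alternative delimiter, and the values are borrowed from the examples above purely for illustration.

```python
def join_values(values, delim):
    """Stand-in for the export side: join a multivalued field into one cell."""
    return delim.join(values)


def split_values(cell, delim):
    """Stand-in for the import side: split the cell back into values."""
    return cell.split(delim)


# One of these values contains a comma of its own.
values = ["Huguenot colony, 1562-1565", "Florida"]

# With a comma delimiter the embedded comma is split as well.
comma_roundtrip = split_values(join_values(values, ","), ",")

# With a delimiter that does not occur in the data, the values survive.
pipe_roundtrip = split_values(join_values(values, "|"), "|")

print(len(comma_roundtrip), len(pipe_roundtrip))
```

The roundtrip is only as reliable as the guarantee that the chosen delimiter never appears inside a value, which fits the "fairly reliable" wording above.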

Closing this issue.