Closed ghukill closed 5 years ago
More information on the problem.
Target XML:
<mods:subject>
<mods:geographic>Florida</mods:geographic>
<mods:topic>History</mods:topic>
<mods:temporal>Huguenot colony, 1562-1565</mods:temporal>
<mods:topic>Fiction</mods:topic>
</mods:subject>
Results from the CSV harvest to XML; note the comma-joined value in `mods:topic`:
<mods:subject>
<mods:geographic>Florida</mods:geographic>
<mods:temporal>Huguenot colony, 1562-1565</mods:temporal>
<mods:topic>History,Fiction</mods:topic>
</mods:subject>
Similarly, even for JSON harvests, note the value still in list form:
<mods:subject>
<mods:temporal>Huguenot colony, 1562-1565</mods:temporal>
<mods:geographic>Florida</mods:geographic>
<mods:topic>["History","Fiction"]</mods:topic>
</mods:subject>
This also reveals that while sibling groupings are maintained, the order of siblings (and possibly of other elements) is not. This will either be a limitation of the tabular --> XML harvest, or be handled in this newly created issue: https://github.com/WSULib/combine/issues/364
The offending kvp dictionary looks like this, still in tuple form:
'mods|mods(e92b01)___mods|subject(97c301)___mods|topic(cea101)': ('History',
'Fiction'),
The parsed JSON looks equally sound, with the values in a JSON array:
"mods|mods(575801)___mods|subject(97c301)___mods|topic(123301)": [
"History",
"Fiction"
],
In CSV/Excel, the value for `mods|mods(575801)___mods|subject(97c301)___mods|topic(123301)` is:
History,Fiction
The heart of the problem is that when CSV or JSON is read into Spark, the values are stored as strings:
In [24]: r['mods|mods(575801)___mods|subject(97c301)___mods|topic(123301)']
Out[24]: '["History","Fiction"]'
The solution has been to use `ast.literal_eval` when incoming values might come from JSON lines and include lists or tuples, and to add a `multivalue_delim` arg to xml2kvp that allows an alternative to the comma `,` when splitting multivalued cells in CSV.
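A minimal sketch of that approach, not the actual xml2kvp code: `parse_multivalue` is a hypothetical helper, and only `ast.literal_eval` and the `multivalue_delim` name come from the fix described above. It tries to read a list or tuple literal first, then falls back to splitting on the delimiter if one is given.

```python
import ast


def parse_multivalue(raw, multivalue_delim=None):
    """Return a list of values for one cell (hypothetical helper, not xml2kvp's API)."""
    # Values that are already lists or tuples just need normalizing
    if isinstance(raw, (list, tuple)):
        return list(raw)
    if not isinstance(raw, str):
        return [raw]

    text = raw.strip()
    # Values from JSON lines can arrive as the string form of a list or tuple
    if text[:1] in ("[", "("):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            parsed = None  # not a valid Python literal; treat as plain text
        if isinstance(parsed, (list, tuple)):
            return [str(value) for value in parsed]

    # CSV cells: split only on the explicitly configured delimiter
    if multivalue_delim and multivalue_delim in raw:
        return [part.strip() for part in raw.split(multivalue_delim)]

    return [raw]


topics = parse_multivalue('["History","Fiction"]')
temporal = parse_multivalue("Huguenot colony, 1562-1565", multivalue_delim="|")
```

With no delimiter configured, a cell such as `History,Fiction` is left as a single value, which avoids wrongly splitting values like `Huguenot colony, 1562-1565` that legitimately contain a comma.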
Adding `multivalue_delim` required forking and modifying the es2csv library; the fork will now serve as the install during server provisioning, which is also noted in the upgrade notes. The modification adds a `kibana-delimiter` argument that, when matched with the `multivalue_delim` from xml2kvp, provides fairly reliable roundtripping of data in JSON or CSV.
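Why the delimiters have to match can be shown with a small round trip. This is an illustrative sketch only: `join_values` and `split_values` are hypothetical names, the `|` delimiter is an assumed choice, and none of this reproduces the es2csv fork or xml2kvp. The sample values are taken from the record above but combined into one cell purely for illustration.

```python
def join_values(values, delim="|"):
    """Export side: join values into one cell, refusing values that contain the delimiter."""
    for value in values:
        if delim in value:
            raise ValueError(f"value {value!r} contains delimiter {delim!r}")
    return delim.join(values)


def split_values(cell, delim="|"):
    """Import side: split a cell using the same delimiter used on export."""
    return cell.split(delim)


values = ["Huguenot colony, 1562-1565", "History"]

# A comma delimiter is lossy here, because one value contains a comma itself
lossy = ",".join(values).split(",")

# A delimiter that does not occur in the data survives the round trip
cell = join_values(values, delim="|")
restored = split_values(cell, delim="|")
```

The round trip is only "fairly reliable" because it depends on the delimiter never occurring in the data, which is why the sketch raises an error on export rather than silently producing an ambiguous cell.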
Closing this issue.
When parsing kvp --> XML, build in functionality to split values intended as multivalued
Relevant line: https://github.com/WSULib/combine/blob/spreadsheetharvest/core/xml2kvp.py#L777
And/or, consider handling multivalued fields as multiple rows, similar to OpenRefine?
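As a rough sketch of what splitting multivalued values could mean on the XML side, the example below emits one element per value using the standard library's `xml.etree.ElementTree`. `build_subject` and its input shape are assumptions made for illustration and do not reflect the code at the line linked above.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def build_subject(fields):
    """Build a mods:subject element from (name, value-or-values) pairs (hypothetical helper)."""
    subject = ET.Element(f"{{{MODS_NS}}}subject")
    for name, values in fields:
        # Treat a single string as a one-item list
        if isinstance(values, str):
            values = [values]
        # One sibling element per value, instead of one element holding a joined string
        for value in values:
            child = ET.SubElement(subject, f"{{{MODS_NS}}}{name}")
            child.text = value
    return subject


subject = build_subject([
    ("geographic", "Florida"),
    ("temporal", "Huguenot colony, 1562-1565"),
    ("topic", ("History", "Fiction")),
])
print(ET.tostring(subject, encoding="unicode"))
```

This yields two separate `mods:topic` elements, but they sit next to each other; the original interleaving of `mods:topic`, `mods:temporal`, `mods:topic` is not recovered.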