MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Need better handling of multivalued cells in CSV tabular export #366

Closed ghukill closed 5 years ago

ghukill commented 5 years ago

The rough overview of the process is:

  1. XML is converted to a Python XML2kvp dictionary, kvp_dict
    • multiple values for a field are stored as a list
    • some values are added while it is still a malleable Python dictionary
  2. kvp_dict is dumped as JSON
    • the list is preserved, but now as a string, e.g. ["camper","beethoven"]
  3. the JSON is stored in an RDD, kvp_rdd
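As a rough illustration of steps 1 and 2, here is a minimal sketch using only the standard library. The field names and the hand-built dictionary are hypothetical stand-ins for what XML2kvp would produce; the RDD step is not shown.

```python
import json

# Hypothetical stand-in for XML2kvp output: a flat dictionary in which
# a repeated field is stored as a Python list.
kvp_dict = {
    "mods_title": "Road trip",
    "mods_subject": ["camper", "beethoven"],
}

# Step 2: dump to a single JSON line. Compact separators are used here
# only so the result matches the ["camper","beethoven"] form shown above.
json_line = json.dumps(kvp_dict, separators=(",", ":"))
print(json_line)
```

The list survives the dump, but only as JSON text inside the line; anything reading the line later has to parse it to get the individual values back.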

One complexity of having n JSON lines is not knowing which columns overlap between them. To write CSV, we need to read this JSON into a DataFrame that can then be written out. Unfortunately, using spark.read.json() over kvp_rdd does not pick up multivalued data well, and stores previously multivalued information as a plain string like ["camper","beethoven"].
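The overlapping-columns problem can be sketched without Spark. The following is a plain-Python stand-in for the DataFrame step, not what Combine does: it takes the union of keys across a few hypothetical JSON lines and writes CSV with the standard csv module.

```python
import csv
import io
import json

# Hypothetical JSON lines; each record has a different set of fields.
json_lines = [
    '{"title": "A", "subject": ["camper", "beethoven"]}',
    '{"title": "B", "creator": "someone"}',
]
records = [json.loads(line) for line in json_lines]

# Union of keys over all records, kept in first-seen order.
columns = []
for record in records:
    for key in record:
        if key not in columns:
            columns.append(key)

# Write CSV; fields missing from a record are left empty via restval.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that the csv module stringifies the list with str(), so the multivalued cell is written as the Python representation of the list. That is a different symptom from the Spark one, but the same underlying issue: the list has to be flattened deliberately somewhere.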

One approach was to strip the brackets and replace the , delimiter with a user-defined one like | to create something like:

camper|beethoven

However, and this should have been caught, a value like:

`["I, also, find olives delicious.", "Do you find olives delicious?"]`

is split incorrectly, because the replacement also picks up the commas embedded in the values:

I| also| find olives delicious.| Do you find olives delicious?

This was performed by this line:

return regexp_replace(regexp_replace(column, '(^\[)|(\]$)|(")', ''), ",", "%s" % multivalue_delim)

which, in retrospect, pretty crudely removes the brackets and quote characters, and then replaces every , with the user-defined multivalue_delim. Clearly, this won't work.
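The failure is easy to reproduce outside Spark. This sketch applies the same two substitutions with Python's re module, assuming the pattern behaves the same way there as in Spark's regexp_replace; crude_flatten is a hypothetical name used only for illustration.

```python
import re

def crude_flatten(value, multivalue_delim="|"):
    # First substitution: drop a leading "[", a trailing "]", and every quote.
    stripped = re.sub(r'(^\[)|(\]$)|(")', "", value)
    # Second substitution: every comma becomes the delimiter, including
    # commas that were part of a value.
    return stripped.replace(",", multivalue_delim)

raw = '["I, also, find olives delicious.", "Do you find olives delicious?"]'
flattened = crude_flatten(raw)
pieces = flattened.split("|")
print(len(pieces), pieces)
```

Two original values come out as four pieces, because nothing distinguishes a comma inside a value from a comma between values once the quotes are gone.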

A couple of options:

  1. convert any Python lists in the kvp dictionary to strings before they leave the dictionary stage
  2. smarter regex fixing with Spark functions

The downside to option 2 is false-positive fixes for something like ["horse","buggy"], which, while odd metadata, should be valid as a literal string. I believe this supports "flattening" to a string before the data leaves the Python dictionary.
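Option 1 could look roughly like the sketch below. The helper name flatten_multivalues and the sample fields are hypothetical, and this is not the code that ended up in Combine; it only shows the idea of joining real lists while the data is still a Python dictionary.

```python
import json

def flatten_multivalues(kvp_dict, multivalue_delim="|"):
    """Return a copy of kvp_dict with list values joined into one string."""
    flat = {}
    for key, value in kvp_dict.items():
        if isinstance(value, list):
            # A real list: join its items with the user-defined delimiter.
            flat[key] = multivalue_delim.join(str(item) for item in value)
        else:
            # Anything else, including a string that merely looks like a
            # list, is passed through untouched.
            flat[key] = value
    return flat

kvp_dict = {
    "note": ["I, also, find olives delicious.", "Do you find olives delicious?"],
    "literal": '["horse","buggy"]',
}
flat = flatten_multivalues(kvp_dict)
json_line = json.dumps(flat)
print(json_line)
```

Because the join happens on actual list objects, embedded commas are preserved and the literal string ["horse","buggy"] is left alone. A value that itself contains the delimiter would still be ambiguous, so the delimiter choice matters.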

ghukill commented 5 years ago

Done, through a combination of removing quoting/escaping, and converting lists embedded in dictionaries to strings.