The rough overview of the process is:

- XML is converted to a Python XML2kvp dictionary, `kvp_dict`
- multiple values for a field are stored as a list
- some values are added while it is still a malleable Python dictionary
- `kvp_dict` is dumped as JSON
- the list is preserved, but now as a string, e.g. `["camper","beethoven"]`
- the JSON is stored in an RDD, `kvp_rdd`
One of the complexities of having n JSON lines is not knowing what overlapping columns may exist. To write CSV, we need to read this JSON into a DataFrame that can then be written. Unfortunately, using `spark.read.json()` over `kvp_rdd` does not pick up multivalued data well, and stores previously multivalued information as a string like `["camper","beethoven"]`.
One approach was to strip the brackets and replace the `,` delimiter with a user-defined one like `|` to create something like:
`camper|beethoven`
However, and this should have been caught, this takes a value like:
`["I, also, find olives delicious.", "Do you find olives delicious?"]`
and produces the incorrect multivalue below, because it also splits on the embedded commas:
`I|also|find olives delicious.|Do you find olives delicious?`
The line that performed this, in retrospect, pretty crudely removes the brackets and quotation marks, and then replaces the `,` delimiter with the user-defined `multivalue_delim`. Clearly, this won't work.
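To make the failure concrete, here is a minimal sketch in plain Python. `naive_flatten` is a hypothetical stand-in for the crude strip-and-replace step (the original line is not reproduced here), and `parsed_flatten` is one possible alternative that parses the string as JSON before joining. Both function names are assumptions for illustration only.

```python
import json

multivalue_delim = "|"  # user-defined delimiter, as described above


def naive_flatten(serialized, delim=multivalue_delim):
    """Hypothetical stand-in for the crude approach: drop brackets and
    quotes, then treat every remaining comma as a value separator."""
    stripped = serialized.strip("[]").replace('"', "")
    return delim.join(part.strip() for part in stripped.split(","))


def parsed_flatten(serialized, delim=multivalue_delim):
    """Parse the serialized list as JSON first, so commas inside a
    value are never mistaken for separators."""
    values = json.loads(serialized)
    return delim.join(values)


simple = '["camper","beethoven"]'
tricky = '["I, also, find olives delicious.", "Do you find olives delicious?"]'

print(naive_flatten(simple))   # camper|beethoven
print(naive_flatten(tricky))   # I|also|find olives delicious.|Do you find olives delicious?
print(parsed_flatten(tricky))  # I, also, find olives delicious.|Do you find olives delicious?
```

The naive version only looks right when no value contains a comma; parsing first keeps each value intact.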
A couple of options:

1. convert any Python lists in the kvp dictionary to strings before they leave the dictionary stage
2. smarter regex fixing with Spark functions
The downside to option 2 would be false-positive fixes for something like `["horse","buggy"]`, which, while an odd metadata value, should be a valid one. I believe this supports "flattening" to a string before the data leaves the Python dictionary.
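A rough sketch of option 1, assuming `kvp_dict` is a flat dictionary whose multivalued fields are Python lists. The helper name `flatten_multivalues` and the sample field names are hypothetical and are not part of XML2kvp.

```python
import json


def flatten_multivalues(kvp_dict, multivalue_delim="|"):
    """Return a copy of kvp_dict in which every list (or tuple) value is
    joined into a single delimited string; other values pass through."""
    flat = {}
    for field, value in kvp_dict.items():
        if isinstance(value, (list, tuple)):
            flat[field] = multivalue_delim.join(str(v) for v in value)
        else:
            flat[field] = value
    return flat


# Sample record: one real multivalued field, and one field whose literal
# string value merely looks like a serialized list.
kvp_dict = {
    "title": "Olives",
    "subject": ["I, also, find olives delicious.", "Do you find olives delicious?"],
    "note": '["horse","buggy"]',
}

flat = flatten_multivalues(kvp_dict)
json_line = json.dumps(flat)  # what would then be handed on to the RDD stage

print(flat["subject"])  # I, also, find olives delicious.|Do you find olives delicious?
print(flat["note"])     # ["horse","buggy"]
```

Because the join happens while the values are still real Python lists, embedded commas are untouched, and a literal string such as `["horse","buggy"]` is left alone rather than being "fixed" by a later regex.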