Closed ghukill closed 5 years ago
It appears that upgrading to Spark 2.2.x will address this, as we could then pass multiLine=True when reading/writing CSV:
https://issues.apache.org/jira/browse/SPARK-19610
Tested, fixed in Spark 2.3 when reading CSV.
Worried there might be executor memory problems with multiLine=True, so going to add a flag that allows overriding this through XML2kvp configs.
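A minimal sketch of what such an overridable read might look like in PySpark. The helper name `read_csv_export` and the `multiline` parameter are hypothetical and are not the actual XML2kvp config key; `spark` is assumed to be an existing SparkSession.

```python
def read_csv_export(spark, path, multiline=True):
    """Read a CSV export, optionally letting quoted fields span lines.

    `multiline` is a hypothetical override flag (assumption), passed
    straight through to Spark's `multiLine` CSV read option.
    """
    # multiLine is a DataFrameReader.csv option available from Spark 2.2
    return spark.read.csv(path, header=True, multiLine=multiline)
```

Usage would be along the lines of `read_csv_export(spark, "export.csv", multiline=False)` to fall back to line-by-line parsing. The memory worry is plausible because, with multiLine enabled, Spark generally cannot split a single CSV file across tasks.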
Closing.
For CSV tabular exports, newlines found in the XML result in newlines in the CSV that break rows and records. This is easy to miss, as the re-import will complete successfully, but with additional, malformed Records. The number of these Records will equal the number of newlines found in the corpus.
This feels like a modification is needed for export, not import. The import must blindly treat newlines as new Records, but the export may affect what is written to CSV.
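To illustrate the problem outside Spark, here is a small sketch using only Python's standard `csv` module. The sample rows and the `flatten_newlines` helper are made up for illustration; replacing newlines with spaces is just one possible export-side approach, not necessarily the one used here.

```python
import csv
import io

rows = [
    ["id", "description"],
    ["1", "single line"],
    ["2", "line one\nline two\nline three"],  # two embedded newlines
]

def to_csv(data):
    """Serialize rows to CSV text with '\n' line endings."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(data)
    return buf.getvalue()

text = to_csv(rows)

# A line-oriented reader sees one record per physical line,
# so each embedded newline produces one extra, malformed record.
naive_records = text.splitlines()

# A quote-aware reader keeps the multi-line field in one record.
parsed = list(csv.reader(io.StringIO(text)))

def flatten_newlines(value):
    """Hypothetical export-side fix: collapse newlines to spaces."""
    return " ".join(value.splitlines())

flat_text = to_csv([[flatten_newlines(v) for v in row] for row in rows])
flat_records = flat_text.splitlines()
```

Here `naive_records` has two more entries than `rows`, matching the two embedded newlines, while `parsed` and `flat_records` both keep the original row count.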