MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

tabular export/import, newlines are breaking in CSV #365

Closed ghukill closed 5 years ago

ghukill commented 5 years ago

For CSV tabular exports, newlines found in the XML produce newlines in the CSV that break rows and records. This is easy to miss, because the re-import completes successfully but creates additional, malformed Records. The number of these extra Records equals the number of newlines found in the corpus.

This feels like it calls for a change to the export, not the import. The import has to blindly treat newlines as new Records, but the export can affect what is written to the CSV.
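
To illustrate the failure mode, here is a minimal sketch using only Python's standard `csv` module, not Combine's actual export code. The sample rows and variable names are made up for the example. It writes a field containing an embedded newline, then compares a line-based read against a quote-aware read.

```python
import csv
import io

# Hypothetical sample data: the first record's description contains an
# embedded newline, as might come from an XML text node.
rows = [
    ["id", "description"],
    ["rec1", "line one\nline two"],
    ["rec2", "single line"],
]

# Write the rows as CSV; the writer quotes the field that contains a newline.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
text = buf.getvalue()

# Line-based read: every physical line is treated as a row, so the
# embedded newline splits rec1 into two malformed rows.
naive_rows = text.splitlines()

# Quote-aware read: the quoted newline stays inside the field.
parsed_rows = list(csv.reader(io.StringIO(text)))

print(len(naive_rows), len(parsed_rows))
```

The line-based read ends up with one extra row per embedded newline, which matches the count of malformed Records described above.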

ghukill commented 5 years ago

It appears that upgrading to Spark 2.2.x will address this, as we could then pass multiLine=True when reading/writing CSV: https://issues.apache.org/jira/browse/SPARK-19610
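
A rough sketch of what the read side could look like in PySpark, assuming an existing `SparkSession` and Spark 2.2 or later. The helper name `read_tabular_csv` and its `multiline` parameter are hypothetical and are not taken from Combine's code.

```python
def read_tabular_csv(spark, path, multiline=True):
    """Read a tabular CSV export into a Spark DataFrame.

    spark     -- an existing pyspark.sql.SparkSession (assumed Spark 2.2+)
    path      -- location of the CSV file (hypothetical, caller-supplied)
    multiline -- hypothetical switch; when True, quoted fields may span
                 several physical lines instead of starting a new row
    """
    # header=True treats the first row as column names;
    # multiLine is the DataFrameReader CSV option added by SPARK-19610.
    return spark.read.csv(path, header=True, multiLine=multiline)


# Example usage (requires a running Spark session):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df = read_tabular_csv(spark, "export.csv")
```

Keeping `multiline` as a parameter leaves room to turn the option off for very large files, where parsing records that span lines may cost more memory.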

ghukill commented 5 years ago

Tested, fixed in Spark 2.3 when reading CSV.

Worried there might be executor memory problems with multiLine = True, so going to add a flag that allows overriding this through the XML2kvp configs.

Closing.