KitWallace / treemap


newlines in CSV fields #150

Closed KitWallace closed 6 years ago

KitWallace commented 6 years ago

CSV files generated from BCC data contain newlines embedded in fields. Simply tokenising the CSV string on newlines in XQuery introduces spurious line breaks, so the string must be pre-processed to replace newlines inside quoted strings before tokenisation.

We should be able to do this with a regexp along these lines (from PHP):

    $parsedCSV = preg_replace('/(,|\n|^)"(?:([^\n"]*)\n([^\n"]*))*"/', '$1"$2 $3"', $parsedCSV);
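A regexp like this can be fragile when a quoted field holds several newlines or doubled quotes. As a rough alternative sketch (in Python rather than XQuery or PHP, and not something from the treemap code), the same pre-processing can be done with a single pass that tracks whether the current character is inside a quoted field. The function name `flatten_quoted_newlines` and the `replacement` parameter are hypothetical, and the sketch assumes standard CSV quoting where a literal quote is written as two double quotes.

```python
def flatten_quoted_newlines(text, replacement=" "):
    """Replace line breaks that occur inside double-quoted CSV fields.

    Line breaks outside quotes (the real record separators) are kept.
    """
    out = []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '"':
            # Every quote toggles the state; a doubled quote ("") inside
            # a field toggles twice and so leaves the state unchanged.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes and ch in "\r\n":
            # Treat a \r\n pair as one line break.
            if ch == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
            out.append(replacement)
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

After this pass the string can be split on newlines and each line tokenised as before, since every remaining newline ends a record.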

KitWallace commented 6 years ago

This Python script does it:

    #! /usr/bin/env python

    import csv
    import sys

    csv_reader = csv.reader(sys.stdin)
    h_outfile = sys.stdout

    for row in csv_reader:
        row = ",".join(row)
        row = row.replace('\n', ' ').replace('\r', ' ')
        h_outfile.write("%s\n" % (row))
        h_outfile.flush()

    # print row

It is in /chris/trees/no_nl.py and is based on a more comprehensive version:

http://blog.nguyenvq.com/blog/2014/08/07/change-delimiter-in-a-csv-file-and-remove-line-breaks-in-fields
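One thing to watch with the script above: it joins fields with a bare comma, so a field that itself contains a comma or a quote is no longer quoted in the output. A possible variant, assuming Python 3 and that re-quoting on output is acceptable to the XQuery side, writes rows back through `csv.writer`. The function name `strip_field_newlines` is hypothetical and this is only a sketch, not the contents of no_nl.py.

```python
import csv
import io
import re

# Matches \r\n, a lone \r, or a lone \n.
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def strip_field_newlines(infile, outfile):
    """Copy CSV from infile to outfile, replacing line breaks inside
    fields with a single space and re-quoting fields where needed."""
    reader = csv.reader(infile)
    writer = csv.writer(outfile, lineterminator="\n")
    for row in reader:
        writer.writerow([LINE_BREAK.sub(" ", field) for field in row])


# Small in-memory demonstration with made-up data.
src = io.StringIO('id,note\n1,"oak, large\nnear gate"\n')
dst = io.StringIO()
strip_field_newlines(src, dst)
result = dst.getvalue()
```

To use it as a filter like the original script, it could be called as `strip_field_newlines(sys.stdin, sys.stdout)` after importing `sys`.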