Closed jonfroehlich closed 5 years ago
I might also be able to solve this by passing a special converter
to the Pandas read_csv
method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html.
OK, ugh, the note
field is messing up lots of other row parsing as well:
Again, the easiest fix, I think, would be to escape this column somehow (and typically this is done with a quotation but then we'd have to make sure that if the note field actually contained a quotation, that this was properly escaped).
Aha, in fact, by rfc4180 standards, fields that contain embedded commas should be escaped! From the standard:
Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma. For example:
aaa,bbb,ccc
Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF zzz,yyy,xxx
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:
"aaa","b""bb","ccc"
I'm going to stop working on this now until I hear back from @misaugstad with his thoughts... :)
okay @jonfroehlich I've sent you a new CSV of the interactions table where all the columns are quoted. I might just quote the notes column next time to reduce the size of the table any amount I can. I don't think we use any quotes within the notes field, so I didn't go through the process of figuring out how to escape those quotes... But let me know if you find any that cause a problem!
Wow, this worked great. Like amazing! I agree with you that we could likely just have quotes around the notes field in the future...
Great!
I've parsed a fair number of complicated CSV files in my life :) and typically, in my experience, when a field (aka a cell) includes data like a
,
or even a\n
embedded within it, that these are escaped by quotation marks. Indeed, Wikipedia states:If possible for future SQL dumps, we need to turn on a flag that surrounds each CSV cell with a quotation so that we can more easily parse out rows like this one that includes a cell:
148,1,LabelingCanvas_FinishLabeling,stxXyCKAbd73DmkM2vsIHA,38.9404983520508,-77.0676193237305,262,-6,1,labelType:CurbRamp,canvasX:362,canvasY:280,2019-03-04 11:59:59.576-08,1,1
In this case, it's the
note
field that has an embedded comma:labelType:CurbRamp,canvasX:362,canvasY:280
:So, imo, we should surround each cell by default with a quotation to deal with this--so the above row would be:
"148","1","LabelingCanvas_FinishLabeling","stxXyCKAbd73DmkM2vsIHA,38.9404983520508","-77.0676193237305","262","-6","1","labelType:CurbRamp,canvasX:362,canvasY:280","2019-03-04 11:59:59.576-08","1","1"
Perhaps we only need to escape the
note
field?I might be wrong about this specific example, however, as I'm still getting used to our log file format:
audit_task_interaction_id, audit_task_id, action, gsv_panorama_id, lat, lng, heading, pitch, zoom, note, timestamp, temporary_label_id, mission_id