ProjectSidewalk / sidewalk-quality-analysis

An analysis of Project Sidewalk user quality based on interaction logs
5 stars 3 forks source link

Future SQL Dumps Should Turn on Quotations for CSV #3

Closed jonfroehlich closed 5 years ago

jonfroehlich commented 5 years ago

I've parsed a fair number of complicated CSV files in my life :) and typically, in my experience, when a field (aka a cell) includes data like a , or even a \n embedded within it, that these are escaped by quotation marks. Indeed, Wikipedia states:

The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field.

If possible for future SQL dumps, we need to turn on a flag that surrounds each CSV cell with a quotation so that we can more easily parse out rows like this one that includes a cell: 148,1,LabelingCanvas_FinishLabeling,stxXyCKAbd73DmkM2vsIHA,38.9404983520508,-77.0676193237305,262,-6,1,labelType:CurbRamp,canvasX:362,canvasY:280,2019-03-04 11:59:59.576-08,1,1

In this case, it's the note field that has an embedded comma: labelType:CurbRamp,canvasX:362,canvasY:280:

So, imo, we should surround each cell by default with a quotation to deal with this--so the above row would be: "148","1","LabelingCanvas_FinishLabeling","stxXyCKAbd73DmkM2vsIHA,38.9404983520508","-77.0676193237305","262","-6","1","labelType:CurbRamp,canvasX:362,canvasY:280","2019-03-04 11:59:59.576-08","1","1"

Perhaps we only need to escape the note field?

I might be wrong about this specific example, however, as I'm still getting used to our log file format: audit_task_interaction_id, audit_task_id, action, gsv_panorama_id, lat, lng, heading, pitch, zoom, note, timestamp, temporary_label_id, mission_id

jonfroehlich commented 5 years ago

I might also be able to solve this by passing a special converter to the Pandas read_csv method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html.

jonfroehlich commented 5 years ago

OK, ugh, the note field is messing up lots of other row parsing as well:

image

Again, the easiest fix, I think, would be to escape this column somehow (and typically this is done with a quotation but then we'd have to make sure that if the note field actually contained a quotation, that this was properly escaped).

jonfroehlich commented 5 years ago

Aha, in fact, by rfc4180 standards, fields that contain embedded commas should be escaped! From the standard:

  1. Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma. For example:

    aaa,bbb,ccc

  2. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:

    "aaa","bbb","ccc" CRLF zzz,yyy,xxx

  3. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

    "aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx

  4. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

    "aaa","b""bb","ccc"

jonfroehlich commented 5 years ago

I'm going to stop working on this now until I hear back from @misaugstad with his thoughts... :)

misaugstad commented 5 years ago

okay @jonfroehlich I've sent you a new CSV of the interactions table where all the columns are quoted. I might just quote the notes column next time to reduce the size of the table any amount I can. I don't think we use any quotes within the notes field, so I didn't go through the process of figuring out how to escape those quotes... But let me know if you find any that cause a problem!

jonfroehlich commented 5 years ago

Wow, this worked great. Like amazing! I agree with you that we could likely just have quotes around the notes field in the future...

image

misaugstad commented 5 years ago

Great!