cldf / csvw

CSV on the web
Apache License 2.0

csvw read will continue reading to end of file or field limit on unclosed quote #52

Closed SimonGreenhill closed 3 years ago

SimonGreenhill commented 3 years ago

If a CSV file contains a quoted field with an unclosed quote e.g.:

Row,Content
1,"i am an unclosed comment
2,Lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum.

...the reader will combine all subsequent content until the end of file or the field limit, whichever comes first. That is, row 1, column 'Content' becomes

"i am an unclosed comment\n2,Lorem ipsum dol...

either generating a very large field, or raising:

_csv.Error: field larger than field limit (131072)

This is an issue with a malformed csv file, and I don't see an easy way to solve this, so this issue is primarily to document the problem for anyone else who comes across it.
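For anyone who wants to reproduce this without csvw, here is a minimal sketch using only Python's standard `csv` module. The sample data is the file above with the second row shortened, and `read_rows` is just a hypothetical helper for this example. It also shows that passing `strict=True` to the stdlib reader at least turns the silent merge into an error once the end of the input is reached.

```python
import csv
import io

# Sample data from the report above, with the long second row shortened.
MALFORMED = 'Row,Content\n1,"i am an unclosed comment\n2,Lorem ipsum dolor sit amet\n'


def read_rows(text, **fmtparams):
    """Parse CSV text with the standard library reader and return all rows."""
    return list(csv.reader(io.StringIO(text, newline=''), **fmtparams))


# Default behaviour: the unclosed quote swallows everything that follows,
# so the file yields the header plus a single merged data row.
rows = read_rows(MALFORMED)
print(len(rows))
print(repr(rows[1][1]))

# With strict=True the reader raises csv.Error when the input ends
# while it is still inside a quoted field.
strict_error = None
try:
    read_rows(MALFORMED, strict=True)
except csv.Error as exc:
    strict_error = exc
print('strict mode error:', strict_error)
```

Note that strict mode does not find the offending line; it only reports that the input ended inside a quoted field.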

SimonGreenhill commented 3 years ago

for the record, the file I noticed this in had a damn smart quote intended as the end quote, which of course is invalid, but damn hard to debug:

1,"i am an unclosed comment”
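
Since this kind of character is hard to spot by eye, a small scan for typographic quotes can help with debugging. This is only a sketch: `find_smart_quotes` is a hypothetical helper, not part of csvw, and the set of characters it checks is an assumption that covers only the four most common curly quotes.

```python
import unicodedata

# Curly single and double quotes; the csv module treats these as ordinary
# characters, not as the quote character.
SMART_QUOTES = '\u2018\u2019\u201c\u201d'


def find_smart_quotes(text):
    """Yield (line_number, column, character_name) for each typographic quote."""
    for lineno, line in enumerate(text.split('\n'), start=1):
        for col, char in enumerate(line, start=1):
            if char in SMART_QUOTES:
                yield lineno, col, unicodedata.name(char)


sample = 'Row,Content\n1,"i am an unclosed comment\u201d\n2,Lorem ipsum\n'
hits = list(find_smart_quotes(sample))
for hit in hits:
    print(hit)
```

A hit is not necessarily an error, because curly quotes are perfectly legal inside field content; the scan only narrows down where to look.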

xrotwang commented 3 years ago

I'd still say we do the right thing, i.e. fall back to what Python's csv module does. So rather than introducing dubious heuristics to infer that something fishy is going on (like pandas or R might do :) ), I'd keep this as is.

xrotwang commented 3 years ago

I think that problem is inherent in storing data in text files (same with NEXUS, etc.): As soon as you start to use the format also for non-traditional tabular content, such as full gene sequences or images serialized as data URLs, and violate the assumption that table cells will be small, the advantage of text (namely that it can be inspected by looking at it) breaks down.

SimonGreenhill commented 3 years ago

yeah, I can only think of very brittle ways to fix this, so let's wontfix for now.

xrotwang commented 3 years ago

I think the proper way to fix this is csvw (or similar): add metadata to your csv to inform the parsing.
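
To illustrate the idea of telling the parser what to expect, here is a rough stdlib-only analogue. It does not use csvw's own API or a CSVW metadata file. It assumes the file is known to contain no quoted fields at all, which is what a dialect description could declare, and passes that knowledge to Python's `csv.reader` via `quoting=csv.QUOTE_NONE`.

```python
import csv
import io

# Sample data from the report above, with the long second row shortened.
MALFORMED = 'Row,Content\n1,"i am an unclosed comment\n2,Lorem ipsum dolor sit amet\n'

# Assumption: we know out of band that this file never uses quoting, so the
# double quote is treated as an ordinary character and rows stay separate.
rows_unquoted = list(
    csv.reader(io.StringIO(MALFORMED, newline=''), quoting=csv.QUOTE_NONE)
)
for row in rows_unquoted:
    print(row)
```

The trade-off is that fields can then no longer contain the delimiter or a line break, so this only works for data where that assumption really holds.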

Simon J Greenhill notifications@github.com wrote on Fri, 5 Feb 2021, 09:29:

Closed #52 https://github.com/cldf/csvw/issues/52.
