AccelerationNet / cl-csv

A common lisp library providing easy csv reading and writing

Parsing confusion in the presence of non-escaping backslashes #17

Closed dimitri closed 10 years ago

dimitri commented 10 years ago

Hi,

As reported towards the end of https://github.com/dimitri/pgloader/issues/80, cl-csv fails to parse simple input when a text field contains an unexpected escape character (not the whole escape string) in the middle.

Here's a reduced test case:

"16417153","1401640227","Jun 1 2014","HTML -//W3C//DTD HTML 4.01 Frameset//EN\\",""

And I can reproduce the failure with the following code:

CL-USER> (with-open-file (s "foo.csv")
           (cl-csv:read-csv s :quote #\" :separator #\, :escape "\\\""))
; Evaluation aborted on #<SB-KERNEL:CASE-FAILURE expected-type:
                         (MEMBER :COLLECTING :COLLECTING-QUOTED :WAITING)
                         datum: :WAITING-FOR-NEXT>.

bobbysmith007 commented 10 years ago

So the bug, as I understand it, is that we need a way to escape the escape character (in some circumstances). By default this should be "\\" (i.e., two backslashes in a row). Any suggestions for the name? escape-escape sounds awful but is also accurate.

dimitri commented 10 years ago

Well, in the case of that specific input file, https://github.com/HTTPArchive/httparchive/issues/25 hints that the backslash is not there for any real reason (the string was truncated).

So I'm not sure we should reason in terms of escaping the escape character, rather than just allowing for a general escape character: backslash could be used to escape whatever follows, which in the case of the faulty input we have is another backslash. After that we have a free quote, so the quoted section ends. What do you think?

bobbysmith007 commented 10 years ago

I read that as: we would like a new parser escaping mode that, rather than replacing all quote-escapes with a quote, replaces {escape-character}{thing} with {thing}, regardless of what {thing} is.

I have to imagine that this is partly where the "" escape sequence arose.

I guess that means a new parameter, :escaping-mode, that defaults to :quote and accepts one of (:quote :following-char).
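
To make the difference between the two proposed modes concrete, here is a minimal sketch of how each would read the last quoted field of the test case. It is not cl-csv's implementation: `read-quoted-field` and `*sample*` are hypothetical names, the escape is simplified to a single character (cl-csv takes an escape string), and the doubled-quote escape is ignored. Only the mode names :quote and :following-char come from the proposal above.

```lisp
;; Hypothetical helper, not part of cl-csv. Assumes a single escape
;; character and ignores the "" (doubled quote) escape for brevity.
(defun read-quoted-field (line start &key (mode :quote)
                                          (escape-char #\\)
                                          (quote-char #\"))
  "Read a quoted field from LINE; START is the index just after the opening quote.
Returns the field text and the index just after the closing quote
(NIL if the field is never closed)."
  (let ((n (length line))
        (i start)
        (end nil))
    (values
     (with-output-to-string (out)
       (loop while (and (< i n) (null end))
             do (let ((c (char line i))
                      (next (when (< (1+ i) n) (char line (1+ i)))))
                  (cond
                    ;; :quote mode: only escape+quote is an escape sequence.
                    ;; :following-char mode: escape+anything yields that char.
                    ((and next
                          (char= c escape-char)
                          (or (eq mode :following-char)
                              (char= next quote-char)))
                     (write-char next out)
                     (incf i 2))
                    ;; An unescaped quote closes the field.
                    ((char= c quote-char)
                     (setf end (1+ i)))
                    ;; Anything else, including a lone escape char, is literal.
                    (t
                     (write-char c out)
                     (incf i))))))
     end)))

;; Shortened tail of the test case: "EN\\",""
(defparameter *sample* "\"EN\\\\\",\"\"")
```

Usage, with the field starting at index 1 (just after the opening quote):

```lisp
(read-quoted-field *sample* 1 :mode :quote)
;; => "EN\\\",", 8   ; the field swallows the separator, leaving a stray quote
(read-quoted-field *sample* 1 :mode :following-char)
;; => "EN\\", 6      ; the field ends at the free quote, before the separator
```

In :quote mode the second backslash pairs with the quote, so the parser is still inside the field when it reaches the separator. In :following-char mode the two backslashes pair up first, which matches the reading suggested earlier in the thread.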

bobbysmith007 commented 10 years ago

Please try this out and let me know if it matches what you had in mind / solves your parsing error.