davidsantiago / clojure-csv

A library for reading and writing CSV files from Clojure
187 stars 35 forks source link

Fix for quote within quoted fields issue. #31

Closed attil-io closed 8 years ago

attil-io commented 8 years ago

Consider the following field to be parsed in non-strict mode, with standard EOL and separator:

"\"Hello\" \"Hello2\" \"Hello3\""

I.e., besides the starting and closing quotes, there are some additional (invalid!) quotes in the input field. Without the fix, the following result is received:

" \"Hello2\" \"Hello3\""

This is not consistent: the first part is always lost.

After this fix, the result will be:

"Hello\" \"Hello2\" \"Hello3"

That is, the opening and closing quotes are treated as "normal" ones, whilst those within the field are part of the value. (The last quote should be immediately followed by a separator, new line, or EOF, to be recognized as such.)

See also the new test cases.

davidsantiago commented 8 years ago

Can you explain more about what you're going for here? This is quite a big change to how the parsing works, with changes all over the code to enable this, for what appears to be clearly incorrect input. At this point, this is quite a bit more work than I feel willing to have the parser do in order to make sense of this type of field.

attil-io commented 8 years ago

First of all, I absolutely agree the input is incorrect (it contains quotes within the field). The suggested change attempts to handle this incorrect input gracefully.

Current situation

INPUT: "\"Hello\" \"Hello2\" \"Hello3\"" OUTPUT: " \"Hello2\" \"Hello3\""

I.e., the first part is lost. With the suggested change the first part is kept (but the surrounding quotes are not, as they are the 'real' ones, and thus not part of the field).

Reason

  1. In parse-csv-line, during the first iteration, the quote is encountered, thus the condition (== look-ahead (int quote-char)) matches. As a result, fields remains empty, and last-field will be the value between the first quotes (Hello).
  2. In the second iteration, we now get to the :else clause and read all the rest of the input (including the quotes), as if it were unquoted in read-unquoted-field. Since we are not in strict mode, the quotes will be accepted as part of the field.
  3. The just read part (\"Hello2\" \"Hello3\") is now placed into last-field, thus the previous value is lost (note that at this point fields is still empty).
  4. We arrive at the end of the string ((== -1 look-ahead)), and thus last-field (i.e. \"Hello2\" \"Hello3\") is returned.

Proposed change

Let's handle differently quoted fields in case of strict and non-strict (permissive) mode. In case of strict mode, the old way is used (i.e., read until next quote, as implemented in read-quoted-field). For permissive mode, a new function is introduced: read-quoted-field-permissive). The important part of this function is, that for each quote we have to check the character following the quote as well, as in the following lines:

+            (== c quote-char)
+                (let [next-char (reader-peek reader 2)]
+                     (if (or (== next-char delimiter) (== next-char -1) (eol-at-reader-pos? reader end-of-line 1)

In this way, we can determine whether the quote is a closing one (if it is followed by a delimiter (like a comma), an EOL or an EOF), or an (erroneously placed) quote within the input.

Changes to other methods

The methods for checking the current position (reader-peek, lf-at-reader-pos?, `crlf-at-reader-pos?,custom-eol-at-reader-pos?,eol-at-reader-pos?) have now overloaded versions with anoffsetparameter. These overloaded versions check for the specific condition atcurrent-position + offset(instead ofcurrent-position``, as the base versions would do). They are necessary to be able to check for EOL, EOF or delimiter after the quote character.

Some more thoughts

  1. Maybe there is a simpler solution to achieve this (e.g. read-quoted-field-permissive could be merged into read-quoted-field).
  2. I feel that the changes to the methods for checking current position (reader-peek etc.), are useful in themselves as well, for possible later enhancements.
  3. I'm quite new at clojure, so it might be that I did not always follow the best practices ;)
davidsantiago commented 8 years ago

Right, there is definitely a bug there, no question. It happens when an quoted field ends and isn't followed by a separator or eol. You've correctly identified the location that causes this. But I don't think that changing the parsing and parsing rules to make this malformed file parse is the right solution. That input should be throwing a format error. Fixing this should be a lot less code, but basically it involves changing the state machine in parse-csv-line so that seeing a delimiter is a requirement for parsing the next field. As it is now, seeing a delimiter simply skips the reader forward and the code everywhere else assumes that any delimiters have been skipped past and nothing more needs to be done.

attil-io commented 8 years ago

So, if I understand correctly, the expected behaviour is the following?

(is (thrown? Exception (dorun (parse-csv "\"Hello\" \"Hello2\" \"Hello3\""))))

(Even in non-strict mode.)

davidsantiago commented 8 years ago

Yes.

attil-io commented 8 years ago

I changed the implementation to throw the exception in case of the invalid input. Would you like me to open a new pull request for that?

davidsantiago commented 8 years ago

That'd be great, thanks.

On Saturday, January 2, 2016, attil-io notifications@github.com wrote:

I changed the implementation to throw the exception in case of the invalid input. Would you like me to open a new pull request for that?

— Reply to this email directly or view it on GitHub https://github.com/davidsantiago/clojure-csv/pull/31#issuecomment-168407469 .

attil-io commented 8 years ago

Done: https://github.com/davidsantiago/clojure-csv/pull/32 (I had to close this one, otherwise github did not let me open the new one.)