Closed attil-io closed 8 years ago
Can you explain more about what you're going for here? This is quite a big change to how the parsing works, with changes all over the code to enable this, for what appears to be clearly incorrect input. At this point, this is quite a bit more work than I feel willing to have the parser do in order to make sense of this type of field.
First of all, I absolutely agree the input is incorrect (it contains quotes within the field). The suggested change attempts to handle this incorrect input gracefully.
INPUT: "\"Hello\" \"Hello2\" \"Hello3\""
OUTPUT: " \"Hello2\" \"Hello3\""
I.e., the first part is lost. With the suggested change the first part is kept (but the surrounding quotes are not, as they are the 'real' ones, and thus not part of the field).
parse-csv-line
, during the first iteration, the quote is encountered, thus the condition (== look-ahead (int quote-char))
matches. As a result, fields remains empty, and last-field
will be the value between the first quotes (Hello
).:else
clause and read all the rest of the input (including the quotes), as if it were unquoted in read-unquoted-field
. Since we are not in strict mode, the quotes will be accepted as part of the field.\"Hello2\" \"Hello3\"
) is now placed into last-field
, thus the previous value is lost (note that at this point fields
is still empty).(== -1 look-ahead)
), and thus last-field
(i.e. \"Hello2\" \"Hello3\"
) is returned.Let's handle differently quoted fields in case of strict and non-strict (permissive) mode. In case of strict mode, the old way is used (i.e., read until next quote, as implemented in read-quoted-field
). For permissive mode, a new function is introduced: read-quoted-field-permissive
).
The important part of this function is, that for each quote we have to check the character following the quote as well, as in the following lines:
+ (== c quote-char)
+ (let [next-char (reader-peek reader 2)]
+ (if (or (== next-char delimiter) (== next-char -1) (eol-at-reader-pos? reader end-of-line 1)
In this way, we can determine whether the quote is a closing one (if it is followed by a delimiter (like a comma), an EOL or an EOF), or an (erroneously placed) quote within the input.
The methods for checking the current position (reader-peek
, lf-at-reader-pos?
, `crlf-at-reader-pos?,
custom-eol-at-reader-pos?,
eol-at-reader-pos?) have now overloaded versions with an
offsetparameter. These overloaded versions check for the specific condition at
current-position + offset(instead of
current-position``, as the base versions would do).
They are necessary to be able to check for EOL, EOF or delimiter after the quote character.
reader-peek
etc.), are useful in themselves as well, for possible later enhancements.Right, there is definitely a bug there, no question. It happens when an quoted field ends and isn't followed by a separator or eol. You've correctly identified the location that causes this. But I don't think that changing the parsing and parsing rules to make this malformed file parse is the right solution. That input should be throwing a format error. Fixing this should be a lot less code, but basically it involves changing the state machine in parse-csv-line so that seeing a delimiter is a requirement for parsing the next field. As it is now, seeing a delimiter simply skips the reader forward and the code everywhere else assumes that any delimiters have been skipped past and nothing more needs to be done.
So, if I understand correctly, the expected behaviour is the following?
(is (thrown? Exception (dorun (parse-csv "\"Hello\" \"Hello2\" \"Hello3\""))))
(Even in non-strict mode.)
Yes.
I changed the implementation to throw the exception in case of the invalid input. Would you like me to open a new pull request for that?
That'd be great, thanks.
On Saturday, January 2, 2016, attil-io notifications@github.com wrote:
I changed the implementation to throw the exception in case of the invalid input. Would you like me to open a new pull request for that?
— Reply to this email directly or view it on GitHub https://github.com/davidsantiago/clojure-csv/pull/31#issuecomment-168407469 .
Done: https://github.com/davidsantiago/clojure-csv/pull/32 (I had to close this one, otherwise github did not let me open the new one.)
Consider the following field to be parsed in non-strict mode, with standard EOL and separator:
"\"Hello\" \"Hello2\" \"Hello3\""
I.e., besides the starting and closing quotes, there are some additional (invalid!) quotes in the input field. Without the fix, the following result is received:
" \"Hello2\" \"Hello3\""
This is not consistent: the first part is always lost.
After this fix, the result will be:
"Hello\" \"Hello2\" \"Hello3"
That is, the opening and closing quotes are treated as "normal" ones, whilst those within the field are part of the value. (The last quote should be immediately followed by a separator, new line, or EOF, to be recognized as such.)
See also the new test cases.