ben-strasser / fast-cpp-csv-parser

BSD 3-Clause "New" or "Revised" License

allow '\0' '\n' in quoted columns #91

Open orasraf opened 4 years ago

orasraf commented 4 years ago
  1. The main addition is templating LineReader. Currently, next_line() stops at the first line-end character it finds. This is not ideal, since a quoted column may contain a line-end character, which makes next_line() return prematurely. Likewise, find_next_column_end() returns prematurely if it encounters a null character inside a quoted column. My usage of this parser happens to include both line-end and null characters (and other non-ASCII characters) inside quoted columns, and I had to add these changes to make it work. I added a template argument [null_terminated] to the double-quote policy; by default it behaves the same as before the change, but if set to 'false' it allows '\0' inside a quoted column instead of throwing an exception.
  2. Moved the no_quote_escape policy up (compiler warning).
  3. Cast 2 methods to int to match the return type (also a compiler warning).

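To illustrate the idea in item 1, here is a minimal, standalone sketch of a quote-aware line scan. It is not the code from this pull request and does not use the library's LineReader; the function name find_record_end and its signature are hypothetical. It assumes the buffer is passed with an explicit length, so '\0' bytes are ordinary data, and that quotes inside a quoted column are escaped by doubling them.

```cpp
#include <cstddef>

// Hypothetical helper, not part of fast-cpp-csv-parser.
// Returns the index of the first '\n' that lies outside a quoted column,
// or len if no such newline exists. Because the buffer is length-delimited
// rather than null-terminated, '\0' bytes are treated as ordinary data.
std::size_t find_record_end(const char* buf, std::size_t len, char quote = '"') {
    bool in_quotes = false;
    for (std::size_t i = 0; i < len; ++i) {
        if (buf[i] == quote) {
            // A doubled quote ("") toggles twice, so the state is unchanged.
            in_quotes = !in_quotes;
        } else if (buf[i] == '\n' && !in_quotes) {
            return i;  // first line end that really terminates the record
        }
    }
    return len;  // no terminating newline in this buffer
}
```

For the input `a,"x\ny\0z",b\nnext` the scan skips the newline and the null byte inside the quoted column and stops at the newline after `b`.
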
ben-strasser commented 4 years ago

If I understand the code correctly, then you do not do any Windows-to-Linux newline conversion inside quoted strings. This seems like a problem in the making.

Suppose you have an application that writes \n and expects \n when reading it back in. Next, somebody opens the CSV file under Windows using the wrong editor, changes a single digit somewhere, and saves it again. Now all Linux newlines have been changed to Windows \r\n newlines. When reading the CSV file back in, your program fails because it expects \n.

The typical Windows user is not aware of the distinction between \r\n and \n. All he sees is your application failing. Now you get a cryptic "I changed one number and the program broke!" bug report without any mention of Windows or the editor used. I do not want to debug that.

Not allowing newlines in quoted columns seems bad at first, but it does prevent people from running into the problem above.

If the library supported automatic \r\n translation in strings, then I am certain that somebody would come up with a use case where this is the wrong behavior. This means that there should be a policy for newline translation, with \r\n -> \n translation as the default.
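
One way such a policy could look is sketched below. This is only an illustration under assumptions: the names translate_crlf, keep_newlines, and read_quoted_value are hypothetical and do not exist in the library, and the sketch works on a std::string rather than on the parser's internal buffers. The default policy rewrites \r\n to \n, and a caller who needs the raw bytes opts out explicitly.

```cpp
#include <cstddef>
#include <string>

// Hypothetical policy: rewrite every "\r\n" pair to "\n".
// A lone '\r' that is not followed by '\n' is kept as data.
struct translate_crlf {
    static std::string apply(const std::string& s) {
        std::string out;
        out.reserve(s.size());
        for (std::size_t i = 0; i < s.size(); ++i) {
            if (s[i] == '\r' && i + 1 < s.size() && s[i + 1] == '\n') {
                continue;  // drop the '\r'; the '\n' is copied on the next step
            }
            out.push_back(s[i]);
        }
        return out;
    }
};

// Hypothetical policy: return the quoted column content unchanged.
struct keep_newlines {
    static std::string apply(const std::string& s) { return s; }
};

// Hypothetical entry point: translation is the default, opting out is explicit.
template <class NewlinePolicy = translate_crlf>
std::string read_quoted_value(const std::string& raw) {
    return NewlinePolicy::apply(raw);
}
```

With this shape, `read_quoted_value(raw)` yields the same string whether the file was last saved with \n or \r\n line endings, while `read_quoted_value<keep_newlines>(raw)` preserves the bytes exactly as stored.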