Data-Liberation-Front / csvlint.rb

The gem behind http://csvlint.io
MIT License
286 stars 87 forks source link

Quoted CRs in CRLF file incorrectly result in validation errors #198

Open andylolz opened 7 years ago

andylolz commented 7 years ago

So e.g.

"Foo","Bsr","Baz"^M
"Biff","Baff","Boff
"^M
"Qux","Teaspoon","Doge"^M

…results in:

$ csvlint test.csv
.!.
test.csv is INVALID
1. unclosed_quote. Row: 2. "Biff","Baff","Boff
"

According to the RFC 4180, though, I think this should be fine.

andylolz commented 7 years ago

By default, the type of line endings used are auto-detected: https://github.com/theodi/csvlint.rb/blob/873617f5/lib/csvlint/validate.rb#L302

…but the CSV is parsed row-by-row: https://github.com/theodi/csvlint.rb/blob/873617f5/lib/csvlint/validate.rb#L182

So in the case described above, auto-detect gets it wrong:

irb(main):001:0> require 'csv'
=> true
irb(main):002:0> # load the data above
irb(main):003:0* contents = ["Foo,Bsr,Baz\r\n", "Biff,Baff,\"Boff\n", "\"\r\n", "Qux,Teaspoon,Doge\r\n"]
=> ["Foo,Bsr,Baz\r\n", "Biff,Baff,\"Boff\n", "\"\r\n", "Qux,Teaspoon,Doge\r\n"]
irb(main):004:0> 
irb(main):005:0* # try to parse the second row, auto-detecting line endings
irb(main):006:0* CSV.parse_line(contents[1] + contents[2], {:row_sep => :auto})
CSV::MalformedCSVError: Unclosed quoted field on line 1.
  from …/csv.rb:1872:in `block in shift'
  from …/csv.rb:1779:in `loop'
  from …/csv.rb:1779:in `shift'
  from …/csv.rb:1310:in `parse_line'
  from (irb):6
  from …/irb:11:in `<main>'

…But if you manually specify the line endings, everything works correctly:

irb(main):007:0* # try to parse the second row, manually specifying '\r\n' line endings
irb(main):008:0* CSV.parse_line(contents[1] + contents[2], {:row_sep => '\r\n'})
=> ["Biff", "Baff", "Boff\n"]
andylolz commented 7 years ago

I guess the solution would be to set @csv_options[:row_sep] to @line_breaks or something like that.

Floppy commented 7 years ago

Ah yes, @jenit and I were discussing this the other day, and we wondered if it was because of mixed line breaks as you've found. I think your solution will probably work; I need to check what the line-break behaviour inside quotes should be as well.

andylolz commented 7 years ago

I realised I should be explicit that parsing the whole thing with auto-detect line endings switched on works fine:

irb(main):009:0* CSV.parse(contents.join, {:row_sep => :auto})
=> [["Foo", "Bsr", "Baz"], ["Biff", "Baff", "Boff\n"], ["Qux", "Teaspoon", "Doge"]]

So this is only an issue when parsing row-by-row.

kspurgin commented 1 year ago

I have run into this same issue or a very similar one. Behavior of Csvlint and stdlib CSV parse documented here. (Test structure described here)

kspurgin commented 1 year ago

The contributing guidelines say: "Before you start coding, please reach out to us either on our gitter channel or by tagging a repository administrator on the issue ticket you are interested in contributing towards to indicate your interest in helping."

I am not finding a list of the repository admins, and apparently I need push access to a repo to get that info from GitHub API, so I'm making assumptions based on who has merged all recent PRs and tagging @Floppy here to indicate that I'd like to attempt a fix for this issue.

Rationale/use case/background:

We use csvlint in a batch data import application to validate user-generated CSVs before allowing import of data in the files. Using csvlint to prevent the import of CSVs with issues like ragged rows, invalid encoding, and stray/unclosed quotes is important to prevent clients from inadvertently loading malformed data.

This issue is currently blocking some of our clients from importing valid CSVs with mixed EOL characters inside quoted values (an artifact of original data source, and what OS they used to prepare the data, apparently). If you export data from our application as CSV on a Windows machine for roundtripping via our import tool, you get mixed EOL characters and can't re-ingest the data because csvlint flags the CSV as invalid. Our users are primarily collection managers in GLAM institutions, not data wrangling/tech savvy people; they are going to tune out/give up if anyone starts explaining CR, LF, vs CRLF to them.

Mixed EOL characters are handled fine by the standard Ruby CSV library, and are not prohibited by RFC 4180 according to my reading, so csvlint flagging these as invalid needs to be fixed.

kspurgin commented 1 year ago

Hey, I solved this in two different ways, which can be examined here: https://github.com/kspurgin/csvlint.rb/pulls

Everything in the features directory is identical for both PRs. Only lib/csvlint/validate.rb differs.

I prefer PR2 as it feels more expressive of the actual problem, and less hacky than the other.

From what I can tell, the root of the problem is: once we are at Validate.parse_contents, we have already gotten the item of interest down to a single line of the CSV. Csvlint::Validator.parse_line already did a great job of that (handling quoted EOL chars fine), and now we can tell what characters are at the end of this line. But we then call LineCSV.parse_line on it, passing the row_sep: :auto option in many cases. LineCSV subclasses CSV, whose Parser does a terrible job determining the proper row separator from a single line if there are EOL characters inside quotes.

PR2 says "Ok, whether the correct EOL char (according to stated dialect) was used, or whether there were mixed EOL chars is all handled elsewhere. If row_sep is set to something specific, I will continue to pass @csv_options to LineCSV.parse_line, because it might specify something real weird. If it's set to :auto, however, we expect it to be LF, CR, or CRLF, and we can easily determine which of those terminates this line, so we set that as the line-specific row_sep passed to LineCSV with the line, so CSV::Parser doesn't have to guess.

If you let me know which of these is acceptable (if either is), I'll do a real PR with better commit messages.