digital-preservation / csv-validator

CSV Validation Tool and API (CSV Schema RI)
http://digital-preservation.github.io/csv-validator
Mozilla Public License 2.0

Handle carriage returns/new line characters in cells #512

Open techncl opened 3 weeks ago

techncl commented 3 weeks ago

We are treating every carriage return or newline character (\r or \n) as the end of a row, even when it appears inside a cell. Could we use a CSV library that handles carriage return/newline characters within cells and/or properly quotes the cells for us?
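To illustrate the behaviour being asked for, here is a minimal sketch using Python's standard `csv` module. It is not the validator's own code (the helper names `write_rows` and `read_rows` are made up for this example); it only shows a CSV library quoting a cell that contains a line break and reading it back as a single cell.

```python
import csv
import io

def write_rows(rows):
    """Serialise rows to CSV text; the writer quotes cells that need it."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

def read_rows(text):
    """Parse CSV text; a quoted line break stays inside its cell."""
    return list(csv.reader(io.StringIO(text, newline="")))

rows = [
    ["id", "note"],
    ["1", "first line\nsecond line"],  # cell with an embedded newline
    ["2", "plain"],
]

text = write_rows(rows)
round_tripped = read_rows(text)

print(text)
print(round_tripped == rows)
```

The cell with the embedded newline is written inside double quotes, so the reader treats the line break as data rather than as a row terminator.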

sparkhi commented 3 weeks ago

Apache Commons CSV or univocity come to mind.

DavidUnderdown commented 3 weeks ago

Does that not only happen on the initial quick check of the number of rows to process? I was pretty sure that in the actual parsing and validation, line breaks within fields are handled correctly, but you do sometimes see a mismatch between the number of lines it says it has to process and the number it actually processes. For example, it might say there are 1000 rows to process in the CSV file, but finish by reporting 998 of 1000 rows processed because two rows had a line break within a field.
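The mismatch described above can be reproduced in a small sketch. This assumes the quick check counts physical lines while the real parse counts CSV records; it uses Python's standard `csv` module rather than the validator's own code, and `naive_line_count` and `record_count` are hypothetical names for illustration.

```python
import csv
import io

def naive_line_count(text):
    """Count physical lines, as a quick pre-scan of the file might."""
    return len(text.splitlines())

def record_count(text):
    """Count CSV records, treating quoted line breaks as cell content."""
    return sum(1 for _ in csv.reader(io.StringIO(text, newline="")))

# Header plus three records; two records contain a line break in a cell.
sample = (
    'id,note\n'
    '1,"line one\nline two"\n'
    '2,plain\n'
    '3,"a\nb"\n'
)

print(naive_line_count(sample), record_count(sample))
```

Each record with one embedded line break inflates the physical line count by one, which is the same shape as the 998 of 1000 example.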

steve-daly commented 3 weeks ago

Yes, I think CSV Validator can handle carriage returns within cells; it would just benefit from better reporting of line counts/numbers, as David says. I think it already uses Univocity for CSV processing.