gbif / gbif-common

Utility classes
Apache License 2.0

CSVReader.Next() method isn't in conformance with RFC 4180 #1

Closed dvdscripter closed 7 years ago

dvdscripter commented 8 years ago

Reading a CSV file with Next() fails on any field that contains \n. While that is acceptable, most software follows the recommendations of RFC 4180 (https://tools.ietf.org/html/rfc4180).
LibreOffice and Excel seem to accept \n if the field is quoted.
Also, you should skip every empty CSV line, not only those where row.length() == 0 is true:

"",""

and

,

are valid empty lines with two fields.
Can you add support for this? The IPT uses this class to read input source data, and some users are complaining.
I'm reporting it here because other GBIF tools may show the same behavior.
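
To make the request concrete, here is a minimal sketch of the two behaviors described above: keeping \n inside quoted fields, and treating rows whose fields are all empty as empty lines. It assumes a hypothetical class Rfc4180Sketch with hypothetical method names; none of this is part of gbif-common or CSVReader, and it only handles comma-separated, double-quoted input.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the gbif-common implementation. */
final class Rfc4180Sketch {

  /** Reads one record, or returns null when the input is exhausted. */
  public static List<String> readRecord(PushbackReader in) throws IOException {
    int c = in.read();
    if (c == -1) {
      return null;
    }
    List<String> fields = new ArrayList<>();
    StringBuilder field = new StringBuilder();
    boolean quoted = false;
    while (c != -1) {
      if (quoted) {
        if (c == '"') {
          int next = in.read();
          if (next == '"') {
            field.append('"');            // "" inside quotes is an escaped quote
          } else {
            quoted = false;               // closing quote
            if (next != -1) {
              in.unread(next);
            }
          }
        } else {
          field.append((char) c);         // \r and \n are kept inside quotes
        }
      } else if (c == '"' && field.length() == 0) {
        quoted = true;                    // opening quote at the start of a field
      } else if (c == ',') {
        fields.add(field.toString());
        field.setLength(0);
      } else if (c == '\n' || c == '\r') {
        if (c == '\r') {                  // accept \r\n as a single terminator
          int next = in.read();
          if (next != '\n' && next != -1) {
            in.unread(next);
          }
        }
        break;                            // unquoted line break ends the record
      } else {
        field.append((char) c);
      }
      c = in.read();
    }
    fields.add(field.toString());
    return fields;
  }

  /** True for a blank line and for rows such as "","" or a lone comma. */
  public static boolean isEmptyRow(List<String> row) {
    for (String value : row) {
      if (!value.isEmpty()) {
        return false;
      }
    }
    return true;
  }

  /** Reads every record from the source and drops the empty ones. */
  public static List<List<String>> readNonEmptyRecords(Reader source) {
    List<List<String>> rows = new ArrayList<>();
    try (PushbackReader in = new PushbackReader(source)) {
      List<String> row;
      while ((row = readRecord(in)) != null) {
        if (!isEmptyRow(row)) {
          rows.add(row);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return rows;
  }

  public static void main(String[] args) {
    String csv = "id,remarks\n1,\"first line\nsecond line\"\n\"\",\"\"\n,\n2,plain\n";
    for (List<String> row : readNonEmptyRecords(new StringReader(csv))) {
      System.out.println(row);
    }
  }
}
```

In this sketch a line break only ends a record when it appears outside quotes, which is the distinction a line-by-line reader cannot make.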

cgendreau commented 8 years ago

Hi @dvdscripter (sorry for the delay), you are right: CSVReader is not RFC 4180 compliant, and ideally it should be.

I'm not sure we will keep maintaining CSVReader, considering that some libraries handle this easily. We recently added TabularFiles, which simply wraps Super CSV. I'm not sure to what extent the IPT can use it. cc @kbraak

dvdscripter commented 8 years ago

Thanks @cgendreau, I hope @kbraak can comment on this issue too. Anyway, thanks for taking the time to look into this.

kbraak commented 8 years ago

Thanks David. I don't have anything more to add to what Christian explained. In case it helps, you can send users this FAQ explaining how the IPT supports multiline fields.

cgendreau commented 7 years ago

This is now considered fixed (in this project).

See the test testCsvMultiline in TabularDataFileReaderTest, around https://github.com/gbif/gbif-common/blob/master/src/test/java/org/gbif/utils/file/tabular/TabularDataFileReaderTest.java#L68