davidsantiago / clojure-csv

A library for reading and writing CSV files from Clojure

New Line Behavior #23

Closed jmacias closed 10 years ago

jmacias commented 10 years ago

I found a possible issue with the parse-csv function. line-seq follows Java's line-termination behavior; is there a reason why parse-csv does not behave the same way?

 (clojure-csv.core/parse-csv "test1,test2\rtest3,test4")
 ;>>(["test1" "test2\rtest3" "test4"])

This does not behave the same way as:

(line-seq (java.io.BufferedReader. (java.io.StringReader. "test1,test2\rtest3,test4")))
;>> ("test1,test2" "test3,test4")

The reason is that line-seq uses BufferedReader.readLine(), where a line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a line feed. parse-csv, by contrast, only treats \n and \r\n as line terminators.

http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#readLine()

davidsantiago commented 10 years ago

This is not an error. The behavior of BufferedReader.readLine() is not a useful guide to the CSV format. RFC 4180, which is about as close as things get to a standard, specifies CRLF line terminators in CSV files. Clojure-CSV accepts CRLF or just LF, which is also a common line terminator. If you need some other weird character as your end-of-line, then set the :end-of-line option to that character when you parse.
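As a rough illustration of the behavior described above, the sketch below parses the same two records with each terminator. It assumes a clojure-csv 2.x release, where options are passed to parse-csv as keyword arguments; the var names (crlf-rows, lf-rows, cr-rows) are made up for this example.

```clojure
(require '[clojure-csv.core :as csv])

;; Default behavior: a record ends at "\r\n" or at a bare "\n".
(def crlf-rows (csv/parse-csv "test1,test2\r\ntest3,test4"))
(def lf-rows   (csv/parse-csv "test1,test2\ntest3,test4"))

;; A bare "\r" is not a terminator by default, so it has to be
;; requested explicitly through the :end-of-line option.
(def cr-rows (csv/parse-csv "test1,test2\rtest3,test4" :end-of-line "\r"))

(println crlf-rows) ; two records of two fields each
(println lf-rows)   ; same two records
(println cr-rows)   ; same two records
```

With :end-of-line set, only that exact string is treated as the record terminator, so this is best applied per file once the source's convention is known.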

jmacias commented 10 years ago

Thanks @davidsantiago for the clarification.

Hope you can help me with another question. I have a use case where I need to process CSV files from different sources. I've found that files coming from a Mac running MS Excel (OS 9, or Mac OS X running MS Excel 2011) sometimes use '\r' as the line terminator.

I was thinking of using Clojure's line-seq with clojure.java.io/reader and then parsing each line with clojure-csv.core/parse-csv. What would you suggest for handling these files?

Thanks in advance David!

davidsantiago commented 10 years ago

You can't parse a CSV file line by line, as CSV fields can contain line separators. You need to fully parse the CSV to even know which line breaks are record separators and which are part of the data. You should use parse-csv with the :end-of-line option set to "\r" for those files.
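To make the point concrete, here is a small sketch with made-up sample data: a CR-terminated file whose second record has a quoted field containing a carriage return. It assumes clojure-csv 2.x (keyword-argument options); the names sample, naive-lines, and rows are illustrative only.

```clojure
(require '[clojure-csv.core :as csv])

;; Two records; the "note" field of the second one contains a "\r".
(def sample "id,note\r1,\"line one\rline two\"")

;; Splitting first with line-seq cuts the quoted field in half,
;; yielding three "lines" for a two-record file.
(def naive-lines
  (vec (line-seq (java.io.BufferedReader. (java.io.StringReader. sample)))))

;; Parsing the whole input lets the parser tell record terminators
;; apart from carriage returns inside quoted data.
(def rows (csv/parse-csv sample :end-of-line "\r"))

(println (count naive-lines)) ; 3
(println (count rows))        ; 2
```

The third entry of naive-lines is the tail of the quoted field, which would be parsed as a separate, malformed record if each line were handed to parse-csv on its own.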

David
