Open ajtucker opened 1 year ago
Elsewhere we've tended to use commons-csv rather than opencsv, partly because it's included in jena. It may make sense to switch to that for dclib, though I've no idea how it handles non-standard line breaks in files.
Certainly support CR
CR
LF
as a legal sequence for a single line break makes no sense to me, that's just mad.
Legal line endings are normally CR
(old style Macs), CR LF
(window) and of course LF
so interpreting CR CR LF
as two line breaks (albit in different styles) seems like the best interpretation. Preprocessing such broken files is the obvious solution and strikes me as better than bending the interpretation of CSVs.
Though, if you can provide a small test case then I guess we could check commons CSV and updated open CSV in case they have different behaviour.
Legal line endings are normally CR (old style Macs), CR LF (window) and of course LF so interpreting CR CR LF as two line breaks (albit in different styles) seems like the best interpretation.
Totally agree.
Interestingly (maybe) the text/csv
media type registration registration states:
Encoding considerations: CSV MIME entities consist of binary data
[RFC6838]. As per section 4.1.1. of RFC 2046 [RFC2046], this
media type uses CRLF to denote line breaks. However, implementers
should be aware that some implementations may use other values.
which is common at least with text/plain
. I was blissfully unaware of this until @ajtucker pointed it out.
I guess a forensic reading would note that the absence of the word 'only' and the note to implementers do not prohibit the use of other line breaks (specifically a single LF or CR).
While platform line endings are indeed an issue, the crux of this particular issue is that we were surprised that the parser essentially strips any CR
s from quoted cell values. We were expecting to be able to deal with these odd character sequences in a template, but they didn't make it that far.
The grammar in RFC4180 allows a quoted cell value to have CR
and LF
chars and I'd expect them to make it through unmolested.
When digging down, we saw that [BufferedReader](https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/io/BufferedReader.html#readLine()), to this day, reads lines and gobbles up any CR
, LF
or CR
LF
sequences, and the opencsv
parser will then replace them with a single LF
. More recent versions of opencsv
appear to have a keepCarriageReturn(bool)
method.
I've done some quick checking and it looks as though commons-csv:1.0
at least allows CR
s to come through.
import au.com.bytecode.opencsv.{CSVParser, CSVReader}
import org.apache.commons.csv.CSVFormat
import java.io.StringReader
import scala.jdk.CollectionConverters.IterableHasAsScala
object CSVParserTest extends Examples with App {
val acc_records = CSVFormat.RFC4180.parse(new StringReader(eg1)).asScala.toArray
assert(acc_records(1).get(0) == "line1\r\nline2")
val oc_reader = new CSVReader(
new StringReader(eg1),
CSVParser.DEFAULT_SEPARATOR, CSVParser.DEFAULT_QUOTE_CHARACTER, '\u0000')
val oc_records = oc_reader.readAll().asScala.toArray
assert(oc_records(1)(0) == "line1\nline2")
}
trait Examples {
val eg1: String = "a,b\r\n\"line1\r\nline2\",1"
}
Well very few sources of CSV conform to other aspects of RFC4180 (hence all parsers having lots of configuration options to try to cope with the vagaries of CSVs), why should this be different? :)
In particular, all CSV tools will normally cope with, and typically generate, platform-compliant non-quoted line endings not RFC4180 CR LF
. So similar handling of quoted line endings seems reasonable and expected to me unless running in a strict RFC4180 mode which we would never normally do.
I still think this specific case is simply broken data that should be cleaned by preprocessing rather than cleaned by text processing in the dclib rules, feels like a better separation of concerns.
However, if this is causing problems then I'd be OK to switch dclib to commons-csv and test whether using RFC4180 mode would break too many other things or, more likely, make it an option.
We've had a few source CSV files with character sequence
CR
CR
LF
inside cells, used to denote a single line break. The resulting sequence indclib
— i.e. when dealing with values in templates — isLF
LF
, essentially doubling the line breaks. This becomes a problem when users are trying to separate paragraphs in a description, for instance, where the usual practice is to use a double line break to separate paragraphs, which comes out as 4LF
characters and often gets turned into two<br />
s in the resulting HTML.@skwlilac and I tracked this down to the CSV parser
opencsv
(version 2.3) which underneath uses Java'sBufferedReader
to iterate over lines of text, where lines are delimited by eitherCR
,LF
, orCR
LF
. As far as the CSV parser is concerned, if a line ends in the middle of a quoted cell, then it adds back aLF
and continues reading the value.This dependency comes from
lib
version 2.0.0, which in turn comes fromappbase
2.0.0.opencsv
looks to have had a number of forks and owners over the intervening years, but does now say that it deals properly withCR
characters in valuesWhile this character sequence shouldn't be being used in the first place, I'd argue that the parser shouldn't be interpreting characters in cell values and should pass things through verbatim for processing within templates.