epimorphics / dclib

Data Conversion library
Apache License 2.0
2 stars 2 forks source link

Carriage Return 0x0d rewritten in cell values #50

Open ajtucker opened 1 year ago

ajtucker commented 1 year ago

We've had a few source CSV files with character sequence CR CR LF inside cells, used to denote a single line break. The resulting sequence in dclib — i.e. when dealing with values in templates — is LF LF, essentially doubling the line breaks. This becomes a problem when users are trying to separate paragraphs in a description, for instance, where the usual practice is to use a double line break to separate paragraphs, which comes out as 4 LF characters and often gets turned into two <br />s in the resulting HTML.

@skwlilac and I tracked this down to the CSV parser opencsv (version 2.3) which underneath uses Java's BufferedReader to iterate over lines of text, where lines are delimited by either CR, LF, or CR LF. As far as the CSV parser is concerned, if a line ends in the middle of a quoted cell, then it adds back a LF and continues reading the value.

This dependency comes from lib version 2.0.0, which in turn comes from appbase 2.0.0.

opencsv looks to have had a number of forks and owners over the intervening years, but does now say that it deals properly with CR characters in values

While this character sequence shouldn't be being used in the first place, I'd argue that the parser shouldn't be interpreting characters in cell values and should pass things through verbatim for processing within templates.

der commented 1 year ago

Elsewhere we've tended to use commons-csv rather than opencsv, partly because it's included in jena. It may make sense to switch to that for dclib, though I've no idea how it handles non-standard line breaks in files.

Certainly support CR CR LF as a legal sequence for a single line break makes no sense to me, that's just mad.

der commented 1 year ago

Legal line endings are normally CR (old style Macs), CR LF (window) and of course LF so interpreting CR CR LF as two line breaks (albit in different styles) seems like the best interpretation. Preprocessing such broken files is the obvious solution and strikes me as better than bending the interpretation of CSVs.

Though, if you can provide a small test case then I guess we could check commons CSV and updated open CSV in case they have different behaviour.

skwlilac commented 1 year ago

Legal line endings are normally CR (old style Macs), CR LF (window) and of course LF so interpreting CR CR LF as two line breaks (albit in different styles) seems like the best interpretation.

Totally agree.

Interestingly (maybe) the text/csv media type registration registration states:

Encoding considerations:  CSV MIME entities consist of binary data
   [RFC6838].  As per section 4.1.1. of RFC 2046 [RFC2046], this
   media type uses CRLF to denote line breaks.  However, implementers
   should be aware that some implementations may use other values.

which is common at least with text/plain. I was blissfully unaware of this until @ajtucker pointed it out.

I guess a forensic reading would note that the absence of the word 'only' and the note to implementers do not prohibit the use of other line breaks (specifically a single LF or CR).

ajtucker commented 1 year ago

While platform line endings are indeed an issue, the crux of this particular issue is that we were surprised that the parser essentially strips any CRs from quoted cell values. We were expecting to be able to deal with these odd character sequences in a template, but they didn't make it that far.

The grammar in RFC4180 allows a quoted cell value to have CR and LF chars and I'd expect them to make it through unmolested.

When digging down, we saw that [BufferedReader](https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/io/BufferedReader.html#readLine()), to this day, reads lines and gobbles up any CR, LF or CR LF sequences, and the opencsv parser will then replace them with a single LF. More recent versions of opencsv appear to have a keepCarriageReturn(bool) method.

I've done some quick checking and it looks as though commons-csv:1.0 at least allows CRs to come through.

import au.com.bytecode.opencsv.{CSVParser, CSVReader}
import org.apache.commons.csv.CSVFormat

import java.io.StringReader
import scala.jdk.CollectionConverters.IterableHasAsScala

object CSVParserTest extends Examples with App {
  val acc_records = CSVFormat.RFC4180.parse(new StringReader(eg1)).asScala.toArray
  assert(acc_records(1).get(0) == "line1\r\nline2")

  val oc_reader = new CSVReader(
    new StringReader(eg1),
    CSVParser.DEFAULT_SEPARATOR, CSVParser.DEFAULT_QUOTE_CHARACTER, '\u0000')
  val oc_records = oc_reader.readAll().asScala.toArray
  assert(oc_records(1)(0) == "line1\nline2")
}

trait Examples {
  val eg1: String = "a,b\r\n\"line1\r\nline2\",1"
}
der commented 1 year ago

Well very few sources of CSV conform to other aspects of RFC4180 (hence all parsers having lots of configuration options to try to cope with the vagaries of CSVs), why should this be different? :)

In particular, all CSV tools will normally cope with, and typically generate, platform-compliant non-quoted line endings not RFC4180 CR LF. So similar handling of quoted line endings seems reasonable and expected to me unless running in a strict RFC4180 mode which we would never normally do.

I still think this specific case is simply broken data that should be cleaned by preprocessing rather than cleaned by text processing in the dclib rules, feels like a better separation of concerns.

However, if this is causing problems then I'd be OK to switch dclib to commons-csv and test whether using RFC4180 mode would break too many other things or, more likely, make it an option.