gbif / dwca-io

Darwin Core Archive IO
Apache License 2.0

Add support for fields containing line breaks (when quoted) #26

Closed: cgendreau closed this issue 7 years ago

cgendreau commented 7 years ago

Previously reported in https://github.com/gbif/gbif-common/issues/1

As per RFC 4180:

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

The following example is 3 physical lines but only 2 records; the first record spans 2 "lines":

"1","b CRLF
zz","ccc" CRLF
"2","ddd", "eee" CRLF

This is not currently supported and causes issues with StarRecord iteration, as in the example above.

The permanent fix is to replace the underlying CSV reader with TabularDataFileReader.

Note: until archives are transformed into another format (e.g. Parquet), having 1 record span multiple lines can introduce issues with other tools currently in use (e.g. the split command).

cgendreau commented 7 years ago

In the meantime (in case of emergency) there is now a TabularFileNormalizer in gbif-common: https://github.com/gbif/gbif-common/commit/c9f9cd65a0fe1d0e5a6dc419b2a50842c2833693. (TabularFileNormalizer came out of debugging this issue and should not be considered a solution for it.)
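
For reference, a hedged sketch of invoking the normalizer; the exact class location and method signature in the linked commit may differ:

    // Hypothetical invocation of gbif-common's TabularFileNormalizer; the
    // method name and parameters here are assumptions and may differ in
    // the linked commit.
    import org.gbif.utils.file.tabular.TabularFileNormalizer;

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;

    public class NormalizeDemo {
      public static void main(String[] args) throws Exception {
        // Rewrites the file so each record sits on a single physical line.
        TabularFileNormalizer.normalizeFile(
            Paths.get("occurrence.txt"),      // source file
            Paths.get("occurrence-norm.txt"), // normalized output
            StandardCharsets.UTF_8, ',', "\n", '"');
      }
    }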

cgendreau commented 7 years ago

In the case of StarRecord, it seems we won't really have a choice but to transform the extension file(s) (e.g. with TabularFileNormalizer), since this feature relies on sorted extension files and the sort command works line by line.

ansell commented 7 years ago

I wasn't aware that StarRecord required Darwin Core Archive extension files to be sorted. My current issue is with extension files and quoted fields with new lines inside them (which RFC 4180, as you mention above, allows). I will look through the code paths in biocache-store and here a little more to see what could be done. I may end up using the normalizer to get around it though.

ansell commented 7 years ago

I think I have found the code that relies on the sorting:

https://github.com/gbif/dwca-io/blob/3eb2ff93d23270a28955e35e066d8ecdd8b74955/src/main/java/org/gbif/dwca/io/Archive.java#L226

Is it possible to instead sort the files on disk rather than expecting users to sort files before putting them into Darwin Core Archives? It seems like an unusual requirement to expect users to know this beforehand, particularly given that the spec doesn't seem to mention it at this point in time:

https://tdwg.github.io/dwc/terms/guides/text

ALA are currently using the StarRecord interface through ArchiveFactory.openArchive(File) -> Archive.next(). I would like to clear the sorting issue up to make sure that the image/sound loads we do are not corrupted by mistakes or slightly unusual practices on the client end. In many cases I already clean up various line-by-line issues. However, this type of cleanup would require a bulk sort, which is impractical to do in an ad-hoc manner for each cleanup operation when ignoreHeaderLines != 0, as it is for almost all of the data files I encounter.

ansell commented 7 years ago

Sorry, I was confused by the method names and didn't notice where the sorting occurs. My confusion was caused by the method named buildSortedIterator, which doesn't actually sort anything itself; I wasn't expecting to have to trace back to the ArchiveIterator to find the earlier call to the sort command.

ansell commented 7 years ago

I have been thinking through the implications of the sorting process on the optimised streaming of core+extensions to users using a single StarRecord iterator.

There are two overall goals as I see it:

  1. keep the native Linux sort (where unescaped line feed characters are always record delimiters), or improve the Java version to be fast enough for very large files
  2. cater for RFC 4180 quoted fields with line feed characters: https://tools.ietf.org/html/rfc4180#section-2

One approach may be to ensure that the on-disk sort phase is known not to be dealing with embedded line feed characters, by sorting a rewritten file in which line feeds are escaped. Then, to be consistent, that could be followed by a new parse-and-rewrite phase that regenerates the sorted file using a standard profile such as the one defined in RFC 4180 Section 2. This may, however, require that unsorted files also go through the standardisation process, and hence that a temp file be created in every case. Given the potential simplifications of knowing that files used for iteration follow a standard CSV profile, it may be worth the IO time (it should not be limited by CPU resources) to do that for every file. A sketch of the escape step follows below.
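
A minimal sketch of that escape-before-sort rewrite, assuming jackson-dataformat-csv as the RFC 4180 parser (discussed further below); this is illustrative, not this library's code:

    // Sketch: rewrite an RFC 4180 file so each record occupies exactly one
    // physical line, making it safe for line-based tools such as sort.
    import com.fasterxml.jackson.databind.MappingIterator;
    import com.fasterxml.jackson.dataformat.csv.CsvMapper;
    import com.fasterxml.jackson.dataformat.csv.CsvParser;

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class LineFeedEscaper {
      public static void escapeLineFeeds(File in, File out) throws IOException {
        CsvMapper mapper = new CsvMapper();
        mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY); // read each row as String[]
        try (MappingIterator<String[]> rows =
                 mapper.readerFor(String[].class).readValues(in);
             Writer w = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream(out), StandardCharsets.UTF_8))) {
          while (rows.hasNext()) {
            String[] row = rows.next();
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < row.length; i++) {
              if (i > 0) line.append(',');
              // Escape embedded line breaks so the record fits on one line;
              // a production version would also escape literal backslashes
              // so the transformation stays reversible after sorting.
              String v = row[i].replace("\r", "")
                               .replace("\n", "\\n")
                               .replace("\"", "\"\"");
              line.append('"').append(v).append('"');
            }
            w.write(line.append('\n').toString());
          }
        }
      }
    }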

There are some relative performance differences between CSV parsing libraries, so we could also look into parsers other than SuperCSV. My personal experience is with jackson-dataformat-csv, which was a close second or third fastest according to the 2014 uniVocity benchmarks, with SuperCSV not far behind: https://github.com/uniVocity/csv-parsers-comparison

I haven't noticed any performance issues stream-processing even very large CSV files with jackson-dataformat-csv using List<String> as its data structure.

My code for using Jackson for CSV parsing (at least in my simple cases) is available at https://github.com/ansell/csvstream/blob/master/src/main/java/com/github/ansell/csv/stream/CSVStream.java. In contrast to iterator-based code, that code is based on lambdas: users send in the data operations they want to perform as lambdas, and the skeleton code for iteration only needs to be coded and tested in one place. I have used that lambda code in my Darwin Core Archive utility, which I wrote before knowing about this library: https://github.com/ansell/dwca-utils/tree/master/src/main/java/com/github/ansell/dwca. It doesn't have the features that this library has, as it is mostly used to generate metadata.xml files, check metadata.xml files for correctness, and generate some statistics using my csvsum library so I can visually verify data quality and debug load issues: https://github.com/ansell/csvsum/blob/master/src/main/java/com/github/ansell/csv/sum/CSVSummariser.java
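
To illustrate the lambda style (this is not the actual CSVStream API, whose signatures may differ, just a hedged sketch of the pattern):

    // Sketch of a lambda-based CSV streaming skeleton: the iteration logic
    // lives in one method, and callers pass in the per-row operation.
    import com.fasterxml.jackson.databind.MappingIterator;
    import com.fasterxml.jackson.dataformat.csv.CsvMapper;
    import com.fasterxml.jackson.dataformat.csv.CsvParser;

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.Arrays;
    import java.util.List;
    import java.util.function.Consumer;

    public class LambdaCsv {
      public static void forEachRow(Reader in, Consumer<List<String>> rowHandler)
          throws IOException {
        CsvMapper mapper = new CsvMapper();
        mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
        try (MappingIterator<String[]> rows =
                 mapper.readerFor(String[].class).readValues(in)) {
          while (rows.hasNext()) {
            rowHandler.accept(Arrays.asList(rows.next()));
          }
        }
      }

      public static void main(String[] args) throws IOException {
        // Callers supply only the operation, not the iteration boilerplate.
        forEachRow(new StringReader("\"1\",\"b\nzz\",\"ccc\"\n\"2\",\"ddd\",\"eee\"\n"),
                   row -> System.out.println(row));
      }
    }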

Rather than having me tweak the algorithm here immediately, I will do some prototype work on a branch in ansell/dwca-utils to see what an equivalent Iterator<StarRecord> (or lambda-based) solution might look like and what performance penalties I find in my testing.

cgendreau commented 7 years ago

"Then, to be consistent, that could be followed by a new parse and rewrite phase to regenerate the sorted file using a standard profile such as defined RFC4180 Section 2."

I was not really planning to rewrite the files after sorting, mostly to make it easier for others downstream in the process. I wouldn't mind too much, but the pre-processing is starting to be very intensive.

What I had in mind was something along these lines (only if StarRecord is required):

As for the CSV parsing library, we need to check, but in the short term the Jackson libraries might cause issues in our stack; it would be possible to 'shade' them though.

[1] GBIF data-quality-related code is mostly here: https://github.com/gbif/gbif-data-validator

ansell commented 7 years ago

Jackson 1 and 2 are completely separate by design (both the Java package names and the Maven group/artifact IDs) and can be on the classpath together:

https://github.com/FasterXML/jackson/wiki/Jackson-Releases
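
For illustration, both generations can coexist in one JVM because their package names are disjoint (assuming both dependencies are on the classpath):

    // Jackson 1.x lives under org.codehaus.jackson, Jackson 2.x under
    // com.fasterxml.jackson.*, so the two never collide.
    import com.fasterxml.jackson.databind.ObjectMapper; // Jackson 2.x

    public class JacksonCoexistence {
      public static void main(String[] args) throws Exception {
        org.codehaus.jackson.map.ObjectMapper v1 =      // Jackson 1.x
            new org.codehaus.jackson.map.ObjectMapper();
        ObjectMapper v2 = new ObjectMapper();
        System.out.println(v1.writeValueAsString(new int[]{1, 2}));
        System.out.println(v2.writeValueAsString(new int[]{1, 2}));
      }
    }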

cgendreau commented 7 years ago

I tried a Jackson-based TabularDataFileReader on that branch https://github.com/gbif/gbif-common/tree/jackson-csv

ansell commented 7 years ago

The Jackson implementation doesn't look too difficult to maintain. However, in my experimentation I have not been able to reconfigure it to replace LF with "\n" or another sequence to be compatible with line-by-line UNIX tools.

The javadoc seems to indicate that the escape character is only used by the parser:

 * <li>escapeChar (int) [default: -1 meaning "none"]: character, if any, used to
 *   escape values. Most commonly defined as backslash ('\'). Only used by parser;
 *   generator only uses quoting, including doubling up of quotes to indicate quote char
 *   itself.

https://github.com/FasterXML/jackson-dataformat-csv/blob/master/src/main/java/com/fasterxml/jackson/dataformat/csv/CsvSchema.java#L34
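
A short sketch of the limitation (CsvSchema and withEscapeChar are real API; the behaviour shown is what the javadoc above describes): even with an escape character configured, the generator still emits a real line feed inside the quoted field.

    import com.fasterxml.jackson.dataformat.csv.CsvMapper;
    import com.fasterxml.jackson.dataformat.csv.CsvSchema;

    public class EscapeCharDemo {
      public static void main(String[] args) throws Exception {
        CsvMapper mapper = new CsvMapper();
        // escapeChar is honoured by the parser only, per the javadoc above.
        CsvSchema schema = CsvSchema.emptySchema().withEscapeChar('\\');
        String out = mapper.writer(schema)
            .writeValueAsString(new String[]{"a", "b\nc"});
        // "b\nc" is quoted, but the LF itself is written unescaped.
        System.out.print(out);
      }
    }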

The Jackson maintainers are open to new features in general, so it is possible that we could get this accepted as a non-default feature and still maintain compatibility with current users.

cgendreau commented 7 years ago

Test implementation on https://github.com/gbif/dwca-io/tree/new-csv-handling

Requires gbif-common (jackson-csv branch) to be installed locally: https://github.com/gbif/gbif-common/tree/jackson-csv

Entry point: https://github.com/gbif/dwca-io/blob/new-csv-handling/src/main/java/org/gbif/dwc/DwcFiles.java
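
For reference, a hedged sketch of what consuming the new entry point might look like; fromLocation and the package names are assumptions based on the linked class and may differ on the branch:

    // Hypothetical usage of the new DwcFiles entry point; method and
    // package names are assumptions and may differ on the branch.
    import org.gbif.dwc.DwcFiles;
    import org.gbif.dwca.io.Archive;
    import org.gbif.dwca.record.StarRecord;

    import java.nio.file.Paths;

    public class DwcFilesDemo {
      public static void main(String[] args) throws Exception {
        Archive archive = DwcFiles.fromLocation(Paths.get("/path/to/dwca"));
        // Iterates StarRecords: one core record plus its extension records.
        for (StarRecord star : archive) {
          System.out.println(star.core().id());
        }
      }
    }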

ansell commented 7 years ago

That should work for me as far as I can tell, although I haven't integrated it with biocache-store yet.

I added a unit test of the specific case that is causing it to fail right now. You can find that commit on the ALA branch at:

https://github.com/gbif/dwca-io/compare/new-csv-handling...AtlasOfLivingAustralia:new-csv-handling-ansell

I also merged master into that branch, but the key commit is:

https://github.com/AtlasOfLivingAustralia/dwca-io/commit/a4ba6bc2341c6cc0265cfc45d55c981285bf4c5a

I also rediscovered an issue with some default values from the Darwin Core Text Guide not being used in ArchiveFile, and opened issue #29 to track that.

ansell commented 7 years ago

I have been working through the cases where the ALA data quality analysis has either RECORD_NOT_UNIQUELY_IDENTIFIED or COLUMN_MISMATCH results for a few data resources, and I cannot replicate any of those results so far (I haven't completed the review yet, though). I have a hypothesis that this issue may be the cause for the data resources I have looked through so far, and it may be useful to do another run of the data-quality-validator after integrating this fix to compare the results.

Is it possible for the fixes for this issue to be integrated before #29, which may take some time to get a response from TDWG now that it has been referred there?

The workaround for #29 in the ALA case has just been to always insert the defaults even when they match the spec values, so it isn't as troublesome as this issue.

cgendreau commented 7 years ago

I was not planning to wait (after https://github.com/tdwg/dwc/issues/140), considering it will now follow the XSD.

My main blocker was on gbif-common, which this last commit solves: https://github.com/gbif/gbif-common/commit/4424ccbbfd7566bbd16e01e7e03539662aab6052

cgendreau commented 7 years ago

Fixed by dwca-io-1.31.