Closed cgendreau closed 7 years ago
In the meantime (in case of emergency) there is now a TabularFileNormalizer in gbif-common https://github.com/gbif/gbif-common/commit/c9f9cd65a0fe1d0e5a6dc419b2a50842c2833693 (TabularFileNormalizer was part of the debugging of this issue and should not be considered as a solution for this issue)
In case of StarRecord
, it seems we won't really have the choice to transform the extension(s) file (e.g. with TabularFileNormalizer
) since this feature relies on sorted extension (files) and the sort
command works with lines.
I wasn't aware that StarRecord required Darwin Core Archive extension files to be sorted. My current issue is with extension files and quoted fields with new lines inside them, (RFC4180 as you mention above allows that). I will look through the codepaths in biocache-store and here a little more to see what could be done. May end up using the normalizer to get around it though.
I think I have found the code that relies on the sorting:
Is it possible to instead sort the files on disk rather than expecting users to sort files before putting them into Darwin Core Archives? It seems like an unusual requirement to expect users to know this before hand, particularly given that the spec doesn't seem to mention it at this point in time:
https://tdwg.github.io/dwc/terms/guides/text
ALA are currently using the StarRecord
interface through ArchiveFactory.openArchive(File)
-> Archive.next()
. I would like to clear the sorting issue up to make sure that image/sound loads that we do are not corrupted by mistakes or slightly unusual practices on a client end. In many cases I already clean up various line-by-line issues. However, this type of cleanup would require a bulk sort, which is impractical to be doing in an ad-hoc manner for each cleanup operation if ignoreHeaderLines != 0
as it is for almost all of the data files I encounter.
Sorry, I was confused by the method names and didn't notice where the sorting is occurring. My confusion was caused by the method named buildSortedIterator
which doesn't actually sort anything itself, and I wasn't expecting to have to traceback to the ArchiveIterator
to find the previous call to the sort command.
I have been thinking through the implications of the sorting process on the optimised streaming of core+extensions to users using a single StarRecord
iterator.
There are two overall goals as I see it:
One method may be to ensure that the on-disk sort phase is known not to be dealing with line feed characters, using a rewritten file with line feed escapes for the sort. Then, to be consistent, that could be followed by a new parse and rewrite phase to regenerate the sorted file using a standard profile such as defined RFC4180 Section 2. This may also however require that unsorted fields also go through the standardisation process and also hence have a temp file created in every case. Given the potential simplifications of knowing files used for iteration are using a standard CSV profile, it may be worth the IO time (should not be limited by CPU resources) to do that for every file.
There are some relative performance differences between CSV parsing libraries, so we could also look into alternative parsers other than SuperCSV. My personal experience is with jackson-dataformat-csv, which is a close second or third fastest according to the 2014 Univocity benchmarks, with SuperCSV not far behind https://github.com/uniVocity/csv-parsers-comparison
I haven't noticed any performance issues stream processing even very large CSV files with jackson-dataformat-csv using List<String>
as its data structure.
My code for using Jackson for CSV parsing (at least in my simple cases) is available at: https://github.com/ansell/csvstream/blob/master/src/main/java/com/github/ansell/csv/stream/CSVStream.java By comparison to iterator based code, that code is based on lambda's, so users send in the data operations they want to perform as lambda's, and the skeleton code for iteration only needs to be coded and tested in one place. I have used that lambda code in my darwin core archive utility, which I wrote before knowing about this library https://github.com/ansell/dwca-utils/tree/master/src/main/java/com/github/ansell/dwca . It doesn't have the features that this library has, as it is mostly used to generate metadata.xml files, check metadata.xml files for correctness and to generate some statistics using my csvsum library so I can visually verify data quality and debug load issues: https://github.com/ansell/csvsum/blob/master/src/main/java/com/github/ansell/csv/sum/CSVSummariser.java
Rather than having me tweak the algorithm here immediately, I will do some prototype work on a branch in ansell/dwca-utils to see what an equivalent Iterator<StarRecord>
(or lambda) solution may look like and what performance penalties I find in my testing.
"Then, to be consistent, that could be followed by a new parse and rewrite phase to regenerate the sorted file using a standard profile such as defined RFC4180 Section 2."
I was not really planing to rewrite the files after sorting mostly to make it easier for others downstream in the process. I wouldn't mind too much but the pre-process is starting to be very intensive.
What I had in mind was something around those lines (only if StarRecord
is required) :
StartRecord
As for the CSV parsing library we need to check but short term jackson
libraries might cause issues in our stack but it would be possible to 'shade' it.
[1] GBIF data quality related code is mostly here: https://github.com/gbif/gbif-data-validator
Jackson 1 and 2 are completely separate by design (both Java package names and Maven Artifact/GroupId's) and can be on the classpath together:
I tried a Jackson-based TabularDataFileReader
on that branch
https://github.com/gbif/gbif-common/tree/jackson-csv
The Jackson implementation doesn't look too difficult to maintain. However, I have not been able to reconfigure it in my experimentation to replace LF with "\n" or another sequence to be compatible with line-by-line UNIX tools.
The javadoc seems to indicate that the escape character is only used by the parser:
* <li>escapeChar (int) [default: -1 meaning "none"]: character, if any, used to
* escape values. Most commonly defined as backslash ('\'). Only used by parser;
* generator only uses quoting, including doubling up of quotes to indicate quote char
* itself.
The Jackson maintainers are open to new features in general so it is possible that we could get it accepted as a non-default feature and still maintain compatibility with current users.
Test implementation on https://github.com/gbif/dwca-io/tree/new-csv-handling
Requires gbif-common (jackson-csv branch) to be installed locally: https://github.com/gbif/gbif-common/tree/jackson-csv
Entry point: https://github.com/gbif/dwca-io/blob/new-csv-handling/src/main/java/org/gbif/dwc/DwcFiles.java
That should work for me as far as I can tell, although I haven't integrated it with biocache-store yet.
I added a unit test of the specific case that is causing it to fail right now. You can find that commit on the ALA branch at:
I also merged master into that branch, but the key commit is:
https://github.com/AtlasOfLivingAustralia/dwca-io/commit/a4ba6bc2341c6cc0265cfc45d55c981285bf4c5a
I also rediscovered an issue with some default values from Darwin Core Text Guide not being used in ArchiveFile, and opened issue #29 to track that.
I have been working through the cases where the ALA data quality analysis has either RECORD_NOT_UNIQUELY_IDENTIFIED
or COLUMN_MISMATCH
results for a few data resources and I cannot replicate any of those results so far (haven't completed the review yet though). I have an hypothesis that this may be the cause for the data resources that I have looked through so far and it may be useful to do another run of the data-quality-validator after integrating this fix to compare the results.
Is it possible for the fixes for this issue be integrated before #29 which may take some time to get a response from TDWG now that it has been referred there?
The workaround for #29 in the ALA case has just been to always insert the defaults even if they match the spec values so it isn't as troublesome as this issue.
I was not planning to wait (after https://github.com/tdwg/dwc/issues/140) considering it will now follow the XSD.
My main blocker was on gbif-common which this last commit solves: https://github.com/gbif/gbif-common/commit/4424ccbbfd7566bbd16e01e7e03539662aab6052
fixed by dwca-io-1.31
Previously reported in https://github.com/gbif/gbif-common/issues/1
As per RFC 4180:
The following example represents 2 "lines" but 1 record:
This is not currently supported and causes issue with the StartRecords. In the example above:
zz
will be handled as a value of column 1 (probably the identifier)zz
.2
), no extension records will be foundThe permanent fix is to replace the underlying CSV reader with TabularDataFileReader.
Note Until archives are transferred into another format (e.g. Parquet), having 1 record over multiple lines can introduce issues with other tools currently used (e.g. split command).