AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

GBIF dwca-io 1.31 to 2.3 #303

Closed. reupost closed this 5 years ago

reupost commented 5 years ago

This fixes the issue where line breaks within fields in DwC-A files were being interpreted as the start of new records.
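To illustrate the kind of failure described above, here is a minimal sketch (not the dwca-io implementation) that contrasts naive line splitting with a quote-aware split. The class name, method names, and sample data are all hypothetical, and the sketch assumes fields are enclosed in double quotes.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedNewlineDemo {

    /** Naive approach: every line break is treated as the start of a new record. */
    static List<String> splitNaive(String csv) {
        List<String> records = new ArrayList<>();
        for (String line : csv.split("\r?\n")) {
            if (!line.isEmpty()) {
                records.add(line);
            }
        }
        return records;
    }

    /** Quote-aware approach: line breaks inside double-quoted fields stay in the record. */
    static List<String> splitQuoteAware(String csv) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state ends up unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // End of record: drop a trailing carriage return, then flush.
                int len = current.length();
                if (len > 0 && current.charAt(len - 1) == '\r') {
                    current.setLength(len - 1);
                }
                if (current.length() > 0) {
                    records.add(current.toString());
                }
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a header row plus two records, one with an embedded line break.
        String csv = "id,locality\n1,\"Mount Field,\nTasmania\"\n2,Hobart\n";
        System.out.println("naive rows:       " + splitNaive(csv).size());
        System.out.println("quote-aware rows: " + splitQuoteAware(csv).size());
    }
}
```

With this sample the naive split reports one row more than the quote-aware split, because the embedded line break in the quoted locality is counted as a record boundary.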

ansell commented 5 years ago

@reupost thanks! I will test and integrate this today

reupost commented 5 years ago

Thanks @ansell, and sorry about the missed "openArchive" reference. My clumsiness with git meant I was making these changes manually, and I must have missed that one. One thing I've found with further testing on our side is that the new version seems to be a little fussier about character encodings than the old one. It works fine with UTF-8, but fails on one of our archives containing an occurrences.csv file that is 'Western (Mac OS Roman)'-encoded. The older build handles that file without issue. Possibly there are other encodings that are no longer tolerated. I'll do some more investigating.

ansell commented 5 years ago

I fixed the nexus.ala.org.au to repository.gbif.org linkage and the Travis build succeeded after that.

If you find that the encoding issue is a biocache-store bug, create an issue here; otherwise it is possibly a dwca-io issue. I worked on the new dwca-io routines, which use jackson-csv, but I don't remember doing encoding tests at the time.

reupost commented 5 years ago

From Matt @ GBIF: "It might be worth warning ALA that we don't have a single dataset that isn't UTF-8 (!), so we might not have found many issues around encodings. [also] dwca-io only accepts Java charset names, which in this case are different (MacRoman vs. Macintosh) from the IANA names."
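The name mismatch in the quote can be sketched with the JDK's own `java.nio.charset.Charset` lookup. This is only an illustration, not dwca-io code: the `resolveCharset` helper and the alias table are hypothetical, and it assumes, as the quote suggests, that the JVM may not recognise the IANA name `macintosh` while it does provide `x-MacRoman` (alias `MacRoman`), which depends on the JDK's extended charsets being installed.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;
import java.util.Map;

public class CharsetNameDemo {

    // Hypothetical alias table: IANA name (lower case) -> Java charset name.
    static final Map<String, String> EXTRA_ALIASES = Map.of("macintosh", "x-MacRoman");

    /**
     * Resolves a charset name, trying the JVM's own lookup first and falling
     * back to the alias table above. Rethrows the original exception if
     * neither lookup succeeds.
     */
    static Charset resolveCharset(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            String mapped = EXTRA_ALIASES.get(name.toLowerCase(Locale.ROOT));
            if (mapped != null && Charset.isSupported(mapped)) {
                return Charset.forName(mapped);
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        // Results depend on which charsets this particular JVM ships with.
        for (String name : new String[] {"UTF-8", "MacRoman", "macintosh"}) {
            System.out.println(name + " supported directly: " + Charset.isSupported(name));
        }
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println("macintosh resolves to: " + resolveCharset("macintosh").name());
        }
    }
}
```

Normalising the declared encoding this way before handing it to the reader would be one option; re-saving the affected file as UTF-8 is another.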