AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

DwCACreator tool uses a non-standard CSV parser and fails #343

Closed ansell closed 4 years ago

ansell commented 4 years ago

The DwCACreator fails to parse the output of the image-service export.csv service. That output is syntactically valid CSV, apart from two bugs already identified in the image-service tracker: incorrect URLs and a duplicated occurrenceID field.

aws-bstore-1b 2019-09-20 12:47:09,290 INFO : [Cassandra3PersistenceManager] - All threads have completed paging
aws-bstore-1b 2019-09-20 12:47:09,831 INFO : [DwCACreator] - Downloading images archive extract....
aws-bstore-1b 2019-09-20 12:48:06,338 INFO : [DwCACreator] - Downloaded images archive extract to /data/tmp/images-export/images-export.csv.gz
aws-bstore-1b 2019-09-20 12:48:06,338 INFO : [DwCACreator] - Extracting Gzip....
aws-bstore-1b 2019-09-20 12:48:13,147 INFO : [DwCACreator] - Splitting into separate files to..../data/tmp/images-export/split

aws-bstore-1b 2019-09-20 13:28:47,017 ERROR: [DwCACreator] - Unterminated quoted field at end of CSV line. Beginning of lost text: [ Henry Cook-3973.jpg,,jpg,CC BY-NC-SA 3.0,,43eb346f65da49215544390d4774312b,369431,1624,1080,5,e8...]
java.io.IOException: Unterminated quoted field at end of CSV line. Beginning of lost text: [ Henry Cook-3973.jpg,,jpg,CC BY-NC-SA 3.0,,43eb346f65da49215544390d4774312b,369431,1624,1080,5,e8...]
    at com.opencsv.CSVReader.readNext(CSVReader.java:334)
    at au.org.ala.biocache.export.DwCACreator.addImageExportsToArchives(DwCACreator.scala:450)
    at au.org.ala.biocache.export.DwCACreator$.main(DwCACreator.scala:178)
    at au.org.ala.biocache.cmd.CMD2$.main(CMD2.scala:130)
    at au.org.ala.biocache.cmd.CMD2.main(CMD2.scala)
Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Unterminated quoted field at end of CSV line. Beginning of lost text: [ Henry Cook-3973.jpg,,jpg,CC BY-NC-SA 3.0,,43eb346f65da49215544390d4774312b,369431,1624,1080,5,e8...]
    at au.org.ala.biocache.export.DwCACreator$.main(DwCACreator.scala:182)
    at au.org.ala.biocache.cmd.CMD2$.main(CMD2.scala:130)
    at au.org.ala.biocache.cmd.CMD2.main(CMD2.scala)
Caused by: java.io.IOException: Unterminated quoted field at end of CSV line. Beginning of lost text: [ Henry Cook-3973.jpg,,jpg,CC BY-NC-SA 3.0,,43eb346f65da49215544390d4774312b,369431,1624,1080,5,e8...]
    at com.opencsv.CSVReader.readNext(CSVReader.java:334)
    at au.org.ala.biocache.export.DwCACreator.addImageExportsToArchives(DwCACreator.scala:450)
    at au.org.ala.biocache.export.DwCACreator$.main(DwCACreator.scala:178)
    ... 2 more

Parsing it with a standards-compliant CSV parser succeeds, so the issue is either in the configuration or the operation of the opencsv parser.
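One possible explanation, offered as an assumption since the thread does not identify the offending input: opencsv's default parser treats a backslash as an escape character, which RFC 4180 does not. A field that ends in a backslash just before its closing quote is then read as containing an escaped quote, and the field never terminates. The sketch below reproduces that difference with Python's standard `csv` module as a stand-in for the two parsing modes. The sample row and the function names are hypothetical and are not taken from the export file or from biocache-store.

```python
import csv

# Hypothetical row (not from the real export): the second field ends with a
# backslash immediately before its closing quote. Under RFC 4180 a backslash
# is an ordinary character, so this line is valid CSV.
LINE = '1,"ends with a backslash\\",jpg\n'


def parse_rfc4180(line):
    """Parse one line where only a doubled quote ("") escapes a quote."""
    return next(csv.reader([line], strict=True))


def parse_backslash_escape(line):
    """Parse one line where a backslash is also treated as an escape character."""
    return next(csv.reader([line], escapechar="\\", strict=True))


# RFC 4180 style: the backslash is kept as data and the row has three fields.
print(parse_rfc4180(LINE))

# Backslash-escape style: the closing quote is consumed as an escaped quote,
# so the quoted field is still open when the input ends.
try:
    parse_backslash_escape(LINE)
except csv.Error as err:
    print("backslash-escape parse failed:", err)
```

If this is the cause, the opencsv-side remedy would be to build the reader with its RFC 4180 parser (`RFC4180Parser`, supplied through `CSVReaderBuilder.withCSVParser`) rather than the default one. Whether the actual fix took that route is not stated in this thread.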

djtfmartin commented 4 years ago

This is now fixed, but the copy of the archives fails for some reason. I suspect it's bastion-related.

ansell commented 4 years ago

I allowed access via the internal AWS IP address, and the copy appears to have worked.