Closed pgasu closed 2 years ago
I ran the same operations for version 68 and got a couple more errors (shown below). Note that if I don't use 'parallel' with dwc-stream, the script hits the first error, the 'dwc-stream' operation stops, and the remaining commands in the pipeline still run. So maybe we need to modify 'dwc-stream' so that when it encounters an error or a corrupted file, it just prints a warning/error, skips that file, and continues with the rest of the datasets.
time preston history --log tsv --remote https://deeplinker.bio | head -n68 | tail -n1 | cut -f3 |\
preston cat --remote https://deeplinker.bio |\
parallel --pipe preston dwc-stream --remote https://deeplinker.bio |\
jq --raw-output '.["http://rs.tdwg.org/dwc/terms/scientificName"] + "," +
.["http://rs.tdwg.org/dwc/terms/taxonRank"] + "," + .["http://rs.tdwg.org/dwc/terms/class"] + "," +
.["rowType"] + "," + .["contentId"]' | \
sort | uniq > raw_68.txt
java.lang.IllegalStateException: java.text.ParseException: Unexpected character ('.' (code 46)): Expected column separator character (',' (code 44)) or end-of-line
at [Source: (BufferedReader); line: 114943, column: 489]
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:85)
at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:73)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:64)
at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:50)
at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:45)
at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:34)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:55)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:29)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.text.ParseException: Unexpected character ('.' (code 46)): Expected column separator character (',' (code 44)) or end-of-line
at [Source: (BufferedReader); line: 114943, column: 489]
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
... 16 more
java.lang.IllegalStateException: java.text.ParseException: Unexpected character ('.' (code 46)): Expected column separator character (',' (code 44)) or end-of-line
at [Source: (BufferedReader); line: 114943, column: 489]
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:85)
at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:73)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:64)
at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:50)
at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:45)
at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:34)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:55)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:29)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.text.ParseException: Unexpected character ('.' (code 46)): Expected column separator character (',' (code 44)) or end-of-line
at [Source: (BufferedReader); line: 114943, column: 489]
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
... 16 more
[main] ERROR bio.guoda.preston.cmd.CmdLine - unexpected exception
java.lang.RuntimeException: Input length = 1
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator._handleIOException(MappingIterator.java:417)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:203)
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:109)
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:73)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:64)
at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:50)
at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:45)
at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:34)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:55)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:29)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read1(BufferedReader.java:212)
at java.io.BufferedReader.read(BufferedReader.java:286)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.loadMore(CsvDecoder.java:443)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder._nextQuotedString(CsvDecoder.java:813)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.nextString(CsvDecoder.java:659)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser._handleUnnamedValue(CsvParser.java:903)
java.lang.RuntimeException: Input length = 1
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:620)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:281)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:249)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:26)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:280)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:199)
... 19 more
java.lang.RuntimeException: Input length = 1
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator._handleIOException(MappingIterator.java:417)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:203)
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:109)
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:73)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:64)
at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:50)
at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:45)
at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:34)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:55)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:29)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read1(BufferedReader.java:212)
at java.io.BufferedReader.read(BufferedReader.java:286)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.loadMore(CsvDecoder.java:443)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder._nextQuotedString(CsvDecoder.java:813)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.nextString(CsvDecoder.java:659)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser._handleUnnamedValue(CsvParser.java:903)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:620)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:281)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:249)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:26)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:280)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:199)
... 19 more
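Until 'dwc-stream' itself skips bad archives, the skip-and-continue behaviour requested above can be roughly approximated from the shell. The sketch below is only an illustration: `skip_on_error` is a hypothetical helper, not part of Preston, and it assumes that 'dwc-stream' accepts any subset of the provenance lines on stdin (which the `parallel --pipe` usage above suggests) and exits with a non-zero status when it fails.

```shell
# Hypothetical helper, not part of Preston: run the given command once per
# input line and keep going when the command fails on a line.
skip_on_error() {
  while IFS= read -r line; do
    # Feed exactly one line to the command passed as arguments.
    if ! printf '%s\n' "$line" | "$@"; then
      # Report the offending line on stderr so stdout stays usable downstream.
      printf 'warning: skipped [%s]\n' "$line" >&2
    fi
  done
}
```

It would be used in place of the `parallel --pipe` stage, e.g. `... | preston cat --remote https://deeplinker.bio | skip_on_error preston dwc-stream --remote https://deeplinker.bio | jq ...`. Starting one process per line is slow, so this is a stopgap rather than a fix.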
@pgasu I was able to reproduce the issue you found. To help identify the suspicious resources, I added some logging to the upcoming Preston release. Note that the error messages below now include the hash URI (or content id). Now that we know the content ids, the exact provenance of the suspicious resources can be derived more easily.
[main] WARN bio.guoda.preston.process.DwcRecordExtractor - suspicious DwC resource [hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89] caused errors in processing
bio.guoda.preston.stream.ContentStreamException: failed to handle dwc records from [zip:hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89!/occurrence.txt]
at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:87)
at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:55)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:62)
at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:47)
at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:55)
at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:44)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:52)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:27)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.lang.RuntimeException: Truncated ZIP file
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator._handleIOException(MappingIterator.java:417)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:203)
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:109)
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:83)
... 16 more
Caused by: java.io.IOException: Truncated ZIP file
at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readDeflated(ZipArchiveInputStream.java:636)
at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:528)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read1(BufferedReader.java:212)
at java.io.BufferedReader.read(BufferedReader.java:286)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.loadMore(CsvDecoder.java:443)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder._nextUnquotedString(CsvDecoder.java:764)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.nextString(CsvDecoder.java:714)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser._handleUnnamedValue(CsvParser.java:903)
at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:620)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:281)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:249)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:26)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:280)
at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:199)
... 20 more
[main] WARN bio.guoda.preston.process.DwcRecordExtractor - suspicious DwC resource [hash://sha256/9e19b4069017c64a31d693fc5cf296e264c40aadaa475c35e36aea0e0348e6fb] caused errors in processing
bio.guoda.preston.stream.ContentStreamException: failed to handle dwc records from [zip:hash://sha256/9e19b4069017c64a31d693fc5cf296e264c40aadaa475c35e36aea0e0348e6fb!/DarwinCore.txt]
at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:87)
at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:55)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:62)
at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:47)
at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:55)
at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:44)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:52)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:27)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.lang.IllegalStateException: java.text.ParseException: Unexpected character ('V' (code 86)): Expected column separator character (',' (code 44)) or end-of-line
at [Source: (BufferedReader); line: 859, column: 193]
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:85)
at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:83)
... 16 more
Caused by: java.text.ParseException: Unexpected character ('V' (code 86)): Expected column separator character (',' (code 44)) or end-of-line
at [Source: (BufferedReader); line: 859, column: 193]
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
... 17 more
[main] WARN bio.guoda.preston.process.DwcRecordExtractor - suspicious DwC resource [hash://sha256/db5cd1d7c8738c6f5ad241878a4fdc9929b719e4261ef9e1b72bbe8275e6eb92] caused errors in processing
bio.guoda.preston.stream.ContentStreamException: failed to handle dwc records from [zip:hash://sha256/db5cd1d7c8738c6f5ad241878a4fdc9929b719e4261ef9e1b72bbe8275e6eb92!/DarwinCore.txt]
at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:87)
at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:55)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:62)
at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:47)
at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:55)
at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:44)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:52)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:27)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.lang.IllegalStateException: java.text.ParseException: Unexpected character ('P' (code 80)): Expected column separator character (',' (code 44)) or end-of-line
at [Source: (BufferedReader); line: 1961, column: 299]
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:85)
at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:83)
... 16 more
Caused by: java.text.ParseException: Unexpected character ('P' (code 80)): Expected column separator character (',' (code 44)) or end-of-line
at [Source: (BufferedReader); line: 1961, column: 299]
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
... 17 more
[main] WARN bio.guoda.preston.process.DwcRecordExtractor - suspicious DwC resource [hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c] caused errors in processing
bio.guoda.preston.stream.ContentStreamException: failed to handle dwc records from [zip:hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c!/occurrence.txt]
at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:87)
at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:55)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:62)
at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:47)
at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:55)
at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:44)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:52)
at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:27)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.lang.IllegalStateException: java.text.ParseException: Unexpected character ('"' (code 34)): Expected column separator character (',' (code 44)) or end-of-line
at [Source: (BufferedReader); line: 2, column: 24]
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:85)
at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:83)
... 16 more
Caused by: java.text.ParseException: Unexpected character ('"' (code 34)): Expected column separator character (',' (code 44)) or end-of-line
at [Source: (BufferedReader); line: 2, column: 24]
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
... 17 more
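With the WARN lines in this format, the affected content ids can be pulled out of a saved log instead of being read off the stack traces by hand. A minimal sketch, assuming the stderr of 'dwc-stream' was saved to a file (the file name in the usage note is made up) and that `grep -o` is available; `suspicious_ids` is a hypothetical helper name:

```shell
# Print each content id reported as a "suspicious DwC resource", once.
suspicious_ids() {
  # Keep only the WARN lines, then extract the sha256 hash URIs from them.
  grep 'suspicious DwC resource' \
    | grep -o 'hash://sha256/[0-9a-f]\{64\}' \
    | sort -u
}
```

For example: `suspicious_ids < dwc-stream.log`.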
The following content ids caused the issues seen:
hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89
was retrieved from http://ipt-inpn.gbif.fr/archive.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021 on 2019-09-01T11:10:04.772Z .
from provenance:
$ preston history --log tsv | head -n38 | tail -n1 | cut -f3 | preston cat | grep --after 10 --before 10 hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89
<hash://sha256/85997b54f8ad86e21b3b15ed392cd30280c0965feb2a63b467641f22b3c51db4> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/85997b54f8ad86e21b3b15ed392cd30280c0965feb2a63b467641f22b3c51db4> <http://www.w3.org/ns/prov#qualifiedGeneration> <c0d6631a-1ec6-4bb1-a429-077791e63d47> .
<c0d6631a-1ec6-4bb1-a429-077791e63d47> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<c0d6631a-1ec6-4bb1-a429-077791e63d47> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<c0d6631a-1ec6-4bb1-a429-077791e63d47> <http://www.w3.org/ns/prov#used> <http://ipt-inpn.gbif.fr/eml.do?r=IPT-6386AA2E-7597-5FE3-E053-2614A8C00573> .
<http://ipt-inpn.gbif.fr/eml.do?r=IPT-6386AA2E-7597-5FE3-E053-2614A8C00573> <http://purl.org/pav/hasVersion> <hash://sha256/85997b54f8ad86e21b3b15ed392cd30280c0965feb2a63b467641f22b3c51db4> .
<hash://sha256/14afc84936c197ab0ce367dc20845aef9f85ba9c7c687f3ec90b8e6306de706c> <http://www.w3.org/ns/prov#hadMember> <66551429-5e35-4848-94f5-b13bcad669a5> .
<66551429-5e35-4848-94f5-b13bcad669a5> <http://www.w3.org/1999/02/22-rdf-syntax-ns#seeAlso> <https://doi.org/10.15468/3kmwvz> .
<66551429-5e35-4848-94f5-b13bcad669a5> <http://www.w3.org/ns/prov#hadMember> <http://ipt-inpn.gbif.fr/archive.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> .
<http://ipt-inpn.gbif.fr/archive.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> <http://purl.org/dc/elements/1.1/format> "application/dwca" .
<hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89> <http://www.w3.org/ns/prov#generatedAtTime> "2019-09-01T11:10:04.772Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89> <http://www.w3.org/ns/prov#qualifiedGeneration> <726f1fa3-013d-4aaf-88f1-3b7913ef0b2b> .
<726f1fa3-013d-4aaf-88f1-3b7913ef0b2b> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<726f1fa3-013d-4aaf-88f1-3b7913ef0b2b> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<726f1fa3-013d-4aaf-88f1-3b7913ef0b2b> <http://www.w3.org/ns/prov#used> <http://ipt-inpn.gbif.fr/archive.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> .
<http://ipt-inpn.gbif.fr/archive.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> <http://purl.org/pav/hasVersion> <hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89> .
<66551429-5e35-4848-94f5-b13bcad669a5> <http://www.w3.org/ns/prov#hadMember> <http://ipt-inpn.gbif.fr/eml.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> .
<http://ipt-inpn.gbif.fr/eml.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> <http://purl.org/dc/elements/1.1/format> "application/eml" .
<hash://sha256/587574aa16ade59c5320f403fccf3410cef403a21a24c4f830f336d8a7965484> <http://www.w3.org/ns/prov#generatedAtTime> "2019-09-01T11:10:04.901Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<hash://sha256/587574aa16ade59c5320f403fccf3410cef403a21a24c4f830f336d8a7965484> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/587574aa16ade59c5320f403fccf3410cef403a21a24c4f830f336d8a7965484> <http://www.w3.org/ns/prov#qualifiedGeneration> <cf62d96f-14fd-45bf-8847-5cccecdcb68c> .
<cf62d96f-14fd-45bf-8847-5cccecdcb68c> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<cf62d96f-14fd-45bf-8847-5cccecdcb68c> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<cf62d96f-14fd-45bf-8847-5cccecdcb68c> <http://www.w3.org/ns/prov#used> <http://ipt-inpn.gbif.fr/eml.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> .
<http://ipt-inpn.gbif.fr/eml.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> <http://purl.org/pav/hasVersion> <hash://sha256/587574aa16ade59c5320f403fccf3410cef403a21a24c4f830f336d8a7965484> .
<hash://sha256/14afc84936c197ab0ce367dc20845aef9f85ba9c7c687f3ec90b8e6306de706c> <http://www.w3.org/ns/prov#hadMember> <1fec2829-4ac0-4329-b448-6f768b8d3427> .
(truncated zip) (see attached)
hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c (unexpected character)
was retrieved from https://drive.google.com/uc?export=download&id=0B-BbRNUB6S7aSVdYSjNpbkZsM1E on 2019-09-01T12:49:22.050Z . See also https://www.gbif.org/dataset/62d82928-dc6f-40dc-85b3-f2be47e7b49a and https://doi.org/10.15468/17e8en .
from provenance:
$ preston history --log tsv | head -n38 | tail -n1 | cut -f3 | preston cat | grep --after 10 --before 10 hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c
<hash://sha256/7d4385886aa3e0ebd3533651b5fd9a1d6c06738fd66a87a20ec502aee077bfed> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/7d4385886aa3e0ebd3533651b5fd9a1d6c06738fd66a87a20ec502aee077bfed> <http://www.w3.org/ns/prov#qualifiedGeneration> <c2b5356e-d594-4be4-abbf-eca37a66607e> .
<c2b5356e-d594-4be4-abbf-eca37a66607e> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<c2b5356e-d594-4be4-abbf-eca37a66607e> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<c2b5356e-d594-4be4-abbf-eca37a66607e> <http://www.w3.org/ns/prov#used> <http://tb.plazi.org/GgServer/dwca/FFB3092B7068BC729535FFB9FFA05F36.zip> .
<http://tb.plazi.org/GgServer/dwca/FFB3092B7068BC729535FFB9FFA05F36.zip> <http://purl.org/pav/hasVersion> <hash://sha256/7d4385886aa3e0ebd3533651b5fd9a1d6c06738fd66a87a20ec502aee077bfed> .
<hash://sha256/afdbe7a01bb39c510f0dee1c467541447e388adc7a35dd2d343029ed281f490a> <http://www.w3.org/ns/prov#hadMember> <62d82928-dc6f-40dc-85b3-f2be47e7b49a> .
<62d82928-dc6f-40dc-85b3-f2be47e7b49a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#seeAlso> <https://doi.org/10.15468/17e8en> .
<62d82928-dc6f-40dc-85b3-f2be47e7b49a> <http://www.w3.org/ns/prov#hadMember> <https://drive.google.com/uc?export=download&id=0B-BbRNUB6S7aSVdYSjNpbkZsM1E> .
<https://drive.google.com/uc?export=download&id=0B-BbRNUB6S7aSVdYSjNpbkZsM1E> <http://purl.org/dc/elements/1.1/format> "application/dwca" .
<hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c> <http://www.w3.org/ns/prov#generatedAtTime> "2019-09-01T12:49:22.050Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c> <http://www.w3.org/ns/prov#qualifiedGeneration> <4a553dbd-6d85-4cc4-9d02-db284af50b92> .
<4a553dbd-6d85-4cc4-9d02-db284af50b92> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<4a553dbd-6d85-4cc4-9d02-db284af50b92> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<4a553dbd-6d85-4cc4-9d02-db284af50b92> <http://www.w3.org/ns/prov#used> <https://drive.google.com/uc?export=download&id=0B-BbRNUB6S7aSVdYSjNpbkZsM1E> .
<https://drive.google.com/uc?export=download&id=0B-BbRNUB6S7aSVdYSjNpbkZsM1E> <http://purl.org/pav/hasVersion> <hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c> .
<hash://sha256/afdbe7a01bb39c510f0dee1c467541447e388adc7a35dd2d343029ed281f490a> <http://www.w3.org/ns/prov#hadMember> <09cac3e8-42f4-41cf-a121-488602a8429d> .
<09cac3e8-42f4-41cf-a121-488602a8429d> <http://www.w3.org/1999/02/22-rdf-syntax-ns#seeAlso> <https://doi.org/10.5281/zenodo.1048224> .
<09cac3e8-42f4-41cf-a121-488602a8429d> <http://www.w3.org/ns/prov#hadMember> <http://tb.plazi.org/GgServer/dwca/FFCC8B0CFFA3FF80A43F537CFFC6F93F.zip> .
<http://tb.plazi.org/GgServer/dwca/FFCC8B0CFFA3FF80A43F537CFFC6F93F.zip> <http://purl.org/dc/elements/1.1/format> "application/dwca" .
<hash://sha256/a0cc1850821e1a9914e24fb7d6dff2bd9bb352afc269ed0c67220017cd520d44> <http://www.w3.org/ns/prov#generatedAtTime> "2019-09-01T12:49:22.112Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<hash://sha256/a0cc1850821e1a9914e24fb7d6dff2bd9bb352afc269ed0c67220017cd520d44> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/a0cc1850821e1a9914e24fb7d6dff2bd9bb352afc269ed0c67220017cd520d44> <http://www.w3.org/ns/prov#qualifiedGeneration> <cce9fc0e-9eb5-4919-b351-fb32ff077020> .
<cce9fc0e-9eb5-4919-b351-fb32ff077020> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<cce9fc0e-9eb5-4919-b351-fb32ff077020> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<cce9fc0e-9eb5-4919-b351-fb32ff077020> <http://www.w3.org/ns/prov#used> <http://tb.plazi.org/GgServer/dwca/FFCC8B0CFFA3FF80A43F537CFFC6F93F.zip> .
While validating the suspicious dwca file with content id hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c
we found that the GBIF validator didn't accept the content served via https://deeplinker.bio/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c either
(see https://www.gbif.org/tools/data-validator/1a026a9d-7441-457b-a032-1b212cb6d809).
However, the content does appear to have a valid zip structure:
$ curl "https://deeplinker.bio/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c" > tmp/some.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9694k 100 9694k 0 0 3923k 0 0:00:02 0:00:02 --:--:-- 3923k
$ unzip -l tmp/some.zip
Archive: tmp/some.zip
Length Date Time Name
--------- ---------- ----- ----
6635 2017-03-14 09:53 eml.xml
2583 2017-02-21 08:38 meta.xml
93838675 2017-02-22 17:46 occurrence.txt
--------- -------
93847893 3 files
Looking up the offending line using
$ preston cat --remote https://deeplinker.bio 'line:zip:hash://sha256/9e19b4069017c64a31d693fc5cf296e264c40aadaa475c35e36aea0e0348e6fb!/DarwinCore.txt!/L859'
"39393998","2019-04-18T14:40:14Z","Type of sighting=Sighting of live animal;County=Louth;Vice-county=Louth;Abundance=One;Record comment=The sensor light came on and I looked up from reading "Village" magazine to see this wee beaut walk past.","MammalsOfIreland2016-2025","MammalsOfIreland2016-2025","97209583948267984","Brendan McSherry",,"2017-02-23",,"Mullatee","54.0301013900","-6.1579993300","100","Meles meles","Species","(Linnaeus, 1758)","accepted","Eurasian Badger","ie.nbdc.dataset.MammalsOfIreland2016-2025.97209583948267984"
it appears that the double quotes are not properly escaped (e.g., [...] reading "Village" magazine [...]).
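A rough way to spot such lines without running the full parser is to look for a double quote that is neither next to a comma nor part of a doubled "" escape. This is only a heuristic sketch: it assumes fields are separated by commas and quoted with double quotes as in the record above, and it does not understand quoted fields that span several lines. The function name find_stray_quotes and the file name in the usage comment are made up for illustration.

```shell
# Heuristic: print "lineno:line" for every line that contains a double quote
# which is not adjacent to a comma, a line boundary, or another double quote.
# find_stray_quotes is a hypothetical helper, not part of preston.
find_stray_quotes() {
  # Strip carriage returns first so that a closing quote at the end of a
  # CRLF-terminated line is not reported by mistake.
  tr -d '\r' < "$1" | grep -nE '[^,"]"[^,"]'
}

# Usage (assumes the file was extracted from the archive beforehand):
#   find_stray_quotes DarwinCore.txt
```

Any line it reports still needs a look by eye, since the pattern cannot tell an intended quote from a stray one in every case.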
unzip confirms that hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89 is not a well-formed zip file:
$ preston cat --remote https://deeplinker.bio hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89 > badzip.zip
$ unzip -l badzip.zip
Archive: badzip.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of badzip.zip or
badzip.zip.zip, and cannot find badzip.zip.ZIP, period.
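To script the same check instead of reading the listing by eye, unzip's -t option (test archive) together with -q (quiet) can be used and its exit status inspected. A minimal sketch, assuming unzip is installed and the file has already been saved locally as above; is_readable_zip is a made-up helper name.

```shell
# Return 0 when unzip can read and verify every entry, non-zero otherwise.
# is_readable_zip is a hypothetical helper, not part of preston.
is_readable_zip() {
  unzip -tq "$1" > /dev/null 2>&1
}

if is_readable_zip badzip.zip; then
  echo "badzip.zip looks fine"
else
  echo "badzip.zip is not a readable zip archive" >&2
fi
```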
@pgasu so, it appears that the issues you noticed are actual issues with the integrity of the tracked content.
If you'd like, you can trace the provenance further and contact the owners of the data to resolve the issue.
Also, preston will keep processing an archive until it hits a snag, and then skip to the next archive.
Please advise on how to proceed.
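For illustration only, the "warn, skip, continue" behaviour can be mimicked at the shell level when checking a batch of downloaded archives. This is not how preston implements it internally (that code is Java and not shown here); for_each_skipping_errors and the check function in the usage comment are hypothetical names.

```shell
# Run a check function on every file; on failure, warn on stderr and move on
# to the next file instead of aborting the whole run.
# for_each_skipping_errors is a hypothetical helper, not a preston command.
for_each_skipping_errors() {
  check=$1
  shift
  for f in "$@"; do
    if "$check" "$f"; then
      echo "ok: $f"
    else
      echo "warning: skipping $f" >&2
    fi
  done
}

# Usage (assumes unzip is installed and *.zip files exist locally):
#   zip_ok() { unzip -tq "$1" > /dev/null 2>&1; }
#   for_each_skipping_errors zip_ok *.zip
```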
Note that similarly:
$ preston cat --remote https://deeplinker.bio 'line:zip:hash://sha256/db5cd1d7c8738c6f5ad241878a4fdc9929b719e4261ef9e1b72bbe8275e6eb92!/DarwinCore.txt!/L1961'
[https://deeplinker.bio/d...61ef9e1b72bbe8275e6eb92] 100.0% of 176 kB at 0.51 MB/s completed in < 1 minute
"37647418","2018-07-30T13:40:26Z","Determiner name=Phil Melly;Sampling method=Daytime observation;Record comment=On observation, consulted internet and is satisfied this was indeed a Hummingbird Hawk Moth. Couldn't believe was watching a Hummingbird in Ireland! Feeding that day on Verbena and "Petunia Million Bells".","MothRecordsOfIreland","MothRecordsOfIreland","bc3eac8a-5507-11e7-a40d-00155d018200","Phil Melly",,"2017-05-31",,"Bogganstown Culmullin","53.4746220300","-6.6114130600","100","Macroglossum stellatarum","Species","(Linnaeus, 1758)","accepted","Humming-bird Hawk-moth","ie.nbdc.dataset.MothRecordsOfIreland.bc3eac8a-5507-11e7-a40d-00155d018200"
where unescaped double quotes inside the Darwin Core archive (e.g., [...] Feeding that day on Verbena and "Petunia Million Bells".","Mot [...]) cause the csv parser to complain.
The associated URL was http://gbif.biodiversityireland.ie/MothRecordsOfIreland.zip, the GBIF dataset is https://www.gbif.org/dataset/be6e251d-cdf4-4784-8054-9f4be19ce3d9, and the related eml.xml is:
$ preston cat --remote https://deeplinker.bio 'zip:hash://sha256/db5cd1d7c8738c6f5ad241878a4fdc9929b719e4261ef9e1b72bbe8275e6eb92!/eml.xml'
[https://deeplinker.bio/d...61ef9e1b72bbe8275e6eb92] 100.0% of 176 kB at 0.66 MB/s completed in < 1 minute
<?xml version="1.0" encoding="utf-8"?>
<eml:eml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="eml://ecoinformatics.org/dataset-2.1.1" packageId="ie.nbdc.dataset.MothRecordsOfIreland" system="http://gbif.biodiversityireland.ie/" scope="system" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.1/eml.xsd" xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
<dataset>
<alternateIdentifier>MothRecordsOfIreland</alternateIdentifier>
<title>Moth Records of Ireland</title>
<creator>
<individualName>
<surName>Michael O'Donnell</surName>
</individualName>
<organizationName>Collated by the National Biodiversity Data Centre from different sources</organizationName>
<address>
<country>Ireland</country>
</address>
<electronicMailAddress>micealodonnell@eircom.net</electronicMailAddress>
<onlineUrl>http://www.mothsireland.com/</onlineUrl>
</creator>
<metadataProvider>
<organizationName>National Biodiversity Data Centre, Ireland</organizationName>
<address>
<deliveryPoint>Beechfield house, Carriganore WIT West Campus</deliveryPoint>
<city>Waterford</city>
<administrativeArea>County Waterford</administrativeArea>
<country>Ireland</country>
</address>
<phone>+353 (0)51 306 240</phone>
<electronicMailAddress>info@biodiversityireland.ie</electronicMailAddress>
<onlineUrl>http://www.biodiversityireland.ie/</onlineUrl>
</metadataProvider>
<pubDate>2018-07-30Z</pubDate>
<language>en</language>
<abstract>
<para>Moth records collated from a variety of sources</para>
</abstract>
<additionalInfo>
<para>http://www.mothsireland.com/</para>
</additionalInfo>
<intellectualRights>
<para>This work is licensed under a<ulink url="http://creativecommons.org/licenses/by/4.0/legalcode"><citetitle>Creative Commons Attribution (CC-BY) 4.0 License</citetitle></ulink></para>
</intellectualRights>
<distribution>
<online>
<url function="information">http://maps.biodiversityireland.ie/DataSet/268</url>
</online>
</distribution>
<coverage>
<geographicCoverage>
<geographicDescription>The island of Ireland</geographicDescription>
<boundingCoordinates>
<westBoundingCoordinate>-10.5383265100</westBoundingCoordinate>
<eastBoundingCoordinate>-5.8212730500</eastBoundingCoordinate>
<northBoundingCoordinate>55.3117198200</northBoundingCoordinate>
<southBoundingCoordinate>51.4314347500</southBoundingCoordinate>
</boundingCoordinates>
</geographicCoverage>
<temporalCoverage>
<rangeOfDates>
<beginDate>
<calendarDate>2007</calendarDate>
</beginDate>
<endDate>
<calendarDate>2017</calendarDate>
</endDate>
</rangeOfDates>
</temporalCoverage>
</coverage>
<purpose>
<para>Record and understand the geographic distribution of moths in Ireland</para>
</purpose>
<contact>
<individualName>
<givenName>Barry</givenName>
<surName>O'Neill</surName>
</individualName>
<organizationName>National Biodiversity Data Centre, Ireland</organizationName>
<address>
<deliveryPoint>Beechfield house, Carriganore WIT West Campus</deliveryPoint>
<city>Waterford</city>
<administrativeArea>County Waterford</administrativeArea>
<country>Ireland</country>
</address>
<phone>+353 (0)51 306 240</phone>
<electronicMailAddress>boneill@biodiversityireland.ie</electronicMailAddress>
<onlineUrl>http://www.biodiversityireland.ie/</onlineUrl>
</contact>
<methods>
<methodStep>
<description>
<para>Field observations supported by photographs where necessary</para>
</description>
</methodStep>
<qualityControl>
<description>
<para>All records validated prior to submission to MothsIreland</para>
</description>
</qualityControl>
</methods>
</dataset>
<additionalMetadata>
<metadata>
<gbif>
<dateStamp>2018-07-30T14:43:18.6569988+01:00</dateStamp>
<citation>National Biodiversity Data Centre: Collated by the National Biodiversity Data Centre from different sources - Moth Records of Ireland. Dataset/Occurrence.</citation>
</gbif>
</metadata>
</additionalMetadata>
</eml:eml>
@pgasu I just released preston v0.3.7. This release includes a fix for #161 and improves the error logging when encountering funny dwca (truncated zip files, invalid csv files).
So, as far as I can tell, your issues have been addressed. It is now up to the data providers to fix their data archives.
Thanks @jhpoelen. This looks great.
Every time I run 'dwc-stream' on any preston archive to extract Darwin Core data, I get the error below. I have tried this on multiple versions (versions 1, 38, 62), and they all seem to give the same error. For example, the following command generated the error below.
preston history --log tsv --remote https://deeplinker.bio | head -n38 | tail -n1 | cut -f3 | preston cat --remote https://deeplinker.bio | preston dwc-stream --remote https://deeplinker.bio | jq --raw-output '.["http://rs.tdwg.org/dwc/terms/scientificName"] + "," + .["http://rs.tdwg.org/dwc/terms/taxonRank"] + "," + .["http://rs.tdwg.org/dwc/terms/class"] + "," + .["rowType"] + "," + .["contentId"]' | sort | uniq > raw_38.txt
Even after generating the error, the command runs through and outputs the relevant data. However, I am not sure whether 'dwc-stream' processes the datasets that appear after the corrupt file causing this error. Running this on versions 1 and 38 gave me the same error, but the output file from version 1 was much bigger than the one from version 38, which is somewhat unexpected.
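One way to check whether processing continued past the corrupt file is to count how many distinct archives contributed at least one record to each output file. The sketch below assumes that the contentId column written by the jq expression above embeds the archive's hash://sha256/... identifier; that is an assumption about the output, not something confirmed here. count_archives is a made-up helper, and raw_1.txt is a hypothetical file name for the version 1 output.

```shell
# Count the distinct archive hashes that appear in an output file.
# count_archives is a hypothetical helper; it assumes each line embeds a
# hash://sha256/<64 hex chars> content id somewhere.
count_archives() {
  grep -oE 'hash://sha256/[0-9a-f]{64}' "$1" | sort -u | wc -l | tr -d ' '
}

# Usage (file names are examples):
#   count_archives raw_1.txt
#   count_archives raw_38.txt
```

If the count for version 38 is far lower than for version 1, that would suggest the stream stopped early rather than skipping only the bad archive.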