bio-guoda / preston

a biodiversity dataset tracker
MIT License

Error "Truncated ZIP file" when running 'dwc-stream' on any version of preston-archive #158

Closed pgasu closed 2 years ago

pgasu commented 2 years ago

Every time I run 'dwc-stream' on a Preston archive to extract Darwin Core data, I get the error below. I have tried this on multiple versions (1, 38, and 62), and they all give the same error. For example, the following command generated the error below:
preston history --log tsv --remote https://deeplinker.bio | head -n38 | tail -n1 | cut -f3 | preston cat --remote https://deeplinker.bio | preston dwc-stream --remote https://deeplinker.bio | jq --raw-output '.["http://rs.tdwg.org/dwc/terms/scientificName"] + "," + .["http://rs.tdwg.org/dwc/terms/taxonRank"] + "," + .["http://rs.tdwg.org/dwc/terms/class"] + "," + .["rowType"] + "," + .["contentId"]' | sort | uniq > raw_38.txt

Even after raising the error, the command runs to completion and outputs the relevant data. However, I am not sure whether 'dwc-stream' still processes the datasets that come after the corrupt file that causes this error. Running this on versions 1 and 38 gave the same error, but the output file from version 1 was much larger than the one from version 38, which is somewhat unexpected.
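One way to check whether records were still emitted for archives that come after the failing one is to count output rows per archive. The following is a minimal sketch, not part of Preston. It assumes that each row's `contentId` value embeds the `hash://sha256/...` identifier of the archive it came from, which is not confirmed in this thread. `count_records_per_archive` is a hypothetical helper name.

```shell
# Count output rows per archive in a file such as raw_38.txt.
# Assumption: every row contains the archive's hash://sha256/<64 hex chars> id.
# A row that mentions two ids is counted once for each of them.
count_records_per_archive() {
  grep -oE 'hash://sha256/[0-9a-f]{64}' "$1" | sort | uniq -c | sort -rn
}

# Usage:
#   count_records_per_archive raw_38.txt
```

If an archive id that is expected in a given version never shows up in this list, that archive probably contributed no rows.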

java.lang.RuntimeException: Truncated ZIP file
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator._handleIOException(MappingIterator.java:417)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:203)
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:109)
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
        at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
        at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:73)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:64)
        at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:50)
        at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:45)
        at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:34)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:55)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:29)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
        at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.io.IOException: Truncated ZIP file
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readDeflated(ZipArchiveInputStream.java:636)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:528)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read1(BufferedReader.java:212)
        at java.io.BufferedReader.read(BufferedReader.java:286)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.loadMore(CsvDecoder.java:443)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder._nextUnquotedString(CsvDecoder.java:764)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.nextString(CsvDecoder.java:714)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser._handleUnnamedValue(CsvParser.java:903)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:620)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:281)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:249)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:26)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:280)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:199)
        ... 19 more
java.lang.RuntimeException: Truncated ZIP file
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator._handleIOException(MappingIterator.java:417)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:203)
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:109)
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
        at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
        at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:73)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:64)
        at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:50)
        at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:45)
        at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:34)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:55)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:29)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
        at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.io.IOException: Truncated ZIP file
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readDeflated(ZipArchiveInputStream.java:636)
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:528)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read1(BufferedReader.java:212)
        at java.io.BufferedReader.read(BufferedReader.java:286)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.loadMore(CsvDecoder.java:443)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder._nextUnquotedString(CsvDecoder.java:764)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.nextString(CsvDecoder.java:714)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser._handleUnnamedValue(CsvParser.java:903)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:620)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:281)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:249)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:26)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:280)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:199)
        ... 19 more

pgasu commented 2 years ago

I ran the same operations for version 68 and got a couple of other errors (shown below). Please note that if I don't use 'parallel' with dwc-stream, the script hits the first error, 'dwc-stream' stops, and then the rest of the commands in the pipeline run. So maybe we need to modify 'dwc-stream' so that, if it encounters an error or a corrupted file, it prints the warning/error, skips the file, and continues with the rest of the datasets.

time preston history --log tsv --remote https://deeplinker.bio | head -n68 | tail -n1 | cut -f3 |\
preston cat --remote https://deeplinker.bio |\
parallel --pipe preston dwc-stream --remote https://deeplinker.bio |\
jq --raw-output '.["http://rs.tdwg.org/dwc/terms/scientificName"] + "," +
.["http://rs.tdwg.org/dwc/terms/taxonRank"] + "," + .["http://rs.tdwg.org/dwc/terms/class"] + "," +
.["rowType"] + "," + .["contentId"]' |\
sort | uniq > raw_68.txt
java.lang.IllegalStateException: java.text.ParseException: Unexpected character ('.' (code 46)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 114943, column: 489]
        at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:85)
        at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:73)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:64)
        at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:50)
        at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:45)
        at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:34)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:55)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:29)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
        at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.text.ParseException: Unexpected character ('.' (code 46)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 114943, column: 489]
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
        at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
        ... 16 more
java.lang.IllegalStateException: java.text.ParseException: Unexpected character ('.' (code 46)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 114943, column: 489]
        at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:85)
        at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:73)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:64)
        at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:50)
        at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:45)
        at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:34)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:55)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:29)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
        at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.text.ParseException: Unexpected character ('.' (code 46)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 114943, column: 489]
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
        at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
        ... 16 more
[main] ERROR bio.guoda.preston.cmd.CmdLine - unexpected exception
java.lang.RuntimeException: Input length = 1
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator._handleIOException(MappingIterator.java:417)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:203)
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:109)
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
        at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
        at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:73)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:64)
        at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:50)
        at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:45)
        at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:34)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:55)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:29)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
        at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read1(BufferedReader.java:212)
        at java.io.BufferedReader.read(BufferedReader.java:286)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.loadMore(CsvDecoder.java:443)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder._nextQuotedString(CsvDecoder.java:813)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.nextString(CsvDecoder.java:659)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser._handleUnnamedValue(CsvParser.java:903)
java.lang.RuntimeException: Input length = 1
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:620)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:281)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:249)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:26)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:280)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:199)
        ... 19 more
java.lang.RuntimeException: Input length = 1
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator._handleIOException(MappingIterator.java:417)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:203)
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:109)
        at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
        at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
        at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:73)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
        at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
        at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
        at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:99)
        at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:64)
        at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:50)
        at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:45)
        at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:34)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:55)
        at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:29)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
        at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
        at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read1(BufferedReader.java:212)
        at java.io.BufferedReader.read(BufferedReader.java:286)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.loadMore(CsvDecoder.java:443)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder._nextQuotedString(CsvDecoder.java:813)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.nextString(CsvDecoder.java:659)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser._handleUnnamedValue(CsvParser.java:903)
        at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:620)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:281)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:249)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:26)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:280)
        at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:199)
        ... 19 more
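A `MalformedInputException` like the one in the last trace is what Java's charset decoder throws when it meets bytes that are not valid in the encoding it expects. The sketch below is a rough way to test a data file unpacked from an archive for this. It assumes that the reader expects UTF-8, which the trace does not confirm, and that `iconv` is installed. `is_valid_utf8` is a hypothetical helper name.

```shell
# Return 0 if the file decodes cleanly as UTF-8, non-zero otherwise.
# Assumption: UTF-8 is the encoding the CSV reader expects.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Usage (occurrence.txt stands for a file unpacked from an archive):
#   is_valid_utf8 occurrence.txt || echo "occurrence.txt is not valid UTF-8"
```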

jhpoelen commented 2 years ago

@pgasu I was able to reproduce the issue you found. To help identify the suspicious resources, I added some logging to the upcoming Preston release. Note that the error messages below include the hash URI (or content id). Now that we know the content ids, the exact provenance of the suspicious resources can be derived more easily.
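Because each warning names the content id in square brackets after `suspicious DwC resource`, the ids of the problematic archives can be collected from a saved log. This is a small sketch that assumes the warning wording stays as it appears in this thread and that the log was captured to a file. `list_suspicious_ids` is a hypothetical helper name and `dwc-stream.log` is a placeholder.

```shell
# Print the unique content ids named in "suspicious DwC resource [...]" warnings.
# Assumption: the warning text matches the format quoted in this thread.
list_suspicious_ids() {
  sed -n 's|.*suspicious DwC resource \[\(hash://sha256/[0-9a-f]*\)].*|\1|p' "$1" | sort -u
}

# Usage:
#   list_suspicious_ids dwc-stream.log
```

If the warnings go to standard error (an assumption), appending `2> dwc-stream.log` to the `preston dwc-stream` stage of the pipeline should capture them without disturbing the JSON on standard output.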

[main] WARN bio.guoda.preston.process.DwcRecordExtractor - suspicious DwC resource [hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89] caused errors in processing
bio.guoda.preston.stream.ContentStreamException: failed to handle dwc records from [zip:hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89!/occurrence.txt]
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:87)
    at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:55)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
    at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:62)
    at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:47)
    at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:55)
    at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:44)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:52)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:27)
    at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
    at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
    at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.lang.RuntimeException: Truncated ZIP file
    at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator._handleIOException(MappingIterator.java:417)
    at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:203)
    at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:109)
    at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
    at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:83)
    ... 16 more
Caused by: java.io.IOException: Truncated ZIP file
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readDeflated(ZipArchiveInputStream.java:636)
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:528)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.read1(BufferedReader.java:212)
    at java.io.BufferedReader.read(BufferedReader.java:286)
    at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.loadMore(CsvDecoder.java:443)
    at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder._nextUnquotedString(CsvDecoder.java:764)
    at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.nextString(CsvDecoder.java:714)
    at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser._handleUnnamedValue(CsvParser.java:903)
    at org.gbif.common.shaded.com.fasterxml.jackson.dataformat.csv.CsvParser.nextToken(CsvParser.java:620)
    at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:281)
    at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:249)
    at org.gbif.common.shaded.com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:26)
    at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:280)
    at org.gbif.common.shaded.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:199)
    ... 20 more
[main] WARN bio.guoda.preston.process.DwcRecordExtractor - suspicious DwC resource [hash://sha256/9e19b4069017c64a31d693fc5cf296e264c40aadaa475c35e36aea0e0348e6fb] caused errors in processing
bio.guoda.preston.stream.ContentStreamException: failed to handle dwc records from [zip:hash://sha256/9e19b4069017c64a31d693fc5cf296e264c40aadaa475c35e36aea0e0348e6fb!/DarwinCore.txt]
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:87)
    at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:55)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
    at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:62)
    at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:47)
    at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:55)
    at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:44)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:52)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:27)
    at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
    at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
    at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.lang.IllegalStateException: java.text.ParseException: Unexpected character ('V' (code 86)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 859, column: 193]
    at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:85)
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:83)
    ... 16 more
Caused by: java.text.ParseException: Unexpected character ('V' (code 86)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 859, column: 193]
    at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
    at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
    at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
    ... 17 more
[main] WARN bio.guoda.preston.process.DwcRecordExtractor - suspicious DwC resource [hash://sha256/db5cd1d7c8738c6f5ad241878a4fdc9929b719e4261ef9e1b72bbe8275e6eb92] caused errors in processing
bio.guoda.preston.stream.ContentStreamException: failed to handle dwc records from [zip:hash://sha256/db5cd1d7c8738c6f5ad241878a4fdc9929b719e4261ef9e1b72bbe8275e6eb92!/DarwinCore.txt]
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:87)
    at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:55)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
    at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:62)
    at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:47)
    at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:55)
    at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:44)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:52)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:27)
    at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
    at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
    at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.lang.IllegalStateException: java.text.ParseException: Unexpected character ('P' (code 80)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 1961, column: 299]
    at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:85)
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:83)
    ... 16 more
Caused by: java.text.ParseException: Unexpected character ('P' (code 80)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 1961, column: 299]
    at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
    at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
    at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
    ... 17 more
[main] WARN bio.guoda.preston.process.DwcRecordExtractor - suspicious DwC resource [hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c] caused errors in processing
bio.guoda.preston.stream.ContentStreamException: failed to handle dwc records from [zip:hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c!/occurrence.txt]
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:87)
    at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:55)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.process.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:98)
    at bio.guoda.preston.process.DwcRecordExtractor.on(DwcRecordExtractor.java:62)
    at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:47)
    at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:55)
    at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:44)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:52)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:27)
    at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:18)
    at bio.guoda.preston.cmd.CmdLine.run(CmdLine.java:26)
    at bio.guoda.preston.Preston.main(Preston.java:14)
Caused by: java.lang.IllegalStateException: java.text.ParseException: Unexpected character ('"' (code 34)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 2, column: 24]
    at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:85)
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:83)
    ... 16 more
Caused by: java.text.ParseException: Unexpected character ('"' (code 34)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 2, column: 24]
    at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
    at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
    at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
    ... 17 more
jhpoelen commented 2 years ago

The following content ids were causing the issues seen:

hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89

was retrieved from http://ipt-inpn.gbif.fr/archive.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021 on 2019-09-01T11:10:04.772Z .

from provenance:

$ preston history --log tsv | head -n38 | tail -n1 | cut -f3 | preston cat | grep --after 10 --before 10 hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89
<hash://sha256/85997b54f8ad86e21b3b15ed392cd30280c0965feb2a63b467641f22b3c51db4> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/85997b54f8ad86e21b3b15ed392cd30280c0965feb2a63b467641f22b3c51db4> <http://www.w3.org/ns/prov#qualifiedGeneration> <c0d6631a-1ec6-4bb1-a429-077791e63d47> .
<c0d6631a-1ec6-4bb1-a429-077791e63d47> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<c0d6631a-1ec6-4bb1-a429-077791e63d47> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<c0d6631a-1ec6-4bb1-a429-077791e63d47> <http://www.w3.org/ns/prov#used> <http://ipt-inpn.gbif.fr/eml.do?r=IPT-6386AA2E-7597-5FE3-E053-2614A8C00573> .
<http://ipt-inpn.gbif.fr/eml.do?r=IPT-6386AA2E-7597-5FE3-E053-2614A8C00573> <http://purl.org/pav/hasVersion> <hash://sha256/85997b54f8ad86e21b3b15ed392cd30280c0965feb2a63b467641f22b3c51db4> .
<hash://sha256/14afc84936c197ab0ce367dc20845aef9f85ba9c7c687f3ec90b8e6306de706c> <http://www.w3.org/ns/prov#hadMember> <66551429-5e35-4848-94f5-b13bcad669a5> .
<66551429-5e35-4848-94f5-b13bcad669a5> <http://www.w3.org/1999/02/22-rdf-syntax-ns#seeAlso> <https://doi.org/10.15468/3kmwvz> .
<66551429-5e35-4848-94f5-b13bcad669a5> <http://www.w3.org/ns/prov#hadMember> <http://ipt-inpn.gbif.fr/archive.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> .
<http://ipt-inpn.gbif.fr/archive.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> <http://purl.org/dc/elements/1.1/format> "application/dwca" .
<hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89> <http://www.w3.org/ns/prov#generatedAtTime> "2019-09-01T11:10:04.772Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89> <http://www.w3.org/ns/prov#qualifiedGeneration> <726f1fa3-013d-4aaf-88f1-3b7913ef0b2b> .
<726f1fa3-013d-4aaf-88f1-3b7913ef0b2b> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<726f1fa3-013d-4aaf-88f1-3b7913ef0b2b> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<726f1fa3-013d-4aaf-88f1-3b7913ef0b2b> <http://www.w3.org/ns/prov#used> <http://ipt-inpn.gbif.fr/archive.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> .
<http://ipt-inpn.gbif.fr/archive.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> <http://purl.org/pav/hasVersion> <hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89> .
<66551429-5e35-4848-94f5-b13bcad669a5> <http://www.w3.org/ns/prov#hadMember> <http://ipt-inpn.gbif.fr/eml.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> .
<http://ipt-inpn.gbif.fr/eml.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> <http://purl.org/dc/elements/1.1/format> "application/eml" .
<hash://sha256/587574aa16ade59c5320f403fccf3410cef403a21a24c4f830f336d8a7965484> <http://www.w3.org/ns/prov#generatedAtTime> "2019-09-01T11:10:04.901Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<hash://sha256/587574aa16ade59c5320f403fccf3410cef403a21a24c4f830f336d8a7965484> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/587574aa16ade59c5320f403fccf3410cef403a21a24c4f830f336d8a7965484> <http://www.w3.org/ns/prov#qualifiedGeneration> <cf62d96f-14fd-45bf-8847-5cccecdcb68c> .
<cf62d96f-14fd-45bf-8847-5cccecdcb68c> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<cf62d96f-14fd-45bf-8847-5cccecdcb68c> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<cf62d96f-14fd-45bf-8847-5cccecdcb68c> <http://www.w3.org/ns/prov#used> <http://ipt-inpn.gbif.fr/eml.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> .
<http://ipt-inpn.gbif.fr/eml.do?r=IPT-6C62C6AA-288B-5AE0-E053-2614A8C0C021> <http://purl.org/pav/hasVersion> <hash://sha256/587574aa16ade59c5320f403fccf3410cef403a21a24c4f830f336d8a7965484> .
<hash://sha256/14afc84936c197ab0ce367dc20845aef9f85ba9c7c687f3ec90b8e6306de706c> <http://www.w3.org/ns/prov#hadMember> <1fec2829-4ac0-4329-b448-6f768b8d3427> .

(truncated zip) (see attached)

hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c (unexpected character)

was retrieved from https://drive.google.com/uc?export=download&id=0B-BbRNUB6S7aSVdYSjNpbkZsM1E on 2019-09-01T12:49:22.050Z . See also https://www.gbif.org/dataset/62d82928-dc6f-40dc-85b3-f2be47e7b49a and https://doi.org/10.15468/17e8en .

from provenance:

$ preston history --log tsv | head -n38 | tail -n1 | cut -f3 | preston cat | grep --after 10 --before 10 hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c 
<hash://sha256/7d4385886aa3e0ebd3533651b5fd9a1d6c06738fd66a87a20ec502aee077bfed> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/7d4385886aa3e0ebd3533651b5fd9a1d6c06738fd66a87a20ec502aee077bfed> <http://www.w3.org/ns/prov#qualifiedGeneration> <c2b5356e-d594-4be4-abbf-eca37a66607e> .
<c2b5356e-d594-4be4-abbf-eca37a66607e> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<c2b5356e-d594-4be4-abbf-eca37a66607e> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<c2b5356e-d594-4be4-abbf-eca37a66607e> <http://www.w3.org/ns/prov#used> <http://tb.plazi.org/GgServer/dwca/FFB3092B7068BC729535FFB9FFA05F36.zip> .
<http://tb.plazi.org/GgServer/dwca/FFB3092B7068BC729535FFB9FFA05F36.zip> <http://purl.org/pav/hasVersion> <hash://sha256/7d4385886aa3e0ebd3533651b5fd9a1d6c06738fd66a87a20ec502aee077bfed> .
<hash://sha256/afdbe7a01bb39c510f0dee1c467541447e388adc7a35dd2d343029ed281f490a> <http://www.w3.org/ns/prov#hadMember> <62d82928-dc6f-40dc-85b3-f2be47e7b49a> .
<62d82928-dc6f-40dc-85b3-f2be47e7b49a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#seeAlso> <https://doi.org/10.15468/17e8en> .
<62d82928-dc6f-40dc-85b3-f2be47e7b49a> <http://www.w3.org/ns/prov#hadMember> <https://drive.google.com/uc?export=download&id=0B-BbRNUB6S7aSVdYSjNpbkZsM1E> .
<https://drive.google.com/uc?export=download&id=0B-BbRNUB6S7aSVdYSjNpbkZsM1E> <http://purl.org/dc/elements/1.1/format> "application/dwca" .
<hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c> <http://www.w3.org/ns/prov#generatedAtTime> "2019-09-01T12:49:22.050Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c> <http://www.w3.org/ns/prov#qualifiedGeneration> <4a553dbd-6d85-4cc4-9d02-db284af50b92> .
<4a553dbd-6d85-4cc4-9d02-db284af50b92> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<4a553dbd-6d85-4cc4-9d02-db284af50b92> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<4a553dbd-6d85-4cc4-9d02-db284af50b92> <http://www.w3.org/ns/prov#used> <https://drive.google.com/uc?export=download&id=0B-BbRNUB6S7aSVdYSjNpbkZsM1E> .
<https://drive.google.com/uc?export=download&id=0B-BbRNUB6S7aSVdYSjNpbkZsM1E> <http://purl.org/pav/hasVersion> <hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c> .
<hash://sha256/afdbe7a01bb39c510f0dee1c467541447e388adc7a35dd2d343029ed281f490a> <http://www.w3.org/ns/prov#hadMember> <09cac3e8-42f4-41cf-a121-488602a8429d> .
<09cac3e8-42f4-41cf-a121-488602a8429d> <http://www.w3.org/1999/02/22-rdf-syntax-ns#seeAlso> <https://doi.org/10.5281/zenodo.1048224> .
<09cac3e8-42f4-41cf-a121-488602a8429d> <http://www.w3.org/ns/prov#hadMember> <http://tb.plazi.org/GgServer/dwca/FFCC8B0CFFA3FF80A43F537CFFC6F93F.zip> .
<http://tb.plazi.org/GgServer/dwca/FFCC8B0CFFA3FF80A43F537CFFC6F93F.zip> <http://purl.org/dc/elements/1.1/format> "application/dwca" .
<hash://sha256/a0cc1850821e1a9914e24fb7d6dff2bd9bb352afc269ed0c67220017cd520d44> <http://www.w3.org/ns/prov#generatedAtTime> "2019-09-01T12:49:22.112Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<hash://sha256/a0cc1850821e1a9914e24fb7d6dff2bd9bb352afc269ed0c67220017cd520d44> <http://www.w3.org/ns/prov#wasGeneratedBy> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<hash://sha256/a0cc1850821e1a9914e24fb7d6dff2bd9bb352afc269ed0c67220017cd520d44> <http://www.w3.org/ns/prov#qualifiedGeneration> <cce9fc0e-9eb5-4919-b351-fb32ff077020> .
<cce9fc0e-9eb5-4919-b351-fb32ff077020> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> .
<cce9fc0e-9eb5-4919-b351-fb32ff077020> <http://www.w3.org/ns/prov#activity> <501dadd3-0468-4a12-b6a9-1cd6c228be70> .
<cce9fc0e-9eb5-4919-b351-fb32ff077020> <http://www.w3.org/ns/prov#used> <http://tb.plazi.org/GgServer/dwca/FFCC8B0CFFA3FF80A43F537CFFC6F93F.zip> .
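
The source url and retrieval time of a content id can also be pulled out of the provenance statements with a small script. A minimal Python sketch, assuming the statements are available as text lines in the `<subject> <predicate> <object> .` form shown above; the function name `trace` and the regular expression are illustrative and not part of preston:

```python
import re

HAS_VERSION = "http://purl.org/pav/hasVersion"
GENERATED_AT = "http://www.w3.org/ns/prov#generatedAtTime"

# Matches the start of a statement: subject IRI, predicate IRI, then either
# an object IRI or a quoted literal (any datatype suffix is ignored).
STATEMENT = re.compile(r'^<([^>]*)> <([^>]*)> (?:<([^>]*)>|"([^"]*)")')


def trace(content_id, lines):
    """Return (source_url, retrieved_at) for a content id, or None where unknown."""
    source_url = None
    retrieved_at = None
    for line in lines:
        match = STATEMENT.match(line)
        if match is None:
            continue
        subject, predicate, iri, literal = match.groups()
        if predicate == HAS_VERSION and iri == content_id:
            source_url = subject
        elif predicate == GENERATED_AT and subject == content_id:
            retrieved_at = literal
    return source_url, retrieved_at
```

Feeding it the output of `preston cat` for a provenance log (one statement per line) should give the same url and timestamp as reading the grep output by eye.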
jhpoelen commented 2 years ago

In validating the suspicious dwca file with content id hash://sha256/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c

we found that the GBIF validator didn't like the content served via https://deeplinker.bio/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c either.

from https://www.gbif.org/tools/data-validator/1a026a9d-7441-457b-a032-1b212cb6d809 (two screenshots of the validator report, taken 2022-04-13 15-24-12 and 2022-04-13 15-26-11)

However, the content does appear to have a valid zip structure:

$ curl "https://deeplinker.bio/d47fa5353f0a5a78ac0eb74db37f89f1ac255b8065f75bc35bc124c62047035c" > tmp/some.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9694k  100 9694k    0     0  3923k      0  0:00:02  0:00:02 --:--:-- 3923k
$ unzip -l tmp/some.zip 
Archive:  tmp/some.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     6635  2017-03-14 09:53   eml.xml
     2583  2017-02-21 08:38   meta.xml
 93838675  2017-02-22 17:46   occurrence.txt
---------                     -------
 93847893                     3 files

Looking up the offending line using

$ preston cat --remote https://deeplinker.bio 'line:zip:hash://sha256/9e19b4069017c64a31d693fc5cf296e264c40aadaa475c35e36aea0e0348e6fb!/DarwinCore.txt!/L859' 
"39393998","2019-04-18T14:40:14Z","Type of sighting=Sighting of live animal;County=Louth;Vice-county=Louth;Abundance=One;Record comment=The sensor light came on and I looked up from reading "Village" magazine to see this wee beaut walk past.","MammalsOfIreland2016-2025","MammalsOfIreland2016-2025","97209583948267984","Brendan McSherry",,"2017-02-23",,"Mullatee","54.0301013900","-6.1579993300","100","Meles meles","Species","(Linnaeus, 1758)","accepted","Eurasian Badger","ie.nbdc.dataset.MammalsOfIreland2016-2025.97209583948267984"

It appears that the double quote is not properly escaped (e.g., [...] reading "Village" magazine [...]).
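
To find such lines without running the whole pipeline, a strict CSV parser can be pointed at each line in turn. A minimal Python sketch, assuming comma-separated, double-quoted fields and no line breaks inside a field; `malformed_lines` is a hypothetical helper and uses Python's `csv` module, not the Jackson parser that preston relies on:

```python
import csv


def malformed_lines(lines):
    """Return the 1-based numbers of lines that a strict CSV parser rejects."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            # strict=True makes the reader raise on a stray quote instead of guessing.
            list(csv.reader([line], strict=True))
        except csv.Error:
            bad.append(number)
    return bad
```

A quote inside a quoted field is accepted only when it is doubled (`""Village""`), which is the usual CSV escaping convention.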

jhpoelen commented 2 years ago

unzip confirms that hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89 is not a well-formed zip file.

$ preston cat --remote https://deeplinker.bio hash://sha256/bb95b5357711b8652ef0ea4dc36bfa48e5355ed5ab574b7c0b7e333b60232c89 > badzip.zip
$ unzip -l badzip.zip 
Archive:  badzip.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of badzip.zip or
        badzip.zip.zip, and cannot find badzip.zip.ZIP, period.
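
The same check can be scripted for many content ids at once. A minimal Python sketch, assuming the content has already been read into memory as bytes (for instance from a file written by `preston cat`); `check_zip` is a hypothetical helper built on Python's standard `zipfile` module:

```python
import io
import zipfile


def check_zip(data: bytes) -> str:
    """Classify bytes as a readable zip, a damaged zip, or not a zip at all."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # testzip() reads every entry and returns the first one with a bad CRC.
            corrupt = archive.testzip()
    except zipfile.BadZipFile as error:
        # Raised, for example, when the end-of-central-directory record is missing.
        return f"not a well-formed zip: {error}"
    if corrupt is not None:
        return f"corrupt entry: {corrupt}"
    return "ok"
```

A truncated download typically loses the central directory at the end of the file, which is the condition unzip reports above.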
jhpoelen commented 2 years ago

@pgasu so, it appears that the issues you noticed are actual issues with the integrity of the tracked content.

If you'd like, you can trace the provenance further and contact the owners of the data to resolve the issue.

jhpoelen commented 2 years ago

Also, preston keeps processing an archive until it hits a snag, then skips to the next archive.
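
In other words, records read before the bad line are still emitted, and later archives are still visited. A minimal Python sketch of that skip-on-error pattern, not preston's actual code; `stream_records` and the shape of `archives` (a dict mapping a name to a list of CSV lines) are assumptions made for illustration:

```python
import csv
import sys


def stream_records(archives):
    """Yield (archive_name, row) pairs; abandon an archive at its first bad line."""
    for name, lines in archives.items():
        try:
            for row in csv.reader(lines, strict=True):
                yield name, row
        except csv.Error as error:
            # Rows already yielded are kept; the rest of this archive is skipped.
            print(f"WARN suspicious resource [{name}]: {error}", file=sys.stderr)
```

With this pattern, a smaller-than-expected output can mean that some archives were cut short at a parse error, so the warnings are worth counting.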

Please advise on how to proceed.

jhpoelen commented 2 years ago

Note that similarly:

$ preston cat --remote https://deeplinker.bio 'line:zip:hash://sha256/db5cd1d7c8738c6f5ad241878a4fdc9929b719e4261ef9e1b72bbe8275e6eb92!/DarwinCore.txt!/L1961'
[https://deeplinker.bio/d...61ef9e1b72bbe8275e6eb92] 100.0% of 176 kB at 0.51 MB/s completed in < 1 minute
"37647418","2018-07-30T13:40:26Z","Determiner name=Phil Melly;Sampling method=Daytime observation;Record comment=On observation, consulted internet and is satisfied this was indeed a Hummingbird Hawk Moth.  Couldn't believe was watching a Hummingbird in Ireland!  Feeding that day on Verbena and "Petunia Million Bells".","MothRecordsOfIreland","MothRecordsOfIreland","bc3eac8a-5507-11e7-a40d-00155d018200","Phil Melly",,"2017-05-31",,"Bogganstown Culmullin","53.4746220300","-6.6114130600","100","Macroglossum stellatarum","Species","(Linnaeus, 1758)","accepted","Humming-bird Hawk-moth","ie.nbdc.dataset.MothRecordsOfIreland.bc3eac8a-5507-11e7-a40d-00155d018200"

where unescaped double quotes inside the Darwin Core archive (i.e., [...] Feeding that day on Verbena and "Petunia Million Bells".","Mot [...]) cause the csv parser to complain.
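
For a copy of the archive that has already been downloaded, a single line of an entry can be looked up locally in a similar way. A minimal Python sketch, assuming the archive bytes are in memory and the entry is UTF-8 text; `line_in_zip` is a hypothetical helper and only approximates what the `line:zip:...!/L1961` query does:

```python
import io
import zipfile


def line_in_zip(zip_bytes: bytes, member: str, line_number: int):
    """Return line `line_number` (1-based) of `member` inside the zip, or None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as entry:
            # Entries are streamed, so large files are not loaded whole.
            for number, raw in enumerate(entry, start=1):
                if number == line_number:
                    return raw.decode("utf-8").rstrip("\r\n")
    return None
```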

The associated url was http://gbif.biodiversityireland.ie/MothRecordsOfIreland.zip , the GBIF dataset is https://www.gbif.org/dataset/be6e251d-cdf4-4784-8054-9f4be19ce3d9 , and a related eml.xml is:

$ preston cat --remote https://deeplinker.bio 'zip:hash://sha256/db5cd1d7c8738c6f5ad241878a4fdc9929b719e4261ef9e1b72bbe8275e6eb92!/eml.xml'
[https://deeplinker.bio/d...61ef9e1b72bbe8275e6eb92] 100.0% of 176 kB at 0.66 MB/s completed in < 1 minute
<?xml version="1.0" encoding="utf-8"?>
<eml:eml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="eml://ecoinformatics.org/dataset-2.1.1" packageId="ie.nbdc.dataset.MothRecordsOfIreland" system="http://gbif.biodiversityireland.ie/" scope="system" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.1/eml.xsd" xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
  <dataset>
    <alternateIdentifier>MothRecordsOfIreland</alternateIdentifier>
    <title>Moth Records of Ireland</title>
    <creator>
      <individualName>
        <surName>Michael O'Donnell</surName>
      </individualName>
      <organizationName>Collated by the National Biodiversity Data Centre from different sources</organizationName>
      <address>
        <country>Ireland</country>
      </address>
      <electronicMailAddress>micealodonnell@eircom.net</electronicMailAddress>
      <onlineUrl>http://www.mothsireland.com/</onlineUrl>
    </creator>
    <metadataProvider>
      <organizationName>National Biodiversity Data Centre, Ireland</organizationName>
      <address>
        <deliveryPoint>Beechfield house, Carriganore WIT West Campus</deliveryPoint>
        <city>Waterford</city>
        <administrativeArea>County Waterford</administrativeArea>
        <country>Ireland</country>
      </address>
      <phone>+353 (0)51 306 240</phone>
      <electronicMailAddress>info@biodiversityireland.ie</electronicMailAddress>
      <onlineUrl>http://www.biodiversityireland.ie/</onlineUrl>
    </metadataProvider>
    <pubDate>2018-07-30Z</pubDate>
    <language>en</language>
    <abstract>
      <para>Moth records collated from a variety of sources</para>
    </abstract>
    <additionalInfo>
      <para>http://www.mothsireland.com/</para>
    </additionalInfo>
    <intellectualRights>
      <para>This work is licensed under a<ulink url="http://creativecommons.org/licenses/by/4.0/legalcode"><citetitle>Creative Commons Attribution (CC-BY) 4.0 License</citetitle></ulink></para>
    </intellectualRights>
    <distribution>
      <online>
        <url function="information">http://maps.biodiversityireland.ie/DataSet/268</url>
      </online>
    </distribution>
    <coverage>
      <geographicCoverage>
        <geographicDescription>The island of Ireland</geographicDescription>
        <boundingCoordinates>
          <westBoundingCoordinate>-10.5383265100</westBoundingCoordinate>
          <eastBoundingCoordinate>-5.8212730500</eastBoundingCoordinate>
          <northBoundingCoordinate>55.3117198200</northBoundingCoordinate>
          <southBoundingCoordinate>51.4314347500</southBoundingCoordinate>
        </boundingCoordinates>
      </geographicCoverage>
      <temporalCoverage>
        <rangeOfDates>
          <beginDate>
            <calendarDate>2007</calendarDate>
          </beginDate>
          <endDate>
            <calendarDate>2017</calendarDate>
          </endDate>
        </rangeOfDates>
      </temporalCoverage>
    </coverage>
    <purpose>
      <para>Record and understand the geographic distribution of moths in Ireland</para>
    </purpose>
    <contact>
      <individualName>
        <givenName>Barry</givenName>
        <surName>O'Neill</surName>
      </individualName>
      <organizationName>National Biodiversity Data Centre, Ireland</organizationName>
      <address>
        <deliveryPoint>Beechfield house, Carriganore WIT West Campus</deliveryPoint>
        <city>Waterford</city>
        <administrativeArea>County Waterford</administrativeArea>
        <country>Ireland</country>
      </address>
      <phone>+353 (0)51 306 240</phone>
      <electronicMailAddress>boneill@biodiversityireland.ie</electronicMailAddress>
      <onlineUrl>http://www.biodiversityireland.ie/</onlineUrl>
    </contact>
    <methods>
      <methodStep>
        <description>
          <para>Field observations supported by photographs where necessary</para>
        </description>
      </methodStep>
      <qualityControl>
        <description>
          <para>All records validated prior to submission to MothsIreland</para>
        </description>
      </qualityControl>
    </methods>
  </dataset>
  <additionalMetadata>
    <metadata>
      <gbif>
        <dateStamp>2018-07-30T14:43:18.6569988+01:00</dateStamp>
        <citation>National Biodiversity Data Centre: Collated by the National Biodiversity Data Centre from different sources - Moth Records of Ireland. Dataset/Occurrence.</citation>
      </gbif>
    </metadata>
  </additionalMetadata>
</eml:eml>
jhpoelen commented 2 years ago

@pgasu I just released preston v0.3.7. This release includes a fix for #161 and improves the error logging when encountering funny dwca (truncated zip files, invalid csv files).

So, as far as I can tell, your issues have been addressed. It is now up to the data providers to fix their data archives.

pgasu commented 2 years ago

Thanks @jhpoelen. This looks great.