bio-guoda / preston

a biodiversity dataset tracker

preston dwc-stream crashes on EOL trait dwca #186

Open jhpoelen opened 2 years ago

jhpoelen commented 2 years ago

For some reason, an EoL trait dataset expressed as a DwC-A causes preston to choke with a NullPointerException:

bio.guoda.preston.process.DwcRecordExtractorTest,streamEncyclopediaOfLife
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:contributor and dc:contributor are both known as "contributor". Keeping only earlier dcterms:contributor
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:coverage and dc:coverage are both known as "coverage". Keeping only earlier dcterms:coverage
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:creator and dc:creator are both known as "creator". Keeping only earlier dcterms:creator
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:date and dc:date are both known as "date". Keeping only earlier dcterms:date
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:description and dc:description are both known as "description". Keeping only earlier dcterms:description
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:format and dc:format are both known as "format". Keeping only earlier dcterms:format
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:identifier and dc:identifier are both known as "identifier". Keeping only earlier dcterms:identifier
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:language and dc:language are both known as "language". Keeping only earlier dcterms:language
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:publisher and dc:publisher are both known as "publisher". Keeping only earlier dcterms:publisher
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:relation and dc:relation are both known as "relation". Keeping only earlier dcterms:relation
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:rights and dc:rights are both known as "rights". Keeping only earlier dcterms:rights
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:source and dc:source are both known as "source". Keeping only earlier dcterms:source
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:subject and dc:subject are both known as "subject". Keeping only earlier dcterms:subject
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:title and dc:title are both known as "title". Keeping only earlier dcterms:title
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:type and dc:type are both known as "type". Keeping only earlier dcterms:type
[main] INFO org.gbif.dwc.terms.TermFactory - Property terms dcterms:identifier and acef:ID are both known as "ID". Keeping only earlier dcterms:identifier
[main] INFO org.gbif.dwc.terms.TermFactory - Class terms gbif:Distribution and acef:Distribution are both known as "Distribution". Keeping only earlier gbif:Distribution
[main] INFO org.gbif.dwc.terms.TermFactory - Class terms gbif:Reference and acef:Reference are both known as "Reference". Keeping only earlier gbif:Reference
[main] INFO org.gbif.dwc.terms.TermFactory - Class terms gbif:Reference and acef:Reference are both known as "References". Keeping only earlier gbif:Reference
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file

java.lang.NullPointerException
    at org.gbif.dwc.DwCArchiveStreamHandler.getLocation(DwCArchiveStreamHandler.java:131)
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:76)
    at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:55)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.cmd.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:103)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.cmd.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:103)
    at bio.guoda.preston.cmd.DwcRecordExtractor.on(DwcRecordExtractor.java:67)
    at bio.guoda.preston.process.DwcRecordExtractorTest.streamEncyclopediaOfLife(DwcRecordExtractorTest.java:177)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

jhpoelen commented 2 years ago

Related resource:

Dunn, C.W., Leys, S.P. and Haddock, S.H., 2015. The hidden biology of sponges and ctenophores. Trends in ecology & evolution, 30(5), pp.282-291. https://doi.org/10.1016/j.tree.2015.03.003

https://opentraits.org/datasets/dunn-et-al-2015

https://opendata.eol.org/dataset/7195e9d9-ad49-48f8-96df-95d2934cbf79/resource/3396fd83-87e0-4a90-8fa0-35cf5a070c55/download/dunnetal2015.zip

fyi @jhammock

jhpoelen commented 3 weeks ago

Related, reproduced from the command line:

preston track https://opendata.eol.org/dataset/7195e9d9-ad49-48f8-96df-95d2934cbf79/resource/3396fd83-87e0-4a90-8fa0-35cf5a070c55/download/dunnetal2015.zip\
 | preston dwc-stream

For the associated data, see the attached data.zip package with signature hash://sha256/c5fcb6905ee43449c8ef27a678f12d34a48971cc7f8b84f5d4d87e53d71afb02

producing:

[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN org.gbif.dwc.meta.MetaXMLSaxHandler2 - field found outside of an archive file
[main] WARN bio.guoda.preston.stream.ArchiveStreamHandler - failed to process <zip:hash://sha256/d64918485a1e83c4f9316915b7ec6878b167e68df66bb97d36af7df77e1661d8!/meta.xml>
java.lang.NullPointerException
    at org.gbif.dwc.DwCArchiveStreamHandler.getLocation(DwCArchiveStreamHandler.java:140)
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:77)
    at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:56)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:38)
    at bio.guoda.preston.cmd.DwcRecordExtractor$DwCStreamHandlerImpl.handle(DwcRecordExtractor.java:67)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:63)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:33)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:38)
    at bio.guoda.preston.cmd.DwcRecordExtractor$DwCStreamHandlerImpl.handle(DwcRecordExtractor.java:67)
    at bio.guoda.preston.cmd.ProcessorExtracting.on(ProcessorExtracting.java:53)
    at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:47)
    at bio.guoda.preston.process.EmittingStreamAbstract.copyOnEmit(EmittingStreamAbstract.java:29)
    at bio.guoda.preston.process.EmittingStreamOfAnyVersions.parseAndEmit(EmittingStreamOfAnyVersions.java:35)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:52)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:27)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at bio.guoda.preston.Preston.run(Preston.java:105)
    at bio.guoda.preston.Preston.main(Preston.java:96)
[main] WARN bio.guoda.preston.cmd.ProcessorExtracting - suspicious resource [hash://sha256/d64918485a1e83c4f9316915b7ec6878b167e68df66bb97d36af7df77e1661d8] caused errors in processing
bio.guoda.preston.stream.ContentStreamException: failed to process <zip:hash://sha256/d64918485a1e83c4f9316915b7ec6878b167e68df66bb97d36af7df77e1661d8!/meta.xml>
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:73)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:33)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:38)
    at bio.guoda.preston.cmd.DwcRecordExtractor$DwCStreamHandlerImpl.handle(DwcRecordExtractor.java:67)
    at bio.guoda.preston.cmd.ProcessorExtracting.on(ProcessorExtracting.java:53)
    at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:47)
    at bio.guoda.preston.process.EmittingStreamAbstract.copyOnEmit(EmittingStreamAbstract.java:29)
    at bio.guoda.preston.process.EmittingStreamOfAnyVersions.parseAndEmit(EmittingStreamOfAnyVersions.java:35)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:52)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:27)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at bio.guoda.preston.Preston.run(Preston.java:105)
    at bio.guoda.preston.Preston.main(Preston.java:96)
Caused by: java.lang.NullPointerException
    at org.gbif.dwc.DwCArchiveStreamHandler.getLocation(DwCArchiveStreamHandler.java:140)
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:77)
    at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:56)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:38)
    at bio.guoda.preston.cmd.DwcRecordExtractor$DwCStreamHandlerImpl.handle(DwcRecordExtractor.java:67)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:63)
    ... 19 more
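
The attached data.zip is identified above by a hash://sha256/ signature. A minimal sketch for checking a downloaded copy against that signature, assuming the signature is the lowercase hex SHA-256 digest of the file's bytes; the class name `HashUri` and the default file name are illustrative and not part of preston:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of preston.
public class HashUri {

    // Streams the input through SHA-256 and formats the digest as hash://sha256/<hex>.
    static String of(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder uri = new StringBuilder("hash://sha256/");
        for (byte b : digest.digest()) {
            uri.append(String.format("%02x", b));
        }
        return uri.toString();
    }

    public static void main(String[] args) throws Exception {
        // Path to the downloaded attachment; defaults to data.zip in the working directory.
        String path = args.length > 0 ? args[0] : "data.zip";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            System.out.println(of(in));
        }
    }
}
```

If the printed value matches the signature quoted with the attachment, the local copy is byte-for-byte the same package.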

jhpoelen commented 3 weeks ago

with associated meta.xml

preston cat\
 'zip:hash://sha256/d64918485a1e83c4f9316915b7ec6878b167e68df66bb97d36af7df77e1661d8!/meta.xml'

being

<?xml version="1.0"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://rs.tdwg.org/dwc/text/  http://services.eol.org/schema/dwca/tdwg_dwc_text.xsd">
  <table encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>taxa.txt</location></files>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/taxonID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://eol.org/schema/EOLid"/>
  </table>
  <table encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrences.txt</location></files>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/taxonID"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/lifeStage"/>
  </table>
  <table encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/MeasurementOrFact">
    <files><location>measurementOrFact.txt</location></files>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/measurementID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="2" term="http://eol.org/schema/measurementOfTaxon"/>
    <field index="3" term="http://eol.org/schema/parentMeasurementID"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/measurementType"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/measurementValue"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/measurementRemarks"/>
    <field index="7" term="http://purl.org/dc/terms/source"/>
    <field index="8" term="http://purl.org/dc/terms/bibliographicCitation"/>
  </table>
</archive>
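
One possible reading of the above, not confirmed in this thread: this meta.xml declares its data files with `<table>` elements, whereas the Darwin Core text guide defines `<core>` and `<extension>` as the children of `<archive>`. That would fit the repeated "field found outside of an archive file" warnings and a reader that ends up without a core file location. A minimal sketch, under that assumption, that lists the top-level elements of a meta.xml read from standard input and reports whether a `<core>` is declared; the class `MetaXmlCheck` and its methods are illustrative and not part of preston or the GBIF library:

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical diagnostic, not part of preston or dwca-io.
public class MetaXmlCheck {

    // Returns the local names of the element children of the root <archive> element.
    static List<String> topLevelElements(InputStream metaXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Element root = factory.newDocumentBuilder().parse(metaXml).getDocumentElement();
        List<String> names = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getLocalName());
            }
        }
        return names;
    }

    // True when the descriptor declares a <core> data file.
    static boolean hasCore(List<String> names) {
        return names.contains("core");
    }

    public static void main(String[] args) throws Exception {
        // Reads meta.xml from standard input, e.g. piped from `preston cat`.
        List<String> names = topLevelElements(System.in);
        System.out.println("top-level elements: " + names);
        System.out.println("declares <core>: " + hasCore(names));
    }
}
```

Saved as MetaXmlCheck.java, it can be fed the output of the `preston cat` command above with `| java MetaXmlCheck.java` (Java 11 or later).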