bio-guoda / preston-sernec

Prototype to help track records and other content associated with South East Regional Network of Expertise and Collections (SERNEC) Thematic Collection Network (TCN), a collaboration that is digitizing and making data accessible for over 3 million plant specimens.
Creative Commons Zero v1.0 Universal

unexpected content found #1

Open jhpoelen opened 1 year ago

jhpoelen commented 1 year ago

While tracking SERNEC Darwin Core Archives (see also https://github.com/bio-guoda/preston/issues/212), the following results were seen on

preston head

yielding

hash://sha256/bfeb3bb5744ce426c91c274ddf1a739205556c4c5b4fe8ff8081ad65c7cff1ce

and

preston head\
 | preston cat\
 | preston dwc-stream

yielding
main] WARN bio.guoda.preston.cmd.DwcRecordExtractor - suspicious DwC resource [hash://sha256/6030ec10db9a8e86bbfa9a8e4e0f755097a1444218a6f21641694d962f873a04] caused errors in processing
bio.guoda.preston.stream.ContentStreamException: failed to process entry [meta.xml] in [hash://sha256/6030ec10db9a8e86bbfa9a8e4e0f755097a1444218a6f21641694d962f873a04]
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:71)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:33)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:38)
    at bio.guoda.preston.cmd.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:103)
    at bio.guoda.preston.cmd.DwcRecordExtractor.on(DwcRecordExtractor.java:67)
    at bio.guoda.preston.cmd.CmdDwcRecordStream$1.emit(CmdDwcRecordStream.java:47)
    at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:57)
    at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:46)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:52)
    at bio.guoda.preston.cmd.CmdDwcRecordStream.run(CmdDwcRecordStream.java:27)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at bio.guoda.preston.Preston.run(Preston.java:92)
    at bio.guoda.preston.Preston.main(Preston.java:82)
Caused by: bio.guoda.preston.stream.ContentStreamException: failed to handle dwc records from [zip:hash://sha256/6030ec10db9a8e86bbfa9a8e4e0f755097a1444218a6f21641694d962f873a04!/occurrences.csv]
    at org.gbif.dwc.DwCArchiveStreamHandler.streamRecords(DwCArchiveStreamHandler.java:106)
    at org.gbif.dwc.DwCArchiveStreamHandler.handleAssumedDwCArchive(DwCArchiveStreamHandler.java:73)
    at org.gbif.dwc.DwCArchiveStreamHandler.handle(DwCArchiveStreamHandler.java:55)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:38)
    at bio.guoda.preston.cmd.DwcRecordExtractor$MyContentStreamHandlerImpl.handle(DwcRecordExtractor.java:103)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:63)
    ... 19 more
Caused by: java.lang.IllegalStateException: java.text.ParseException: Unexpected character ('o' (code 111)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 15028, column: 888]
    at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:85)
    at org.gbif.dwc.DwCArchiveStreamHandler.streamRecords(DwCArchiveStreamHandler.java:102)
    ... 24 more
Caused by: java.text.ParseException: Unexpected character ('o' (code 111)): Expected column separator character (',' (code 44)) or end-of-line
 at [Source: (BufferedReader); line: 15028, column: 888]
    at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
    at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
    at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:78)
    ... 25 more
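
A possible next step, sketched here rather than run as part of this report: the parser points at line 15028, column 888 of occurrences.csv, so printing a window of characters around that position may show what tripped it. This kind of message often indicates an unescaped double quote inside a quoted field, but that is only a guess until the line is inspected. `show_context` below is a hypothetical helper, not a preston command. It assumes the reported line and column are 1-based positions in the raw file, and that `cut -c` counts characters the same way the parser does (it may not for multi-byte UTF-8 text).

```shell
# Hypothetical helper (not part of preston): read a text file from stdin and
# print the characters around a reported (line, column) position.
#   $1 = line number, $2 = column number, $3 = padding on each side (default 40)
show_context() {
  line="$1"
  col="$2"
  pad="${3:-40}"
  # Clamp the window start so it never drops below the first column.
  start=$(( col - pad ))
  if [ "$start" -lt 1 ]; then
    start=1
  fi
  end=$(( col + pad ))
  # Select the requested line, then keep only the window of characters.
  sed -n "${line}p" | cut -c "${start}-${end}"
}

# Intended use. Assumption: the zip: addressing that appears in the error
# message above can be passed to `preston cat` for occurrences.csv.
#   preston cat 'zip:hash://sha256/6030ec10db9a8e86bbfa9a8e4e0f755097a1444218a6f21641694d962f873a04!/occurrences.csv' \
#    | show_context 15028 888
```

If the window shows a quote character directly followed by other text inside a quoted field, the archive likely needs fixing at the source rather than in preston.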

jhpoelen commented 1 year ago

This is related to the resource description (eml.xml) of the same archive, found using:

preston cat 'zip:hash://sha256/6030ec10db9a8e86bbfa9a8e4e0f755097a1444218a6f21641694d962f873a04!/eml.xml'\
 | xmllint -format -

yielding:

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:dc="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.0.1/eml.xsd" packageId="bb2eef13-701c-48c4-b16d-c82ada0909fb" system="https://symbiota.org" scope="system" xml:lang="eng">
  <dataset>
    <alternateIdentifier>https://sernecportal.org/portal/collections/misc/collprofiles.php?collid=424</alternateIdentifier>
    <title xml:lang="eng">Great Smoky Mountains National Park</title>
    <creator id="17dd4489-ee9c-414a-b99b-871873f3d335">
      <organizationName>SERNEC</organizationName>
      <electronicMailAddress>herrick.brown@gmail.com</electronicMailAddress>
      <onlineUrl>https://sernecportal.org/portal/index.php</onlineUrl>
    </creator>
    <metadataProvider>
      <organizationName>SERNEC</organizationName>
      <electronicMailAddress>herrick.brown@gmail.com</electronicMailAddress>
      <onlineUrl>https://sernecportal.org/portal/index.php</onlineUrl>
    </metadataProvider>
    <pubDate>2023-01-03</pubDate>
    <language>eng</language>
    <contact>
      <organizationName>Great Smoky Mountains National Park</organizationName>
      <electronicMailAddress>Baird_Todd@nps.gov</electronicMailAddress>
      <onlineUrl>https://www.nps.gov/grsm/learn/nature/workspace_collections.htm</onlineUrl>
    </contact>
    <associatedParty>
      <individualName>
        <surName>Todd</surName>
        <givenName>Baird</givenName>
      </individualName>
      <electronicMailAddress>Baird_Todd@nps.gov</electronicMailAddress>
      <positionName>Museum Curator</positionName>
      <role>contentProvider</role>
    </associatedParty>
    <intellectualRights>
      <para>To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the <ulink url="http://creativecommons.org/licenses/by-nc/3.0/"><citetitle/></ulink></para>
    </intellectualRights>
  </dataset>
  <additionalMetadata>
    <metadata>
      <symbiota id="17dd4489-ee9c-414a-b99b-871873f3d335">
        <dateStamp>2023-01-03T16:55:24-07:00</dateStamp>
        <citation identifier="b10561a4-179b-4080-971d-01942d352e69">SERNEC - b10561a4-179b-4080-971d-01942d352e69</citation>
        <physical>
          <characterEncoding>UTF-8</characterEncoding>
          <dataFormat>
            <externallyDefinedFormat>
              <formatName>Darwin Core Archive</formatName>
            </externallyDefinedFormat>
          </dataFormat>
        </physical>
        <collection identifier="0552b141-a79c-4817-8558-edf1ce44e71a" id="424">
          <alternateIdentifier>https://sernecportal.org/portal/collections/misc/collprofiles.php?collid=424</alternateIdentifier>
          <parentCollectionIdentifier>GSMNP</parentCollectionIdentifier>
          <collectionIdentifier/>
          <collectionName>Great Smoky Mountains National Park</collectionName>
          <resourceLogoUrl>https://sernecportal.org/portal/content/collicon/gsmnp.png</resourceLogoUrl>
          <onlineUrl>https://www.nps.gov/grsm/learn/nature/workspace_collections.htm</onlineUrl>
          <intellectualRights>http://creativecommons.org/licenses/by-nc/3.0/</intellectualRights>
          <abstract/>
          <associatedParty>
            <individualName>
              <surName>Todd</surName>
              <givenName>Baird</givenName>
            </individualName>
            <electronicMailAddress>Baird_Todd@nps.gov</electronicMailAddress>
            <positionName>Museum Curator</positionName>
          </associatedParty>
        </collection>
      </symbiota>
    </metadata>
  </additionalMetadata>
</eml:eml>