bio-guoda / preston

a biodiversity dataset tracker
MIT License

preston grep crashes on java.lang.NumberFormatException: For input string: "76-e64" from CpioArchiveInputStream #200

Closed jhpoelen closed 2 years ago

jhpoelen commented 2 years ago

when running

preston cat --remote https://linker.bio hash://sha256/9656eb63e2d08d382224cee1e28361221adfacdbab10aba418beb67b686166dd\
 | preston grep --remote https://linker.bio "Aglais io"\
 | tee aglais-io-matches.nq\
 | pv -l\
 | grep "#value"

the following was produced:

[...]
758 false   ICZN    http://purl.obolibrary.org/obo/NOMEN_0000224        " <urn:uuid:c3bbe4cd-dc34-48a5-ae15-f8763395cf3a> .
<line:zip:hash://sha256/0ea08c031ac2774c652da70dd5f39898b6cd4545057971694dd4396cd93566c0!/Name.csv!/L359237> <http://www.w3.org/ns/prov#value> "278974  278974-a6deb3b89ef0deee528323aba5195e68 Aglais io geisha    (Stichel, 1908) subspecies      Aglais      io  geisha  1908    false   ICZN    http://purl.obolibrary.org/obo/NOMEN_0000224        " <urn:uuid:c3bbe4cd-dc34-48a5-ae15-f8763395cf3a> .
java.lang.NumberFormatException: For input string: "76-e64"
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Long.parseLong(Long.java:692)
    at org.apache.commons.compress.archivers.cpio.CpioArchiveInputStream.readAsciiLong(CpioArchiveInputStream.java:376)
    at org.apache.commons.compress.archivers.cpio.CpioArchiveInputStream.readOldAsciiEntry(CpioArchiveInputStream.java:425)
    at org.apache.commons.compress.archivers.cpio.CpioArchiveInputStream.getNextCPIOEntry(CpioArchiveInputStream.java:261)
    at org.apache.commons.compress.archivers.cpio.CpioArchiveInputStream.getNextEntry(CpioArchiveInputStream.java:536)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:51)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.process.TextMatcher$MyContentStreamHandlerImpl.handle(TextMatcher.java:110)
    at bio.guoda.preston.stream.LineStreamHandler.extractLines(LineStreamHandler.java:53)
    at bio.guoda.preston.stream.LineStreamHandler.handle(LineStreamHandler.java:35)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.process.TextMatcher$MyContentStreamHandlerImpl.handle(TextMatcher.java:110)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handleArchiveEntries(ArchiveStreamHandler.java:60)
    at bio.guoda.preston.stream.ArchiveStreamHandler.handle(ArchiveStreamHandler.java:29)
    at bio.guoda.preston.stream.ContentStreamHandlerImpl.handle(ContentStreamHandlerImpl.java:35)
    at bio.guoda.preston.process.TextMatcher$MyContentStreamHandlerImpl.handle(TextMatcher.java:110)
    at bio.guoda.preston.process.TextMatcher.on(TextMatcher.java:71)
    at bio.guoda.preston.cmd.CmdGrep$1.emit(CmdGrep.java:72)
    at bio.guoda.preston.process.EmittingStreamRDF.copyOnEmit(EmittingStreamRDF.java:57)
    at bio.guoda.preston.process.EmittingStreamRDF.parseAndEmit(EmittingStreamRDF.java:46)
    at bio.guoda.preston.cmd.CmdGrep.run(CmdGrep.java:77)
    at bio.guoda.preston.cmd.CmdGrep.run(CmdGrep.java:48)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at bio.guoda.preston.Preston.run(Preston.java:91)
    at bio.guoda.preston.Preston.main(Preston.java:80)
 342k 1:00:00 [95.1 /s] [                                                                     <=>                                               ]
jhpoelen commented 2 years ago

The crash appears to be related to cpio archives: https://en.wikipedia.org/wiki/Cpio

jhpoelen commented 2 years ago

After adding context to error logging, the content in which the error occurred was found:

[main] WARN bio.guoda.preston.stream.ArchiveStreamHandler - failed to process some entry in [line:zip:hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6!/name.tsv!/L185815]
java.lang.NumberFormatException: For input string: "76-e64"
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Long.parseLong(Long.java:692)
    at org.apache.commons.compress.archivers.cpio.CpioArchiveInputStream.readAsciiLong(CpioArchiveInputStream.java:376)
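
A minimal sketch of what "adding context to error logging" could look like: the entry processing is wrapped so that a failure is logged together with the identifier of the content being processed. The class, interface, and method names are hypothetical and are not taken from preston; java.util.logging stands in for whatever logging library preston uses, and whether preston continues or stops after logging is not stated here.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: all names below are hypothetical, not preston's actual code.
class ContextLoggingSketch {
    private static final Logger LOG = Logger.getLogger(ContextLoggingSketch.class.getName());

    // A unit of work that may fail while handling one archive entry.
    interface EntryTask {
        void run() throws Exception;
    }

    // Runs the task; on failure, logs a warning that names the content being processed.
    // Returns true when the task completed, false when it failed.
    static boolean processWithContext(String contentIri, EntryTask task) {
        try {
            task.run();
            return true;
        } catch (Exception e) {
            LOG.log(Level.WARNING, "failed to process some entry in [" + contentIri + "]", e);
            return false;
        }
    }

    public static void main(String[] args) {
        String iri = "line:zip:hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6!/name.tsv!/L185815";
        // Simulates the failing parse reported in this issue.
        boolean ok = processWithContext(iri, () -> Long.parseLong("76-e64", 8));
        System.out.println("processed without error: " + ok);
    }
}
```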

Tracing the provenance of this content with

preston ls | grep "hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6"

shows that the content originated from:

<https://api.checklistbank.org/dataset/127379/archive.zip> <http://purl.org/pav/hasVersion> <hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6> <urn:uuid:6d44686a-ad03-4b28-8fb7-a898f26c0c44> .

This archive is associated with https://checklistbank.org/dataset/127379, or "Australian Faunal Directory 2022-10-12", created by @rdmpage according to the attached screenshots.

Screenshot from 2022-11-04 10-43-56 Screenshot from 2022-11-04 10-43-42

jhpoelen commented 2 years ago

Looking at the offending line:

preston cat --remote https://linker.bio  'line:zip:hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6!/name.tsv!/L185815'

the following is produced:

07070776-e643-47bf-afab-22da04e3fd9c    Lasioglossum (Chilalictus) nefrens Walker, 1995 Walker  species     Lasioglossum    Chilalictus nefrens     I
jhpoelen commented 2 years ago

So it appears that, for some reason, Apache Commons Compress v1.20 interprets part of the UUID as a number.

07070776-e643-47bf-afab-22da04e3fd9c
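
The failure can be reproduced in isolation with a small sketch. It assumes the old ASCII cpio header layout, in which a six-character magic is followed by six-character octal fields, so the six characters after "070707" are the first thing parsed as a number. The class and method names are hypothetical; this is not the commons-compress code, only an illustration of the same parse.

```java
// Sketch only: mimics how an old ASCII cpio header would be read from a text line.
// Class and method names are hypothetical, not taken from commons-compress or preston.
class CpioHeaderParseSketch {

    // The header starts with a 6-character magic; the next 6 characters
    // are the first octal header field.
    static String firstFieldAfterMagic(String line) {
        return line.substring(6, 12);
    }

    // Parses a header field as an octal number, as a cpio reader would.
    static long parseOctalField(String field) {
        return Long.parseLong(field, 8);
    }

    public static void main(String[] args) {
        String line = "07070776-e643-47bf-afab-22da04e3fd9c\tLasioglossum (Chilalictus) nefrens Walker, 1995";
        String field = firstFieldAfterMagic(line);
        System.out.println("field after magic: " + field);
        try {
            parseOctalField(field);
        } catch (NumberFormatException e) {
            // "-" and "e" are not octal digits, so the parse fails.
            System.out.println("parse failed: " + e.getMessage());
        }
    }
}
```

The extracted field is "76-e64", the same input string reported in the stack trace.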

jhpoelen commented 2 years ago

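The root cause of this issue was that the UUID of some records started with the same "magic" bytes (e.g., UTF-8 070707, see https://github.com/bio-guoda/preston/commit/7b8309f73eff304c0e4f9bd66f457ad1b9357df0#diff-75010cab53467b9aed2237fd0c7490b8c6af019af372d93132faa17a352f9bf3R32) that are used to automatically detect CPIO archives (https://en.wikipedia.org/wiki/Cpio). The line was therefore treated as a CPIO archive, and the characters following the magic bytes ("76-e64") were read as a numeric header field, which triggered the NumberFormatException.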
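
To illustrate why signature-based detection misfires here, the following sketch checks whether content starts with the ASCII characters 070707. It is a simplified stand-in for archive auto-detection, with hypothetical class and method names, and is not the detection code used by commons-compress or preston.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch only: a simplified stand-in for magic-byte detection; names are hypothetical.
class CpioMagicSketch {
    // Magic characters discussed in this issue, as ASCII bytes.
    static final byte[] CPIO_MAGIC = "070707".getBytes(StandardCharsets.US_ASCII);

    // Returns true when the content begins with the magic bytes.
    static boolean startsWithCpioMagic(byte[] content) {
        if (content.length < CPIO_MAGIC.length) {
            return false;
        }
        byte[] head = Arrays.copyOf(content, CPIO_MAGIC.length);
        return Arrays.equals(head, CPIO_MAGIC);
    }

    public static void main(String[] args) {
        String[] uuids = {
            "07070776-e643-47bf-afab-22da04e3fd9c",
            "d32cc892-9c46-4e81-8b62-71df1ce34437"
        };
        for (String uuid : uuids) {
            boolean match = startsWithCpioMagic(uuid.getBytes(StandardCharsets.US_ASCII));
            System.out.println(uuid + " looks like cpio: " + match);
        }
    }
}
```

A check like this cannot tell a real archive from plain text that happens to begin with the same characters, which is what happened with the first UUID.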

The fix explicitly drops CPIO support when attempting to stream content from archives.
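
One way to picture this kind of fix is an allow-list: content is only unpacked when the detected format is one that is explicitly supported, and "cpio" is left out. This is a sketch under assumptions, not the actual change (see the linked commit for that); the class name, method name, and the set of formats shown are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only: an allow-list of archive formats; names and the format set are assumptions.
class SupportedArchiveSketch {
    // Formats that would be streamed; "cpio" is deliberately absent.
    static final Set<String> STREAMABLE_FORMATS = new HashSet<>(Arrays.asList("zip", "tar"));

    // Returns true only when a format was detected and it is on the allow-list.
    static boolean shouldUnpack(String detectedFormat) {
        return detectedFormat != null && STREAMABLE_FORMATS.contains(detectedFormat);
    }

    public static void main(String[] args) {
        System.out.println("zip: " + shouldUnpack("zip"));
        System.out.println("cpio: " + shouldUnpack("cpio"));
    }
}
```

With such a filter, a text line that happens to match the CPIO signature is treated as ordinary content rather than as an archive.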

jhpoelen commented 2 years ago

@rdmpage @mielliott another great example of how big datasets can cause unknown crash buttons to be pushed.

Apologies for the false alarm.

jhpoelen commented 2 years ago

After introducing the fix, preston is able to print the offending line and the line that follows it:

preston cat --remote https://linker.bio 'line:zip:hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6!/name.tsv!/L185815,L185816'
07070776-e643-47bf-afab-22da04e3fd9c    Lasioglossum (Chilalictus) nefrens Walker, 1995 Walker  species     Lasioglossum    Chilalictus nefrens     ICZN    established     1995    
d32cc892-9c46-4e81-8b62-71df1ce34437    Lasioglossum (Chilalictus) nefrens Walker, 1995 Walker  species     Lasioglossum    Chilalictus nefrens     ICZN    established 71f8aa80-adab-4783-8c23-143903dc8213    1995    j