Closed jhpoelen closed 2 years ago
The crash appears to be related to cpio archives (https://en.wikipedia.org/wiki/Cpio).
After adding context to error logging, the content in which the error occurred was found:
[main] WARN bio.guoda.preston.stream.ArchiveStreamHandler - failed to process some entry in [line:zip:hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6!/name.tsv!/L185815]
java.lang.NumberFormatException: For input string: "76-e64"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.base/java.lang.Long.parseLong(Long.java:692)
at org.apache.commons.compress.archivers.cpio.CpioArchiveInputStream.readAsciiLong(CpioArchiveInputStream.java:376)
Tracing the provenance of this content with
preston ls | grep "hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6"
shows that the content originated from:
<https://api.checklistbank.org/dataset/127379/archive.zip> <http://purl.org/pav/hasVersion> <hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6> <urn:uuid:6d44686a-ad03-4b28-8fb7-a898f26c0c44> .
This archive is associated with https://checklistbank.org/dataset/127379,
the "Australian Faunal Directory 2022-10-12" dataset, created by @rdmpage (see attached screenshot).
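For readers who want to do this lookup programmatically rather than with grep, here is a minimal Python sketch. It assumes each provenance statement is a single line of four angle-bracketed IRIs followed by " ." (as in the statement shown above); the function name `origins_of` is hypothetical and not part of preston.

```python
import re
from typing import Iterable, List

# Matches one statement of the form: <subject> <predicate> <object> <graph> .
QUAD = re.compile(r"^<([^>]*)> <([^>]*)> <([^>]*)> <([^>]*)> \.$")
HAS_VERSION = "http://purl.org/pav/hasVersion"

def origins_of(content_id: str, statements: Iterable[str]) -> List[str]:
    """Return the subjects (origin URLs) whose hasVersion object equals content_id."""
    origins = []
    for statement in statements:
        match = QUAD.match(statement.strip())
        if match and match.group(2) == HAS_VERSION and match.group(3) == content_id:
            origins.append(match.group(1))
    return origins

content_id = "hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6"
statements = [
    "<https://api.checklistbank.org/dataset/127379/archive.zip> "
    "<http://purl.org/pav/hasVersion> "
    "<" + content_id + "> "
    "<urn:uuid:6d44686a-ad03-4b28-8fb7-a898f26c0c44> .",
]
print(origins_of(content_id, statements))
```

In practice the statements would come from the output of `preston ls`, read line by line.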
In looking at the offending line:
preston cat --remote https://linker.bio 'line:zip:hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6!/name.tsv!/L185815'
the following is produced:
07070776-e643-47bf-afab-22da04e3fd9c Lasioglossum (Chilalictus) nefrens Walker, 1995 Walker species Lasioglossum Chilalictus nefrens I
So it appears that Apache Commons Compress v1.20 interprets part of the UUID as a number:
07070776-e643-47bf-afab-22da04e3fd9c
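The root cause of this issue was that the UUID of some records starts with the same "magic" bytes (the characters 070707, see https://github.com/bio-guoda/preston/commit/7b8309f73eff304c0e4f9bd66f457ad1b9357df0#diff-75010cab53467b9aed2237fd0c7490b8c6af019af372d93132faa17a352f9bf3R32) that are used to automatically detect CPIO archives (https://en.wikipedia.org/wiki/Cpio). Once the line was mistaken for a CPIO header, the characters following the magic ("76-e64") were parsed as a numeric header field, which is where the NumberFormatException came from.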
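To illustrate the misdetection, here is a minimal Python sketch. It is not the Commons Compress code; it assumes the old portable ASCII cpio layout, in which the 6-character magic 070707 is followed by fixed-width octal fields, and the function names are hypothetical.

```python
CPIO_ODC_MAGIC = b"070707"  # magic of the old portable ASCII cpio format

def looks_like_cpio(content: bytes) -> bool:
    """Signature-only detection: inspects nothing but the first six bytes."""
    return content.startswith(CPIO_ODC_MAGIC)

def read_octal_field(content: bytes, offset: int, length: int = 6) -> int:
    """Parse a fixed-width ASCII octal header field; raises ValueError on non-octal text."""
    return int(content[offset:offset + length].decode("ascii"), 8)

# Start of the offending TSV line: a UUID that happens to begin with 070707.
line = b"07070776-e643-47bf-afab-22da04e3fd9c\tLasioglossum (Chilalictus) nefrens"

print(looks_like_cpio(line))  # the TSV line passes the signature check
try:
    read_octal_field(line, offset=6)  # the next six characters are "76-e64"
except ValueError as err:
    print("header parse failed:", err)
```

The failing slice is the same "76-e64" string that appears in the Java exception message.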
The fix explicitly excludes CPIO from the formats considered when attempting to stream content from archives.
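The idea behind the fix can be sketched as signature-based detection that leaves cpio out. This is a hypothetical Python illustration, not preston's Java implementation; the signature table and `detect_archive` are assumptions made for the example.

```python
from typing import FrozenSet, Optional

# Leading-byte signatures for a few archive formats (illustrative subset).
SIGNATURES = {
    "zip": b"PK\x03\x04",
    "ar": b"!<arch>\n",
    "cpio": b"070707",
}

def detect_archive(content: bytes,
                   excluded: FrozenSet[str] = frozenset({"cpio"})) -> Optional[str]:
    """Return the name of the first matching, non-excluded format, or None."""
    for name, magic in SIGNATURES.items():
        if name not in excluded and content.startswith(magic):
            return name
    return None

uuid_line = b"07070776-e643-47bf-afab-22da04e3fd9c\tLasioglossum"

print(detect_archive(uuid_line))                        # cpio excluded: no archive detected
print(detect_archive(uuid_line, excluded=frozenset()))  # cpio allowed: misdetected
print(detect_archive(b"PK\x03\x04rest-of-zip"))         # zip is still detected
```

The trade-off is that genuine cpio archives are no longer streamed, in exchange for not misreading plain text that happens to begin with the cpio magic.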
@rdmpage @mielliott another great example of how big datasets can cause unknown crash buttons to be pushed.
Apologies for the false alarm.
After introducing the fix, preston is able to print:
preston cat --remote https://linker.bio 'line:zip:hash://sha256/9af95ab26a1886db0f17b2102838a9b898a1627872590afb485044cb25d2a5c6!/name.tsv!/L185815,L185816'
07070776-e643-47bf-afab-22da04e3fd9c Lasioglossum (Chilalictus) nefrens Walker, 1995 Walker species Lasioglossum Chilalictus nefrens ICZN established 1995
d32cc892-9c46-4e81-8b62-71df1ce34437 Lasioglossum (Chilalictus) nefrens Walker, 1995 Walker species Lasioglossum Chilalictus nefrens ICZN established 71f8aa80-adab-4783-8c23-143903dc8213 1995 j