bio-guoda / preston

a biodiversity dataset tracker
MIT License

[preston grep] crash on malformed nquad #292

Closed (jhpoelen closed 2 weeks ago)

jhpoelen commented 2 weeks ago

On encountering a malformed nquad, such as the one printed by

preston cat 'line:hash://sha256/9afacaefe9732946005066b7cf5310020e3ab4f89a3c9e80b169e0982b5cb798!/L1105853'
<https://archive.org/download/studyofleaves00denn/studyofleaves00denn_djvu.txt> <http://purl.org/pav/hasVersion> <hash://sha256/8aaed7911f9fc<urn:uuid:cec9d97d-d0d0-47d3-8367-dfbb41e31ecf> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> <urn:uuid:cec9d97d-d0d0-47d3-8367-dfbb41e31ecf> .

[preston grep] crashed with

java.lang.IllegalArgumentException: Illegal character in path at index 27: hash://sha256/8aaed7911f9fc<urn:uuid:cec9d97d-d0d0-47d3-8367-dfbb41e31ecf
    at java.net.URI.create(URI.java:852)
    at org.apache.commons.rdf.simple.IRIImpl.<init>(IRIImpl.java:33)
    at org.apache.commons.rdf.simple.SimpleRDF.createIRI(SimpleRDF.java:82)
    at bio.guoda.preston.RefNodeFactory.toIRI(RefNodeFactory.java:25)
    at bio.guoda.preston.store.VersionUtil.mostRecentVersion(VersionUtil.java:114)
    at bio.guoda.preston.store.VersionUtil.getMostRecentContentId(VersionUtil.java:133)
    at bio.guoda.preston.process.EmittingStreamOfAnyVersions.parseAndEmit(EmittingStreamOfAnyVersions.java:32)
    at bio.guoda.preston.cmd.CmdGrep.run(CmdGrep.java:77)
    at bio.guoda.preston.cmd.CmdGrep.run(CmdGrep.java:48)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at bio.guoda.preston.Preston.run(Preston.java:103)
    at bio.guoda.preston.Preston.main(Preston.java:94)
Caused by: java.net.URISyntaxException: Illegal character in path at index 27: hash://sha256/8aaed7911f9fc<urn:uuid:cec9d97d-d0d0-47d3-8367-dfbb41e31ecf
    at java.net.URI$Parser.fail(URI.java:2847)
    at java.net.URI$Parser.checkChars(URI.java:3020)
    at java.net.URI$Parser.parseHierarchical(URI.java:3104)
    at java.net.URI$Parser.parse(URI.java:3052)
    at java.net.URI.<init>(URI.java:588)
    at java.net.URI.create(URI.java:850)
    ... 18 more
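The root cause shown in the trace can be reproduced outside of preston. The following is a minimal sketch that uses only the JDK's java.net.URI; the class and method names (MalformedIriDemo, firstIllegalIndex) are hypothetical and are not part of preston.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not part of preston.
final class MalformedIriDemo {

    // Object term of the malformed nquad, copied from the exception message above.
    static final String MALFORMED =
        "hash://sha256/8aaed7911f9fc<urn:uuid:cec9d97d-d0d0-47d3-8367-dfbb41e31ecf";

    // Returns the index of the first illegal character reported by java.net.URI,
    // or -1 if the candidate parses as a URI.
    static int firstIllegalIndex(String candidate) {
        try {
            new URI(candidate);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        // The '<' of the run-on second statement is the offending character.
        System.out.println(firstIllegalIndex(MALFORMED));
    }
}
```

URI.create wraps the same URISyntaxException in an unchecked IllegalArgumentException, which is why the error propagated up to the command-line entry point in the trace above.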
jhpoelen commented 2 weeks ago

Fixed by making version matching patterns more specific. Malformed nquads are now skipped by [preston grep].
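As an illustration of what a more specific version matching pattern could look like, here is a sketch using java.util.regex. It is not the pattern used in preston: the class name StrictVersionMatcher, the regular expression, and the restriction to 64-character sha256 content ids are all assumptions made for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from preston.
final class StrictVersionMatcher {

    // Assumed pattern: subject, pav:hasVersion, a sha256 content id with exactly
    // 64 lowercase hex digits, an optional graph label, and the closing " .".
    // IRIs may not contain '<', '>' or whitespace, so run-on lines do not match.
    static final Pattern HAS_VERSION = Pattern.compile(
        "<[^<>\\s]+> <http://purl\\.org/pav/hasVersion> "
            + "<(hash://sha256/[0-9a-f]{64})>( <[^<>\\s]+>)? \\.");

    // Returns the content id if the whole line is a well-formed version statement;
    // otherwise returns empty so the caller can skip the line instead of crashing.
    static Optional<String> contentId(String line) {
        Matcher m = HAS_VERSION.matcher(line.trim());
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Placeholder statement for illustration only (hypothetical subject).
        String wellFormed = "<https://example.org/data.txt> "
            + "<http://purl.org/pav/hasVersion> "
            + "<hash://sha256/9afacaefe9732946005066b7cf5310020e3ab4f89a3c9e80b169e0982b5cb798> "
            + "<urn:uuid:cec9d97d-d0d0-47d3-8367-dfbb41e31ecf> .";
        System.out.println(contentId(wellFormed).isPresent());
    }
}
```

With a pattern of this kind, the malformed line from the report is rejected before any IRI is constructed, because its content id is cut short by the start of a second statement.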