dice-group / Squirrel

Squirrel searches and collects Linked Data

Validate correctness of IRIs #142

Closed adibaba closed 4 years ago

adibaba commented 4 years ago

Some RDF collections created by Squirrel contain invalid IRIs. As a result, importing the RDF data into Fuseki fails.

Examples

File: https://hobbitdata.informatik.uni-leipzig.de/OPAL/processed_datasets/mcloud/mcloud_27-04-2020.zip

ERROR [line: 48, col: 128] Bad character in IRI (space): <https://geoportal.kreis-guetersloh.de/WMS/Schulem/guest?REQUEST=GetCapabilities&SERVICE=predefined[space]...>
org.apache.jena.riot.RiotException: [line: 48, col: 128] Bad character in IRI (space): <https://geoportal.kreis-guetersloh.de/WMS/Schulem/guest?REQUEST=GetCapabilities&SERVICE=predefined[space]...>
        at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:147)
        at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
        at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
        at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:287)
        at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:281)
        at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:250)
        at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:191)
        at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
        at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:91)
        at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
        at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:191)
        at org.apache.jena.riot.RDFParser.read(RDFParser.java:352)
        at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:321)
        at org.apache.jena.riot.RDFParser.parse(RDFParser.java:295)
        at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:540)
        at org.apache.jena.riot.RDFDataMgr.parseFromURI(RDFDataMgr.java:921)
        at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:711)
        at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:680)
        at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:668)
        at org.apache.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:143)
        at org.apache.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:109)
        at org.apache.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:241)
        at org.apache.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:176)
        at org.apache.jena.tdb.TDBLoader.load(TDBLoader.java:68)
        at tdb.tdbloader.loadQuads(tdbloader.java:133)
        at tdb.tdbloader.exec(tdbloader.java:101)
        at jena.cmd.CmdMain.mainMethod(CmdMain.java:93)
        at jena.cmd.CmdMain.mainRun(CmdMain.java:58)
        at jena.cmd.CmdMain.mainRun(CmdMain.java:45)
        at tdb.tdbloader.main(tdbloader.java:48)

The same file also produces warnings (not errors): the values are timestamps, which are not valid lexical forms for xsd:date.

WARN  [line: 60, col: 24] Lexical form '2019-12-09T16:01:44.177Z' not valid for datatype XSD date
WARN  [line: 61, col: 24] Lexical form '2019-12-09T16:01:44.177Z' not valid for datatype XSD date

In another file, the following errors occur. They were probably caused by importing https://hobbitdata.informatik.uni-leipzig.de/OPAL/processed_datasets/govdata/govdata_13-02-2020.tar.gz

Result: failed with message "Parse error: [line: 819454, col: 115] Illegal character in IRI (codepoint 0x7B, '{'): "

Result: failed with message "Parse error: [line: 839488, col: 1 ] Broken IRI (newline): http://www.wesel.de/c1257e3500269489/files/wesel-de-2018-05-01-2018-05-31.zip/$file/wesel-de-2018-05-01-2018-05-31.zip?openelement"

Result: failed with message "Parse error: [line: 987159, col: 36] Bad character in IRI (bad character: '<'): "

Result: failed with message "Parse error: [line: 987162, col: 36] Bad character in IRI (bad character: '<'): "

Example datasets for those are: https://pastebin.com/raw/1MARc5uh

Proposed Solution

Copied from a discussion with mr:

  1. All analyzers should check the RDF resources they create, e.g., via a method in an abstract analyzer class that creates the RDF resources.
  2. If this method detects issues, we should offer a basic strategy to handle the problem. [...] The idea to simply encode the characters that create issues seems to be very good.
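
The two points above could be sketched roughly as follows. This is only an illustration, not Squirrel's code: the class name `IriCheckingAnalyzerSketch` and its methods are hypothetical, `java.net.URI` is used as a stand-in for a real IRI checker (Jena's parser applies its own rules), and the methods return plain strings instead of Jena resources so the example stays self-contained.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical base class; Squirrel's real analyzer classes may look different. */
abstract class IriCheckingAnalyzerSketch {

    /** ASCII characters that are not allowed unescaped in an IRI (assumed list). */
    private static final String FORBIDDEN = " <>\"{}|\\^`";

    /** Stand-in validity check: the string must parse as an absolute URI. */
    static boolean isValidIri(String iri) {
        try {
            return new URI(iri).isAbsolute();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Percent-encodes forbidden and control characters; all other characters are kept. */
    static String encodeForbidden(String iri) {
        StringBuilder sb = new StringBuilder(iri.length());
        for (int i = 0; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c <= 0x20 || c == 0x7F || FORBIDDEN.indexOf(c) >= 0) {
                sb.append(String.format("%%%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Single entry point that analyzers would call before creating an RDF resource. */
    static String createIri(String raw) {
        if (isValidIri(raw)) {
            return raw; // nothing to repair
        }
        String encoded = encodeForbidden(raw);
        if (!isValidIri(encoded)) {
            // Basic strategy failed; the caller decides whether to skip the triple.
            throw new IllegalArgumentException("Cannot repair IRI: " + raw);
        }
        return encoded;
    }
}
```

Routing every IRI through one method like `createIri` keeps the check and the repair strategy in a single place, so individual analyzers do not need their own handling.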

MichaelRoeder commented 4 years ago

Suggested implementation

gsjunior86 commented 4 years ago

@MichaelRoeder, the ResourceFactory class has a private constructor, so it cannot be extended. Instead, I created the TripleEncoder class, which escapes only the request parameters, using the escaping rules from org.apache.jena.util.URIref.
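
For illustration, escaping only the request parameters might look roughly like the sketch below. It is not the actual TripleEncoder and does not call org.apache.jena.util.URIref; the class name `QueryEncoderSketch`, the method `encodeQueryOnly`, and the list of escaped characters are assumptions made for this example.

```java
/** Hypothetical helper; the real TripleEncoder in Squirrel may work differently. */
final class QueryEncoderSketch {

    /** ASCII characters escaped here (assumed list, not taken from URIref). */
    private static final String FORBIDDEN = " <>\"{}|\\^`";

    private QueryEncoderSketch() {
    }

    /** Percent-encodes problematic characters after the first '?', leaving the rest as is. */
    static String encodeQueryOnly(String iri) {
        int q = iri.indexOf('?');
        if (q < 0) {
            return iri; // no request parameters, nothing to escape
        }
        StringBuilder sb = new StringBuilder(iri.substring(0, q + 1));
        for (int i = q + 1; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c <= 0x20 || c == 0x7F || FORBIDDEN.indexOf(c) >= 0) {
                sb.append(String.format("%%%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Restricting the escaping to the part after `?` leaves scheme, host, and path untouched, but it also means that an illegal character in the path would still reach the parser.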