Problem with parsing dbPedia-urls (HTTPBasedSameAsRetriever)

Linux249 commented 7 years ago

Hi,

We create our own NIF files for D2KB experiments and test them with all given annotators (uploading them through the user interface). In our NIF files we link each annotation to Wikidata (e.g. itsrdf:taIdentRef https://www.wikidata.org/wiki/Q1558).

While running GERBIL locally (1.2.5, with a cache folder added manually) I found a lot of errors/infos, especially this one:

2017-04-10 08:22:19,888 [pool-1-thread-14] INFO [org.aksw.gerbil.semantic.sameas.impl.http.HTTPBasedSameAsRetriever] - <Exception while sending request for "http://dbpedia.org/resource/Thomas_Sturges_"Tom"_Watson". Returning null.>
java.lang.IllegalArgumentException: Illegal character in path at index 43: http://dbpedia.org/resource/Thomas_Sturges_"Tom"_Watson
    at java.net.URI.create(URI.java:852)
    at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:69)
    at org.aksw.gerbil.http.AbstractHttpRequestEmitter.createGetRequest(AbstractHttpRequestEmitter.java:102)
    at org.aksw.gerbil.semantic.sameas.impl.http.HTTPBasedSameAsRetriever.requestModel(HTTPBasedSameAsRetriever.java:88)
    at org.aksw.gerbil.semantic.sameas.impl.http.HTTPBasedSameAsRetriever.retrieveSameURIs(HTTPBasedSameAsRetriever.java:57)
    at org.aksw.gerbil.semantic.sameas.impl.http.HTTPBasedSameAsRetriever.retrieveSameURIs(HTTPBasedSameAsRetriever.java:178)
    at org.aksw.gerbil.semantic.sameas.impl.DomainBasedSameAsRetrieverManager.retrieveSameURIs(DomainBasedSameAsRetrieverManager.java:70)
    at org.aksw.gerbil.semantic.sameas.impl.DomainBasedSameAsRetrieverManager.retrieveSameURIs(DomainBasedSameAsRetrieverManager.java:58)
    at org.aksw.gerbil.semantic.sameas.impl.UriFilteringSameAsRetrieverDecorator.retrieveSameURIs(UriFilteringSameAsRetrieverDecorator.java:56)
    at org.aksw.gerbil.semantic.sameas.impl.CrawlingSameAsRetrieverDecorator.addSameURIs(CrawlingSameAsRetrieverDecorator.java:72)
    at org.aksw.gerbil.semantic.sameas.impl.CrawlingSameAsRetrieverDecorator.retrieveSameURIs(CrawlingSameAsRetrieverDecorator.java:51)
    at org.aksw.gerbil.semantic.sameas.impl.cache.FileBasedCachingSameAsRetriever.retrieveSameURIs(FileBasedCachingSameAsRetriever.java:131)
    at org.aksw.gerbil.semantic.sameas.impl.AbstractSameAsRetrieverDecorator.addSameURIs(AbstractSameAsRetrieverDecorator.java:43)
    at org.aksw.gerbil.semantic.sameas.SameAsRetrieverUtils.addSameURIsToMarkings(SameAsRetrieverUtils.java:30)
    at org.aksw.gerbil.dataset.AbstractDatasetConfiguration.getPreparedDataset(AbstractDatasetConfiguration.java:75)
    at org.aksw.gerbil.dataset.AbstractDatasetConfiguration.getDataset(AbstractDatasetConfiguration.java:50)
    at org.aksw.gerbil.execute.ExperimentTask.run(ExperimentTask.java:102)
    at org.aksw.simba.topicmodeling.concurrent.workers.WorkerImpl.run(WorkerImpl.java:44)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.URISyntaxException: Illegal character in path at index 43: http://dbpedia.org/resource/Thomas_Sturges_"Tom"_Watson
    at java.net.URI$Parser.fail(URI.java:2848)
    at java.net.URI$Parser.checkChars(URI.java:3021)
    at java.net.URI$Parser.parseHierarchical(URI.java:3105)
    at java.net.URI$Parser.parse(URI.java:3053)
    at java.net.URI.<init>(URI.java:588)
    at java.net.URI.create(URI.java:850)
    ... 20 more

Here is our NIF file: test_3_nowiki.txt. Here is the full log, written with start.sh > gerbil.log: gerbil.txt (the gerbil_data folder was deleted beforehand because of the problem with the missing cache folder, #192).

Maybe it isn't a problem with the code and there is a quick workaround?

Also, the annotators Babelfy, FOX, FRED, FREME NER, NERFGUN, xLisa-NER and xLisa-NGRAM failed with "The annotator caused too many single errors.". I can't say why; maybe there is a problem with our NIF file?

RicardoUsbeck commented 7 years ago

Hi,

Welcome! Did you try to load your model with Jena RDF (model.read(File file))? Or with which tool did you create your NIF file? You could also try the validator here first: https://github.com/NLP2RDF/software

Linux249 commented 7 years ago

Thanks for the link, I will try the validator. The NIF files are created with my own tool/script written in Python/rdflib.

MichaelRoeder commented 7 years ago

Regarding the first part of the issue: These messages can be ignored. Some URIs contain characters that are not allowed in URLs. The HTTP-based sameAs retrieval simply tries to dereference all URIs, which leads to these errors from time to time.
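To illustrate why the URI from the log is rejected, here is a minimal Python sketch (not GERBIL's code). It approximates the check using the path characters allowed by RFC 3986, which is an assumption: java.net.URI applies its own, slightly different rules. The helper names first_illegal_path_index and escape_path are hypothetical, and the sketch assumes URIs of the form scheme://host/path.

```python
import string
from urllib.parse import quote, urlsplit, urlunsplit

# Path characters allowed by RFC 3986 (unreserved, sub-delims, ":", "@", "/"),
# plus "%" so that existing percent-escapes are left untouched.
SAFE_PATH_CHARS = "-._~!$&'()*+,;=:@/%"
ALLOWED = set(string.ascii_letters + string.digits + SAFE_PATH_CHARS)


def first_illegal_path_index(uri):
    """Return the index (within the whole URI) of the first path character
    that is not allowed unescaped, or -1 if the path looks fine.

    Assumes a URI of the form scheme://host/path.
    """
    parts = urlsplit(uri)
    offset = len(parts.scheme) + len("://") + len(parts.netloc)
    for i, ch in enumerate(parts.path):
        if ch not in ALLOWED:
            return offset + i
    return -1


def escape_path(uri):
    """Percent-encode the characters in the path that are not allowed."""
    parts = urlsplit(uri)
    safe_path = quote(parts.path, safe=SAFE_PATH_CHARS)
    return urlunsplit(parts._replace(path=safe_path))


uri = 'http://dbpedia.org/resource/Thomas_Sturges_"Tom"_Watson'
print(first_illegal_path_index(uri))  # 43, the first double quote
print(escape_path(uri))
# http://dbpedia.org/resource/Thomas_Sturges_%22Tom%22_Watson
```

The reported index matches the "index 43" in the exception above. Whether the escaped form can actually be dereferenced depends on the server, so this only shows where the parse error comes from.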

Btw., you might want to use the index-based sameAs retrieval. It is much faster than dereferencing via HTTP and should remove most of these error messages.

Linux249 commented 7 years ago

@RicardoUsbeck: I tried the validator; the third approach works for me and there are no errors.

@MichaelRoeder: Thanks for the hint. I had to unzip the files (I don't know why start.sh didn't do this), and now the INFO messages from HTTPBasedSameAsRetriever are gone.

That solves the problem.

dice-group / gerbil

Problem with parsing dbPedia-urls (HTTPBasedSameAsRetriever) #193