dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

DBpedia NIF: invalid IRI escape (\n) #617

Open m1ci opened 4 years ago

m1ci commented 4 years ago

While converting nif-text-links_lang=en.ttl from RDF to HDT using https://github.com/rdfhdt/hdt-cpp/tree/develop/libhdt I get the following error:

error: /data/milan/nif-text-links_lang=en.ttl:7388119:282: invalid IRI escape

nif-text-links_lang=en.ttl comes from https://databus.dbpedia.org/marvin/text/nif-text-links/ version 2020.02.01

The problem is the \n in the following triple:

<http://dbpedia.org/resource/Gospel_of_Matthew?dbpv=2020-02&nif=phrase&char=19391,19436> <http://www.w3.org/2005/11/its/rdf#taIdentRef> <https://web.archive.org/web/20150923184503/http://www.biblicalwritings.com/the-oxford-dictionary-of-the-christian-church/%3Falfa=M&word=Matthew,\nGospel+acc.+to+St.+> .

... right before Gospel+acc.+to+

JJ-Author commented 4 years ago

confirmed

ERR@284 Illegal unicode escape sequence value: \n (0x6E) using http://akswnc7.informatik.uni-leipzig.de:8088/

@m1ci have you tried the files under https://databus.dbpedia.org/dbpedia/text/nif-text-links/ instead? Those should be parsed. So the issue is valid, but this triple should be excluded from the DBpedia release.

m1ci commented 4 years ago

@JJ-Author yup, looked at https://databus.dbpedia.org/dbpedia/text/nif-text-links/ but the English partition nif-text-links_lang=en.ttl has not been published (not found under dbpedia but only under marvin). Probably something has interrupted the process.

m1ci commented 4 years ago

@JJ-Author the English partition is now published and can be found on the Databus. So all good now.

As for the invalid IRI escape (\n) @Vehnem can we write a construct validation test for this so that we avoid such problems in the future?
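
A rough sketch of what such a check could look like, written in Python purely for illustration (it is not the extraction framework's actual validation code, and the function name `invalid_iri_escapes` is hypothetical). It relies on the N-Triples/Turtle grammar rule that the only backslash escapes allowed inside `<...>` are `\uXXXX` and `\UXXXXXXXX`, so a `\n` inside an IRI is always an error. It works line by line with a regular expression rather than a full parser, so text that looks like `<...>` inside a string literal may be reported as a false positive.

```python
import re

# Anything between '<' and '>' that contains no raw whitespace/control
# characters or the other characters forbidden in an IRIREF.
# Backslashes are deliberately let through so they can be inspected below.
IRIREF = re.compile(r'<([^<>"{}|^`\x00-\x20]*)>')

# The only escapes permitted inside an IRI: \uXXXX and \UXXXXXXXX.
VALID_ESCAPE = re.compile(r'\\(?:u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})')


def invalid_iri_escapes(line):
    """Return (column, escape) pairs for each invalid backslash escape
    found inside an IRI on this line. Columns are 1-based."""
    problems = []
    for iri in IRIREF.finditer(line):
        body = iri.group(1)
        pos = 0
        while True:
            idx = body.find('\\', pos)
            if idx == -1:
                break
            valid = VALID_ESCAPE.match(body, idx)
            if valid:
                # Skip over the whole \uXXXX or \UXXXXXXXX sequence.
                pos = valid.end()
            else:
                column = iri.start(1) + idx + 1
                problems.append((column, body[idx:idx + 2]))
                pos = idx + 1
    return problems


def scan_file(path):
    """Yield (line_number, column, escape) for every problem in a file."""
    with open(path, encoding='utf-8') as handle:
        for line_number, line in enumerate(handle, start=1):
            for column, escape in invalid_iri_escapes(line):
                yield line_number, column, escape
```

Running `scan_file` over a dump such as nif-text-links_lang=en.ttl would list the lines to drop or repair before publishing. The reported columns are not guaranteed to match the ones printed by hdt-cpp or other parsers.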