dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

encoding of dbpedia uris #586

Open kurzum opened 5 years ago

kurzum commented 5 years ago

https://en.wikipedia.org/wiki/The_Ren_%26_Stimpy_Show is encoded as: https://dbpedia.org/resource/The_Ren_&_Stimpy_Show

check: curl http://dbpedia-mappings.tib.eu/release/mappings/mappingbased-literals/2019.06.01/mappingbased-literals_lang=en.ttl.bz2 | bzcat | cut -f1 -d '>' | grep '&'

on https://databus.dbpedia.org/marvin/mappings/mappingbased-literals/2019.06.01
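
A rough Python equivalent of the shell check above, as a sketch only. It keeps the text before the first `>` on each line (what `cut -f1 -d '>'` does) and reports subjects that contain a given character. The function name `subjects_containing` and the two sample lines are illustrative assumptions, not data taken from the release.

```python
from typing import Iterable, Iterator


def subjects_containing(lines: Iterable[str], needle: str) -> Iterator[str]:
    """Yield subject IRIs (without the leading '<') that contain `needle`."""
    for line in lines:
        # Same idea as `cut -f1 -d '>'`: keep everything before the first '>'.
        subject = line.split(">", 1)[0].lstrip()
        if subject.startswith("<") and needle in subject:
            yield subject[1:]


# Made-up sample lines in N-Triples form, for illustration only.
sample = [
    '<http://dbpedia.org/resource/The_Ren_&_Stimpy_Show> '
    '<http://xmlns.com/foaf/0.1/name> "The Ren & Stimpy Show"@en .',
    '<http://dbpedia.org/resource/Atlantic_Ocean> '
    '<http://xmlns.com/foaf/0.1/name> "Atlantic Ocean"@en .',
]

for iri in subjects_containing(sample, "&"):
    print(iri)
```

Only the subject is inspected, so an `&` inside a literal (as in the first sample line) does not produce a hit on its own.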

LorenzBuehmann commented 5 years ago

It turns out that you also created a bunch of other weird and, above all, illegal IRIs, for example in the instance triples, since I think \" isn't allowed in IRIs:

<http://dbpedia.org/resource/Mini__\"Mark_I\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Automobile> .
<http://dbpedia.org/resource/Mini__\"Mark_I\"__1__AutomobileEngine__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/AutomobileEngine> .

where the main resource is http://dbpedia.org/resource/Mini_(Mark_I), so I don't understand why the first triple has a __1 appended, which I thought was used to indicate some kind of event data?

http://dbpedia.org/resource/Mini__\"Mark_I\"__1 should be http://dbpedia.org/resource/Mini_(Mark_I)__1

There is also other data like

<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Sales> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.wikidata.org/entity/Q194189> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Situation> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Activity> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.ontologydesignpatterns.org/ont/d0.owl#Activity> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.wikidata.org/entity/Q1914636> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .

where I don't even know which source this data is derived from. At least it is not from the Wikipedia article https://en.wikipedia.org/wiki/N.EX.T, unless you tried to extract data from the discography table at the bottom left?

In general, I can see at least two kinds of error pattern: one for the automobiles and one for "Sales" activity data, apparently from music discography tables.

Anyway, I won't dig further into the data, but maybe you could run an RDF parser like Jena RIOT (riot --sink <file>) to catch these kinds of syntax errors before uploading/releasing new data?

This would also show you some further, less critical warnings about the IRIs, like

WARN riot :: [line: 9785, col: 1 ] Bad IRI: <http://dbpedia.org/resource/Ranma_½__film__1> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.

or problems with some of your literals (in the latest mapping-based-literals dataset) like

WARN riot :: [line: 5284, col: 85] Lexical form '-11' not valid for datatype XSD nonNegativeInteger
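
The two warnings above can be reproduced in isolation with a small Python sketch. It is not how RIOT implements its checks; `is_nfkc` and `is_non_negative_integer` are hypothetical helper names, and the regular expression is my reading of the XSD lexical space for nonNegativeInteger (an optional `+` before digits, or a `-` only in front of zero).

```python
import re
import unicodedata


def is_nfkc(text: str) -> bool:
    """True if `text` is unchanged by Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text) == text


# Assumed lexical space: "+" is optional, "-" is only allowed before zero.
_NON_NEGATIVE_INTEGER = re.compile(r"\+?[0-9]+|-0+")


def is_non_negative_integer(lexical: str) -> bool:
    """True if `lexical` looks like a valid xsd:nonNegativeInteger."""
    return _NON_NEGATIVE_INTEGER.fullmatch(lexical) is not None


# "\u00bd" is the vulgar fraction one half from the Ranma IRI above.
print(is_nfkc("http://dbpedia.org/resource/Ranma_\u00bd__film__1"))
print(is_non_negative_integer("-11"))
```

Both calls report a failure for the values quoted in the warnings: the fraction character has a compatibility decomposition, and `-11` carries a minus sign in front of a non-zero value.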

curl http://dbpedia-mappings.tib.eu/release/mappings/instance-types/2019.06.01/instance-types_lang=en.ttl.bz2 | bzcat | cut -f1 -d '>' | grep '__\\"'

Once I fixed the issues with \", I got another parser error:

ERROR riot :: [line: 1474505, col: 49] Illegal character in IRI (codepoint 0x60, '`'): http://dbpedia.org/resource/Dahlak_SC__Lisau_A[`]...

The offending line:

<http://dbpedia.org/resource/Dahlak_SC__Lisau_A`ruda__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/SportsTeamMember> .

Note the ` character in the subject IRI.
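
A minimal sketch of a character-level IRI check, based on the IRIREF production of the N-Triples/Turtle grammar, which excludes U+0000 to U+0020 and the characters `< > " { } | ^` plus the backtick and the backslash. The helper name `illegal_iri_chars` is hypothetical, and the check assumes the IRI has already been taken out of its angle brackets with no `\u` escapes left in it.

```python
import re
from typing import List

# Characters excluded by the IRIREF production:
# U+0000-U+0020 and < > " { } | ^ ` \
_ILLEGAL = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def illegal_iri_chars(iri: str) -> List[str]:
    """Return the distinct illegal characters in `iri`, in order of first appearance."""
    seen: List[str] = []
    for ch in _ILLEGAL.findall(iri):
        if ch not in seen:
            seen.append(ch)
    return seen


# IRIs quoted in this thread (raw strings keep the backslashes as-is).
examples = [
    r'http://dbpedia.org/resource/Mini__\"Mark_I\"__1',
    'http://dbpedia.org/resource/Dahlak_SC__Lisau_A`ruda__1',
    'http://dbpedia.org/resource/Mini_(Mark_I)',
]

for iri in examples:
    print(iri, illegal_iri_chars(iri))
```

The first two examples are flagged for the backslash, the double quote, and the backtick; the parenthesized form passes, because parentheses are not excluded by the grammar.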

LorenzBuehmann commented 5 years ago

In mappingbased-literals_en.ttl the line 2430526

<http://dbpedia.org/resource/Kerala_Agricultural_University> <http://xmlns.com/foaf/0.1/name> "\@en .

is broken as well. Look at the object ...

LorenzBuehmann commented 5 years ago

... and in mappingbased-objects-uncleaned_en you have thousands of triples with whitespace in the URI:

grep '\w\s\w' mappingbased-objects-uncleaned_en.ttl

For example, line 293:

<http://dbpedia.org/resource/Atlantic_Ocean> <http://xmlns.com/foaf/0.1/depiction> <http://en.wikipedia.org/wiki/Special:FilePath/Atlantic Ocean location map.svg> .
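
One possible way to make such an IRI syntactically valid is to percent-encode the file name before appending it to the `Special:FilePath/` prefix. This is only a sketch under that assumption, not the fix used by the extraction framework; `file_path_iri` is a hypothetical helper. Replacing spaces with underscores, as the resource IRIs in this thread do, would be another option.

```python
from urllib.parse import quote

BASE = "http://en.wikipedia.org/wiki/Special:FilePath/"


def file_path_iri(file_name: str) -> str:
    """Build a FilePath IRI with the file name percent-encoded."""
    # safe="" also encodes "/", so the name stays a single path segment.
    return BASE + quote(file_name, safe="")


print(file_path_iri("Atlantic Ocean location map.svg"))
# http://en.wikipedia.org/wiki/Special:FilePath/Atlantic%20Ocean%20location%20map.svg
```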

kurzum commented 5 years ago

@LorenzBuehmann we tried debugging this with RDF parsers, i.e. Sansa, but then we coded a more low-level validation: https://forum.dbpedia.org/t/new-ci-tests-on-dbpedia-releases/77/3

Here is a summary:

added

https://en.wikipedia.org/wiki/Special:Export/N.EX.T
https://en.wikipedia.org/wiki/Special:Export/Dahlak_SC
https://en.wikipedia.org/wiki/Special:Export/Atlantic_Ocean
https://en.wikipedia.org/wiki/Special:Export/Kerala_Agricultural_University
https://en.wikipedia.org/wiki/Special:Export/Mini_(Mark_I)
https://en.wikipedia.org/wiki/Special:Export/Ranma_½

to https://github.com/dbpedia/extraction-framework/blob/master/dump/src/test/bash/uris.lst for minidump testing.

It is still work in progress:

  1. https://en.wikipedia.org/wiki/Special:Export/N.EX.T -> errors are recognized:

    10:05:24 | ERROR | ValidationExecutor$testIri | http://dbpedia.org/resource/N.EX.T__\"Here_I_Stand_For_You\"__1 contains bad sequence \
    10:05:24 | ERROR | ValidationExecutor$testIri | http://dbpedia.org/resource/N.EX.T__\"Here_I_Stand_For_You\"__1 contains bad sequence "
    10:05:24 | ERROR | ValidationExecutor$testIri | http://dbpedia.org/resource/N.EX.T__\"Here_I_Stand_For_You\"__1 contains bad sequence \
    10:05:24 | ERROR | ValidationExecutor$testIri | http://dbpedia.org/resource/N.EX.T__\"Here_I_Stand_For_You\"__1 contains bad sequence "
    Cov_s: 1.0 ( 18 triggered of 18 total ), Success_rate_s: 0.8888889 ( 16 )
    Cov_p: 0.9577465 ( 68 triggered of 71 total ), Success_rate_p: 1.0 ( 68 )
    Cov_o: 0.92890996 ( 196 triggered of 211 total ), Success_rate_o: 1.0 ( 196 )
    Cov:   0.9622188

    so we can look into it

  2. https://en.wikipedia.org/wiki/Special:Export/Dahlak_SC
    https://en.wikipedia.org/wiki/Special:Export/Mini_(Mark_I)

    seems like no __ data is extracted in our minidump, checking....

  3. <http://dbpedia.org/resource/Atlantic_Ocean> <http://xmlns.com/foaf/0.1/depiction> <http://en.wikipedia.org/wiki/Special:FilePath/Atlantic Ocean location map.svg> .

    @Vehnem not sure why this trigger is not working; the error is not showing:

    # TODO trigger does not seem to work
    trigger:wikipedia
        a v:RDF_IRI_Trigger ;
        trigger:pattern "^http://en.wikipedia.org/wiki/" .

    <#wikipedia_IRIs>
        a v:TestGenerator ;
        v:trigger trigger:wikipedia ;
        # same as dbpedia
        v:validator validator:dissallowed_chars ;
        v:validator validator:dbpedia_resource_delims .

  4. https://en.wikipedia.org/wiki/Special:Export/Kerala_Agricultural_University

    foaf:name is now properly encoded:

    curl http://dbpedia-mappings.tib.eu/release/mappings/mappingbased-literals/2019.08.01/mappingbased-literals_lang=en.ttl.bz2 | bzcat | grep 'Kerala_Agricul' | grep foaf
    <http://dbpedia.org/resource/Kerala_Agricultural_University> <http://xmlns.com/foaf/0.1/name> "Kerala Agricultural University (KAU)"@en .
    <http://dbpedia.org/resource/Kerala_Agricultural_University> <http://xmlns.com/foaf/0.1/name> "\"@en .

    Issues are: it is no longer produced by the minidump, and `\\` is valid but doesn't make sense.

  5. https://en.wikipedia.org/wiki/Special:Export/Ranma_½

    In addition to the low-level testing, a Sansa-Stack parser pass is planned as well, which should cover the Unicode NFC issue.
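
For item 3 above, the intended trigger-then-validate behaviour can be sketched in a few lines of Python. This is not the framework's validation code: `validate` and the `DISALLOWED` set are hypothetical stand-ins for the configured validators, and only the pattern string and the "contains bad sequence" wording are taken from the thread. Under Python's `re` semantics the pattern does match the depiction IRI, so the sketch shows what the test would be expected to report, not why the real trigger stays silent.

```python
import re
from typing import List

# Pattern string from the trigger configuration in item 3.
TRIGGER_PATTERN = re.compile(r"^http://en.wikipedia.org/wiki/")

# Hypothetical stand-in for the disallowed-characters validator.
DISALLOWED = set('<>"{}|^`\\ ')


def validate(iri: str) -> List[str]:
    """Return error messages for `iri`; empty if the trigger does not fire."""
    if TRIGGER_PATTERN.search(iri) is None:
        return []
    bad = sorted(set(iri) & DISALLOWED)
    return [f"{iri} contains bad sequence {ch!r}" for ch in bad]


depiction = ("http://en.wikipedia.org/wiki/Special:FilePath/"
             "Atlantic Ocean location map.svg")
for message in validate(depiction):
    print(message)
```

A DBpedia resource IRI does not start with the Wikipedia prefix, so this particular trigger skips it and the function returns an empty list.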

@LorenzBuehmann thanks for the feedback. It is not fixed yet, but at least we now have a way to record these systematically.

kurzum commented 5 years ago

@LorenzBuehmann I fixed 1. the " problem:

Cov_s: 1.0 ( 18 triggered of 18 total ), Success_rate_s: 1.0 ( 18 )
Cov_p: 0.9577465 ( 68 triggered of 71 total ), Success_rate_p: 1.0 ( 68 )
Cov_o: 0.92890996 ( 196 triggered of 211 total ), Success_rate_o: 1.0 ( 196 )
Cov:   0.9622188
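
The log does not define these figures, but the numbers are consistent with a simple reading: each `Cov_*` value is triggered divided by total, the success rate is successful divided by triggered, and the overall `Cov` is the unweighted mean of the three coverage values. The sketch below checks that arithmetic. The interpretation is an assumption, and `rate` is a hypothetical helper.

```python
def rate(part: int, total: int) -> float:
    """Fraction of `part` in `total`."""
    return part / total


# Counts copied from the log output in this thread.
cov_s = rate(18, 18)    # subjects:   18 triggered of 18 total
cov_p = rate(68, 71)    # predicates: 68 triggered of 71 total
cov_o = rate(196, 211)  # objects:    196 triggered of 211 total

# Assumed: overall coverage is the plain mean of the three values.
cov = (cov_s + cov_p + cov_o) / 3

# Subject success rate before the fix (16 of 18) and after it (18 of 18).
success_s_before = rate(16, 18)
success_s_after = rate(18, 18)

print(f"Cov_p={cov_p:.7f} Cov_o={cov_o:.7f} Cov={cov:.7f}")
print(f"Success_rate_s before={success_s_before:.7f} after={success_s_after:.1f}")
```

Small differences in the last digit against the log are expected, since the logged values appear to be printed with single precision.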

kuzeko commented 4 years ago

I've downloaded the files

https://downloads.dbpedia.org/repo/dbpedia/generic/categories/2020.03.01/categories_lang=en_skos.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/categories/2020.03.01/categories_lang=en_labels.ttl.bz2

And there are IRIs containing an unencoded space: '32'