Open kurzum opened 5 years ago
It turns out that you also created a bunch of other weird and, more importantly, illegal IRIs, for example in the instance-type triples. As far as I know, \" isn't allowed in IRIs:
<http://dbpedia.org/resource/Mini__\"Mark_I\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Automobile> .
<http://dbpedia.org/resource/Mini__\"Mark_I\"__1__AutomobileEngine__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/AutomobileEngine> .
Here the main resource is http://dbpedia.org/resource/Mini_(Mark_I), so I don't understand why the first triple has a __1 appended, which I thought was used to indicate some kind of event data.
http://dbpedia.org/resource/Mini__\"Mark_I\"__1 should be http://dbpedia.org/resource/Mini_(Mark_I)__1
There is also other data like
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Sales> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.wikidata.org/entity/Q194189> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Situation> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Activity> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.ontologydesignpatterns.org/ont/d0.owl#Activity> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.wikidata.org/entity/Q1914636> .
<http://dbpedia.org/resource/N.EX.T__\"I_Want_It_All_Demo_0.7_\"__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
where I don't even know which source this data is derived from. It is at least not from the Wikipedia article https://en.wikipedia.org/wiki/N.EX.T, unless you tried to extract data from the discography table on the bottom left?
In general, I can see at least two kinds of error patterns: one for the automobiles and one for "Sales" activity data, apparently from music discography tables.
Anyway, I won't dig further into the data, but maybe you could run an RDF parser like Jena RIOT (riot --sink <file>) to catch this kind of syntax error before uploading/releasing new data?
This would also show you some additional, less critical warnings about the IRIs, like
WARN riot :: [line: 9785, col: 1 ] Bad IRI: <http://dbpedia.org/resource/Ranma_½__film__1> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
or problems with some of your literals (in the latest mapping-based-literals dataset) like
WARN riot :: [line: 5284, col: 85] Lexical form '-11' not valid for datatype XSD nonNegativeInteger
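A warning like the one above can also be caught with a small lexical check before release. The following is a minimal sketch in Python, assuming the XSD rule that a nonNegativeInteger is an optionally signed run of decimal digits whose value is at least zero. The function name is hypothetical and whitespace handling is left out.

```python
import re

# Optional sign followed by decimal digits (lexical space of xsd:integer).
_INTEGER = re.compile(r"[+-]?[0-9]+")

def is_valid_non_negative_integer(lexical: str) -> bool:
    """Return True if `lexical` is an integer form whose value is >= 0."""
    if not _INTEGER.fullmatch(lexical):
        return False
    # "-0" is accepted because its value is zero; "-11" is rejected.
    return int(lexical) >= 0
```

Calling `is_valid_non_negative_integer("-11")` returns False, which corresponds to the lexical form riot complains about.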
curl http://dbpedia-mappings.tib.eu/release/mappings/instance-types/2019.06.01/instance-types_lang=en.ttl.bz2 | bzcat | cut -f1 -d '>' | grep '__\\"'
Once I fixed the issues with \", I got another parser error:
ERROR riot :: [line: 1474505, col: 49] Illegal character in IRI (codepoint 0x60, '`'): http://dbpedia.org/resource/Dahlak_SC__Lisau_A[`]...
The offending line:
<http://dbpedia.org/resource/Dahlak_SC__Lisau_A`ruda__1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/SportsTeamMember> .
Note the ` character in the subject IRI.
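Both the \" case and the ` case fall into the set of characters that the N-Triples grammar excludes from IRIREF: U+0000 to U+0020 and < > " { } | ^ ` \. A rough sketch of a check for these characters, in Python, could look like this. The function name is hypothetical, and as a simplification it flags every backslash, so legitimate \uXXXX escapes would be reported as well.

```python
import re

# Characters excluded from IRIREF in the N-Triples grammar:
# control characters and space (U+0000-U+0020) plus < > " { } | ^ ` \
_ILLEGAL = re.compile(r'[\x00-\x20<>"{}|^`\\]')

def illegal_iri_chars(iri: str) -> list:
    """Return (position, character) pairs for every disallowed character."""
    return [(m.start(), m.group()) for m in _ILLEGAL.finditer(iri)]

# Examples taken from the triples quoted in this thread.
backtick_iri = "http://dbpedia.org/resource/Dahlak_SC__Lisau_A`ruda__1"
quoted_iri = 'http://dbpedia.org/resource/Mini__\\"Mark_I\\"__1'
clean_iri = "http://dbpedia.org/resource/Mini_(Mark_I)"
```

For `quoted_iri` the function reports four hits (backslash, quote, backslash, quote), while `clean_iri` yields an empty list because parentheses are allowed.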
In mappingbased-literals_en.ttl, line 2430526
<http://dbpedia.org/resource/Kerala_Agricultural_University> <http://xmlns.com/foaf/0.1/name> "\@en .
is broken as well. Look at the object ...
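To make the problem with the object concrete, here is a sketch of a validator for quoted N-Triples literals with an optional language tag. It follows the STRING_LITERAL_QUOTE, ECHAR and UCHAR productions of the N-Triples grammar. Datatyped literals (^^<...>) are deliberately not handled, and the function name is an assumption for illustration only.

```python
import re

_LITERAL = re.compile(r'''
    "                              # opening quote
    (?: [^"\\\n\r]                 # any char except quote, backslash, LF, CR
      | \\[tbnrf"'\\]              # ECHAR: \t \b \n \r \f \" \' \\
      | \\u[0-9A-Fa-f]{4}          # UCHAR with 4 hex digits
      | \\U[0-9A-Fa-f]{8}          # UCHAR with 8 hex digits
    )*
    "                              # closing quote
    (?: @[a-zA-Z]+(?:-[a-zA-Z0-9]+)* )?   # optional language tag
''', re.VERBOSE)

def is_valid_plain_literal(term: str) -> bool:
    """Return True if `term` is a well-formed quoted literal (no datatype)."""
    return _LITERAL.fullmatch(term) is not None

# The object from the broken line: a quote, a backslash, then @en.
broken = '"\\@en'
# A literal whose content is a single escaped backslash.
escaped_backslash = '"\\\\"@en'
```

Here `broken` is rejected because the string is never closed and `\@` is not a valid escape, whereas `escaped_backslash` is syntactically fine.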
... and in mappingbased-objects-uncleaned_en you have thousands of triples with whitespace in the URI:
grep '\w\s\w' mappingbased-objects-uncleaned_en.ttl
For example line 293
<http://dbpedia.org/resource/Atlantic_Ocean> <http://xmlns.com/foaf/0.1/depiction> <http://en.wikipedia.org/wiki/Special:FilePath/Atlantic Ocean location map.svg> .
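The grep above works for a file that only contains IRI objects, but it would also match ordinary literals that contain spaces. A slightly narrower sketch in Python looks only inside angle brackets. It is an approximation: a literal that itself contains text like `<a b>` would be a false positive, and the function name is hypothetical.

```python
import re

# An angle-bracketed term that contains at least one whitespace character.
_IRI_WITH_SPACE = re.compile(r"<[^<>]*\s[^<>]*>")

def iris_with_whitespace(line: str) -> list:
    """Return all <...> terms in an N-Triples line that contain whitespace."""
    return _IRI_WITH_SPACE.findall(line)

bad_line = (
    "<http://dbpedia.org/resource/Atlantic_Ocean> "
    "<http://xmlns.com/foaf/0.1/depiction> "
    "<http://en.wikipedia.org/wiki/Special:FilePath/Atlantic Ocean location map.svg> ."
)
```

Applied to `bad_line`, only the object IRI is returned. Subject and predicate are left alone.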
@LorenzBuehmann we tried debugging this with RDF parsers, i.e. Sansa, but then we coded a more low-level validation: https://forum.dbpedia.org/t/new-ci-tests-on-dbpedia-releases/77/3
Here is a summary. We added
https://en.wikipedia.org/wiki/Special:Export/N.EX.T
https://en.wikipedia.org/wiki/Special:Export/Dahlak_SC
https://en.wikipedia.org/wiki/Special:Export/Atlantic_Ocean
https://en.wikipedia.org/wiki/Special:Export/Kerala_Agricultural_University
https://en.wikipedia.org/wiki/Special:Export/Mini_(Mark_I)
https://en.wikipedia.org/wiki/Special:Export/Ranma_½
to https://github.com/dbpedia/extraction-framework/blob/master/dump/src/test/bash/uris.lst for minidump testing.
It is still work in progress:
https://en.wikipedia.org/wiki/Special:Export/N.EX.T
-> errors are recognized:
10:05:24 | ERROR | ValidationExecutor$testIri | http://dbpedia.org/resource/N.EX.T__\"Here_I_Stand_For_You\"__1 contains bad sequence \
10:05:24 | ERROR | ValidationExecutor$testIri | http://dbpedia.org/resource/N.EX.T__\"Here_I_Stand_For_You\"__1 contains bad sequence "
10:05:24 | ERROR | ValidationExecutor$testIri | http://dbpedia.org/resource/N.EX.T__\"Here_I_Stand_For_You\"__1 contains bad sequence \
10:05:24 | ERROR | ValidationExecutor$testIri | http://dbpedia.org/resource/N.EX.T__\"Here_I_Stand_For_You\"__1 contains bad sequence "
Cov_s: 1.0 ( 18 triggered of 18 total ), Success_rate_s: 0.8888889 ( 16 )
Cov_p: 0.9577465 ( 68 triggered of 71 total ), Success_rate_p: 1.0 ( 68 )
Cov_o: 0.92890996 ( 196 triggered of 211 total ), Success_rate_o: 1.0 ( 196 )
Cov: 0.9622188
so we can look into it
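For readers of the log, the numbers above are consistent with the following arithmetic: coverage is triggered divided by total, success rate is passed divided by triggered, and the overall Cov is the mean of the three coverages. This is inferred from the printed values, not taken from the framework's source, so treat the formulas and variable names as assumptions.

```python
# (total, triggered, passed) per position, copied from the log output.
counts = {
    "s": (18, 18, 16),
    "p": (71, 68, 68),
    "o": (211, 196, 196),
}

def ratio(part: int, whole: int) -> float:
    """Plain division that returns 0.0 for an empty denominator."""
    return part / whole if whole else 0.0

# Share of distinct values that matched at least one trigger.
coverage = {k: ratio(trig, total) for k, (total, trig, _) in counts.items()}
# Share of triggered values that passed all validators.
success_rate = {k: ratio(ok, trig) for k, (_, trig, ok) in counts.items()}
# Overall coverage as the unweighted mean over subject, predicate, object.
overall = sum(coverage.values()) / len(coverage)

for key in counts:
    print(f"Cov_{key}: {coverage[key]:.7f}  Success_rate_{key}: {success_rate[key]:.7f}")
print(f"Cov: {overall:.7f}")
```

With the subject counts, 16 of 18 gives roughly 0.8888889, which matches the reported Success_rate_s.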
https://en.wikipedia.org/wiki/Special:Export/Dahlak_SC
https://en.wikipedia.org/wiki/Special:Export/Mini_(Mark_I)
It seems no __ data is extracted in our minidump; checking...
<http://dbpedia.org/resource/Atlantic_Ocean> <http://xmlns.com/foaf/0.1/depiction> <http://en.wikipedia.org/wiki/Special:FilePath/Atlantic Ocean location map.svg> .
@Vehnem I'm not sure why this trigger is not working; the error is not showing:
# TODO trigger does not seem to work
trigger:wikipedia
a v:RDF_IRI_Trigger ;
trigger:pattern "^http://en.wikipedia.org/wiki/" .
<#wikipedia_IRIs> a v:TestGenerator ; v:trigger trigger:wikipedia ;
v:validator validator:dissallowed_chars ; v:validator validator:dbpedia_resource_delims .
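As a quick sanity check of the pattern in isolation, the regular expression can be tried against the Atlantic Ocean depiction IRI. This Python sketch says nothing about how the validation framework actually evaluates triggers; it only shows that the pattern itself matches the IRI when applied with ordinary regex search semantics. The helper name is hypothetical.

```python
import re

# Pattern copied from the trigger definition above.
TRIGGER_PATTERN = r"^http://en.wikipedia.org/wiki/"

def trigger_matches(iri: str) -> bool:
    """Return True if the pattern matches at the start of the IRI."""
    return re.search(TRIGGER_PATTERN, iri) is not None

depiction = (
    "http://en.wikipedia.org/wiki/Special:FilePath/"
    "Atlantic Ocean location map.svg"
)
```

Since `trigger_matches(depiction)` is true under these assumptions, the cause may lie in how the IRI reaches the trigger rather than in the pattern.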
4. `https://en.wikipedia.org/wiki/Special:Export/Kerala_Agricultural_University`
foaf:name is now properly encoded
curl http://dbpedia-mappings.tib.eu/release/mappings/mappingbased-literals/2019.08.01/mappingbased-literals_lang=en.ttl.bz2 | bzcat | grep 'Kerala_Agricul' | grep foaf
<http://dbpedia.org/resource/Kerala_Agricultural_University> <http://xmlns.com/foaf/0.1/name> "Kerala Agricultural University (KAU)"@en .
<http://dbpedia.org/resource/Kerala_Agricultural_University> <http://xmlns.com/foaf/0.1/name> "\"@en .
Remaining issues: it is no longer produced by the minidump, and `\\` is valid but doesn't make sense as a name.
5. `https://en.wikipedia.org/wiki/Special:Export/Ranma_½`
In addition to the low-level testing there is a Sansa-Stack parser pass planned as well, which should cover the Unicode NFC issue.
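Until such a parser pass exists, a normalization check is easy to sketch with Python's standard library. Note that the riot warning quoted earlier in this thread is about NFKC, and that ½ (U+00BD) is already in NFC, so a check for NFC alone would not flag that IRI. The helper name is hypothetical.

```python
import unicodedata

def is_normalized(form: str, text: str) -> bool:
    """Return True if `text` is unchanged by the given normalization form."""
    return unicodedata.normalize(form, text) == text

iri = "http://dbpedia.org/resource/Ranma_½__film__1"

nfc_ok = is_normalized("NFC", iri)    # ½ has no canonical decomposition
nfkc_ok = is_normalized("NFKC", iri)  # ½ has a compatibility decomposition
```

Under NFKC the ½ character is replaced by the three characters 1, fraction slash (U+2044) and 2, which is why riot reports NOT_NFKC for this IRI.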
@LorenzBuehmann thanks for the feedback. It is not yet fixed, but we have a way to record these at least systematically now.
@LorenzBuehmann I fixed 1. the " problem:
Cov_s: 1.0 ( 18 triggered of 18 total ), Success_rate_s: 1.0 ( 18 )
Cov_p: 0.9577465 ( 68 triggered of 71 total ), Success_rate_p: 1.0 ( 68 )
Cov_o: 0.92890996 ( 196 triggered of 211 total ), Success_rate_o: 1.0 ( 196 )
Cov: 0.9622188
I've downloaded the files
https://downloads.dbpedia.org/repo/dbpedia/generic/categories/2020.03.01/categories_lang=en_skos.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/categories/2020.03.01/categories_lang=en_labels.ttl.bz2
And there are IRIs containing an unencoded space (character code 32).
https://en.wikipedia.org/wiki/The_Ren_%26_Stimpy_Show is encoded as: https://dbpedia.org/resource/The_Ren_&_Stimpy_Show
check:
curl http://dbpedia-mappings.tib.eu/release/mappings/mappingbased-literals/2019.06.01/mappingbased-literals_lang=en.ttl.bz2 | bzcat | cut -f1 -d '>' | grep '&'
on https://databus.dbpedia.org/marvin/mappings/mappingbased-literals/2019.06.01
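The relation between the two forms can be reproduced with plain percent-decoding, shown here as a small Python sketch. As far as generic IRI syntax goes, & is one of the sub-delimiters that RFC 3987 permits in a path segment, so the decoded form is not a syntax error in itself. Whether decoding is the intended behaviour for DBpedia resource IRIs is not something this example can answer. The function name and namespace constant are assumptions for illustration.

```python
from urllib.parse import unquote

# Assumed namespace prefix, taken from the example in this comment.
RESOURCE_NS = "https://dbpedia.org/resource/"

def resource_iri_from_title(encoded_title: str) -> str:
    """Percent-decode a Wikipedia page title and prepend the namespace."""
    return RESOURCE_NS + unquote(encoded_title)

example = resource_iri_from_title("The_Ren_%26_Stimpy_Show")
```

Here `example` equals the DBpedia IRI quoted above, with %26 decoded to &.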