Open mubashar1199 opened 3 years ago
The rdfs:range of dbo:parent is not a literal but dbo:Person, so it is correct / expected behaviour that this property never returns a string literal. But indeed this incorrect triple (caused by a very strange Wikipedia link) is supposed to be filtered out in a post-processing step, since dbr:Burmese_name is not supposed to be a Person (see step 2 here: https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/2021.06.01 ). However, this filtering does not seem to have worked (https://databus.dbpedia.org/yasgui/#query=SELECT+*+%7B%0A++SERVICE+%3Cx-binsearch%3Avfs%3Ahttp4s%3A%2F%2Fdatabus.dbpedia.org%2Fdbpedia%2Fmappings%2Fmappingbased-objects%2F2021.06.01%2Fmappingbased-objects_lang%3Den.ttl.bz2%3E+%0A++%7B+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FAung_Lwin%3E+%3Fp+%3Fo+%7D%0A%7D&contentTypeConstruct=text%2Fturtle&contentTypeSelect=application%2Fsparql-results%2Bjson&endpoint=http%3A%2F%2Fmclient.aksw.org%2Fsparql&requestMethod=POST&tabTitle=Query+1&headers=%7B%7D&outputFormat=table), so that would be worth further investigation.
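The intended post-processing described above can be sketched roughly as follows. This is a naive exact-match variant for illustration only; the type map and triples are assumptions, and the real check is disjointness-based (which, as discussed below, is exactly why objects typed only as owl:Thing slip through).

```python
# Illustrative sketch of the range-based post-processing filter described
# above, using a naive exact-match policy. The type map and triples are
# assumptions for demonstration; the real check is disjointness-based.
RANGES = {"dbo:parent": "dbo:Person", "dbo:spouse": "dbo:Person"}
TYPES = {"dbr:Burmese_name": "owl:Thing"}  # dbr:Burmese_name is not a Person

def keep(s, p, o):
    expected = RANGES.get(p)
    # Keep the triple unless the object is typed and its type
    # differs from the property's rdfs:range.
    return expected is None or TYPES.get(o, expected) == expected

triples = [
    ("dbr:Aung_Lwin", "dbo:parent", "dbr:Burmese_name"),  # should be dropped
    ("dbr:Aung_Lwin", "dbo:birthPlace", "dbr:Burma"),     # no checked range here
]
surviving = [t for t in triples if keep(*t)]
print(surviving)  # only the dbo:birthPlace triple survives
```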
Yes, it should be filtered out in the post processing, but I think there is also an issue in the extraction process. Even when there is a Person in the property value, it still trims the remaining part of the property value as soon as it finds some resource in it. Please see the Wikipedia and DBpedia profiles of Jigme Dorji Wangchuck: in DBpedia it only stores Ashi, which is the title of the parent and spouse, and not the actual person.
Hm indeed I think 2 triples should be extracted.
Just to make sure, I had a look at the post processing with this pretty neat SPARQL query ))+AS+%3Ffile)%0A++OPTIONAL+%7B%0A++++SERVICE+SILENT+%3Ffile+%7B%0A++++++%7B+SELECT+*+%7B%0A++++++++%3Fs+%3Fp+%3Fo%0A++++++++FILTER+(%3Fs+%3D+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FJigme_Dorji_Wangchuck%3E)%0A++++++%7D+LIMIT+50+%7D%0A++++%7D%0A++%7D%0A%7D&contentTypeConstruct=text%2Fturtle&contentTypeSelect=application%2Fsparql-results%2Bjson&endpoint=http%3A%2F%2Fmclient.aksw.org%2Fsparql&requestMethod=POST&tabTitle=Check+Post+Processing&headers=%7B%7D&outputFormat=table) to check whether one of the triples was pruned, which does not seem to be the case.
I made the following change in the DataParserConfig.scala file:

```scala
// Previous regex
val splitPropertyNodeRegexObject = Map(
  "en" -> """<br\s\/?>|\n| and | or | in |/|;|,"""
)

// New regex
val splitPropertyNodeRegexObject = Map(
  "en" -> """<br\s\/?>|\n| and | or | in |/|;|''|,"""
)
```
and now the required triples are also extracted for Jigme Dorji Wangchuck and other entities. Result:
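The effect of adding `''` as a separator can be illustrated with a quick sketch; the regex patterns mirror the ones above, but the sample infobox value is hypothetical:

```python
import re

# Split regexes mirroring DataParserConfig.scala (English chapter).
# The sample value below is hypothetical, for illustration only.
OLD_SPLIT = r"<br\s/?>|\n| and | or | in |/|;|,"
NEW_SPLIT = r"<br\s/?>|\n| and | or | in |/|;|''|,"

value = "''Ashi'' Phuntsho Choden"  # italicised title before the actual name

old_parts = [p.strip() for p in re.split(OLD_SPLIT, value) if p.strip()]
new_parts = [p.strip() for p in re.split(NEW_SPLIT, value) if p.strip()]

print(old_parts)  # the italic markup stays glued to the name
print(new_parts)  # the title is split off from the actual name
```

With the old regex the whole string stays one node, so the title and name are never separated; with `''` added, the title becomes its own node and the actual person's name survives as a separate value.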
Nice, the example result looks good to me. I wonder why this is language specific; is there a need to fix this for other languages as well? However, this is a very critical change (it can affect a lot of resources). @Vehnem @jlareck @kurzum is this something we need a large-scale evaluation for, or is the new strategy to include it and decide after one monthly extraction?
I only found instances in English where single quotation marks '' are used as separators when writing a title as a prefix or suffix around the actual name, therefore I think currently only the English chapter needs these changes.
I have checked the post processing file typeconsistencycheck.scala. Currently, non-disjoint triples are also stored in the regular set, even when there is a domain/range violation. Due to this, many instances have spouse and other family relations that are Agents and not Persons. As this is clearly a range violation, these should be stored in a separate dataset and not the regular one. Please see the entity Benjamin_Bates_IV: it has Bates_family as spouse and parent, which is an Agent and not a Person (the range of spouse and parent). There are a lot of examples like this. What do you think about it? Please correct me if I have understood something wrong.
Also please answer this https://forum.dbpedia.org/t/dbpedia-post-processing-for-adhoc-extraction/1412
The relevant post-processing step is called here: https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/-/blob/master/functions.sh#L62. Yeah, I see the problem for Bates_family. I think the underlying problem that led to this potentially confusing decision is that sometimes types are not specific enough when extracted (e.g. sometimes Actors are typed only as Persons instead of Actors). So filtering out these triples with an "exact match" for domain/range could have many false positives and lead to many useful triples being filtered out. But I think it makes sense to make progress here and create more fine-grained files, not only for the disjoint cases. Maybe it makes sense to have one file per distance in the class hierarchy tree (so distance zero means an exact match, 1 means the entity is typed only with the parent class, and so on). In the end we can still decide whether we load them, but at least they are separated. Potentially more important, though, is to also cover the case where the entity does not actually have a type. That is the case for Burmese_name, which only has owl:Thing, see here ))+AS+%3Ffile)%0A++OPTIONAL+%7B%0A++++SERVICE+SILENT+%3Ffile+%7B%0A++++++%7B+SELECT+*+%7B%0A++++++++%3Fs+%3Fp+%3Fo%0A++++++++FILTER+(%3Fs+%3D+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FBurmese_name%3E)%0A++++++%7D+LIMIT+50+%7D%0A++++%7D%0A++%7D%0A%7D&contentTypeConstruct=text%2Fturtle&contentTypeSelect=application%2Fsparql-results%2Bjson&endpoint=http%3A%2F%2Fmclient.aksw.org%2Fsparql&requestMethod=POST&tabTitle=Check+Post+Processing&headers=%7B%7D&outputFormat=table). In the end, no range will ever be disjoint with owl:Thing.
Any thoughts on that?
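The "one file per distance in the class hierarchy" idea could be sketched roughly like this. The tiny hierarchy below is made up for illustration; in DBpedia, dbo:spouse and dbo:parent have rdfs:range dbo:Person:

```python
# Rough sketch of bucketing triples by how far the object's type sits
# above the property's rdfs:range in the class hierarchy. The hierarchy
# here is a made-up toy, not the real DBpedia ontology.
SUBCLASS_OF = {          # child -> parent, rooted at owl:Thing
    "dbo:Person": "dbo:Agent",
    "dbo:Agent": "owl:Thing",
    "dbo:Name": "owl:Thing",
}

def generality(entity_type, expected_range):
    """Steps up from expected_range to entity_type: 0 = exact match,
    1 = entity is typed only with the range's parent class, and so on.
    None means entity_type is not on the range's ancestor path."""
    steps, current = 0, expected_range
    while current is not None:
        if current == entity_type:
            return steps
        current = SUBCLASS_OF.get(current)
        steps += 1
    return None

print(generality("dbo:Person", "dbo:Person"))  # 0: exact match, regular file
print(generality("dbo:Agent", "dbo:Person"))   # 1: parent class only (Bates_family case)
print(generality("owl:Thing", "dbo:Person"))   # 2: effectively untyped (Burmese_name case)
print(generality("dbo:Name", "dbo:Person"))    # None: off the ancestor path
```

Each distance value would map to its own output file, and the owl:Thing case gets a finite (maximal) distance rather than silently passing a disjointness check.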
I have created a SPARQL query to get instances where spouse and parent are the same. It returned 44 results, which are below. Query result: To correct this I made two changes:
After making the two changes I tested the above query again, and the incorrect results are reduced to only 4: 3 of them are due to wrong classification of the entity by some extractor, and the reason for the remaining 1 is unknown.
Remaining 4 incorrect records:

| # | Resource | Parent and Spouse |
|---|----------|-------------------|
| 1 | http://dbpedia.org/resource/Benjamin_Bates_IV | http://dbpedia.org/resource/Bates_family |
| 2 | http://dbpedia.org/resource/Aung_Lwin | http://dbpedia.org/resource/Burmese_name |
| 3 | http://dbpedia.org/resource/Gottfried_Graf_von_Bismarck-Schönhausen | http://dbpedia.org/resource/House_of_Hoyos |
| 4 | http://dbpedia.org/resource/Frank_John_William_Goldsmith | http://dbpedia.org/resource/Née |
If we filter out the entities which have no type, a lot of instances will be removed, as Wikipedia-to-DBpedia mapping coverage is still very low. Most of the above-mentioned issues are due to type association, which leads to incorrect results when queried; therefore I propose to introduce an extraction_score property for each entity, based on how its type is inferred. The following could be the step-by-step approach:
The score value can be adjusted by further analysis of the type inference process. With the passage of time, when the mapping coverage is complete, the extraction_score of all entities might reach 1 and this property can then be safely removed. The advantage here is that we can still get correct results via SPARQL queries without decreasing the size of the KG.
What do you think of it?
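The extraction_score proposal above could be sketched like this. The provenance categories, the score values, and the entity-to-provenance assignments are all assumptions for illustration, not part of the framework:

```python
# Hypothetical sketch of the proposed extraction_score annotation. The
# provenance categories and score values below are assumptions, not part
# of the framework.
SCORES = {
    "mapped": 1.0,    # type comes from an explicit infobox mapping
    "inferred": 0.5,  # type guessed by an automatic typing step
    "untyped": 0.0,   # no type beyond owl:Thing
}

entities = {
    "dbr:Jigme_Dorji_Wangchuck": "mapped",
    "dbr:Bates_family": "inferred",
    "dbr:Burmese_name": "untyped",
}

# Consumers keep the full KG but filter at query time, analogous to a
# SPARQL FILTER(?extraction_score >= 0.5) clause.
trusted = sorted(e for e, prov in entities.items() if SCORES[prov] >= 0.5)
print(trusted)
```

The design point is that nothing is deleted from the KG; low-confidence entities are merely annotated so that each consumer can choose its own threshold.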
I think creating and populating these more fine-grained datasets that you posted the image of is a good first approach; then we can see what happens at large scale for an entire extraction.
The idea with such a triple is interesting. But to be fair, I don't know about types inferred by the NIF extractor. So at the moment I think we would only have 1 and 0 as output (where zero means no type other than owl:Thing). So it should already be possible to query this at the moment, IIRC?
@mubashar1199 I really like the query that nobody should be their own parent / own spouse. We should add this as a general test. @Vehnem do we have large-scale plausibility SHACL tests in place already?
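The plausibility test behind @mubashar1199's query can be sketched as a simple check over extracted triples: flag resources whose dbo:parent and dbo:spouse point at the same object. The sample triples below are illustrative:

```python
# Sketch of the plausibility check: flag resources whose dbo:parent and
# dbo:spouse share an object. The sample triples are illustrative only.
def same_parent_and_spouse(triples):
    parents, spouses = {}, {}
    for s, p, o in triples:
        if p == "dbo:parent":
            parents.setdefault(s, set()).add(o)
        elif p == "dbo:spouse":
            spouses.setdefault(s, set()).add(o)
    # A subject is implausible if any object appears in both relations.
    return {s for s in parents if parents[s] & spouses.get(s, set())}

triples = [
    ("dbr:Benjamin_Bates_IV", "dbo:parent", "dbr:Bates_family"),
    ("dbr:Benjamin_Bates_IV", "dbo:spouse", "dbr:Bates_family"),
    ("dbr:Jigme_Dorji_Wangchuck", "dbo:spouse", "dbr:Kesang_Choden"),
]
print(same_parent_and_spouse(triples))  # {'dbr:Benjamin_Bates_IV'}
```

As a SHACL shape this would correspond to a property-pair constraint on dbo:Person requiring dbo:parent and dbo:spouse to be disjoint.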
In a Wikipedia infobox, when a property value contains both a resource and a string literal, the DBpedia framework skips the literal part, which results in incorrect data being populated in DBpedia.
Please see the spouse and parents properties of entity Aung_Lwin
Aung_Lwin Wiki profile
Aung_Lwin DBpedia profile
where Daw is a Burmese_Name
Suggestions:
OR
Please assign this issue to me.