dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Info box property having both resource and string literal #713

Open mubashar1199 opened 3 years ago

mubashar1199 commented 3 years ago

In a Wikipedia infobox, when a property value contains both a resource and a string literal, the DBpedia framework skips the literal part, which results in incorrect data being populated in DBpedia.

Please see the spouse and parents properties of entity Aung_Lwin

Aung_Lwin Wikipedia profile

Aung_Lwin DBpedia profile

where Daw is a Burmese_name

Suggestions:

  1. Currently, triples generated by Extractors can use either a resource or a string as the object. A compound object containing both a resource and a literal could be introduced and used as the value in triples.

OR

  1. When a property value contains both a resource and a string literal, extract the complete value as a string and skip linking the resource.
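To make the reported failure mode concrete, here is a minimal Python sketch (not the actual extraction-framework code; the parsing logic and names are simplified assumptions) of how a mixed value such as the spouse field of Aung_Lwin loses its literal part once a wiki link is found:

```python
import re

# Minimal illustrative sketch -- NOT the actual extraction-framework code.
# An infobox value can mix a wiki link and plain text, e.g. Aung_Lwin's
# spouse field, where the title "Daw" links to Burmese_name but the actual
# name is an unlinked string literal.
LINK = re.compile(r"\[\[([^|\]]+)(?:\|[^\]]+)?\]\]")

def extract_object(value):
    """Mimic the reported behaviour: if any wiki link is present, return
    only the linked resource and silently drop the literal remainder."""
    m = LINK.search(value)
    if m:
        return ("resource", m.group(1).replace(" ", "_"))
    return ("literal", value.strip())

# The literal name after the link is lost:
print(extract_object("[[Burmese name|Daw]] Khin Khin Nyunt"))
# -> ('resource', 'Burmese_name')
```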

Please assign this issue to me.

JJ-Author commented 3 years ago

the rdfs:range of dbo:parent is not a literal but dbo:Person. So it is the correct / expected behaviour to never return a string literal for this property. But indeed this incorrect triple (due to a very strange Wikipedia link) is supposed to be filtered out in a post-processing step, since dbr:Burmese_name is not supposed to be a Person (see step 2 here https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/2021.06.01 ). However, this filtering did not seem to work (https://databus.dbpedia.org/yasgui/#query=SELECT+*+%7B%0A++SERVICE+%3Cx-binsearch%3Avfs%3Ahttp4s%3A%2F%2Fdatabus.dbpedia.org%2Fdbpedia%2Fmappings%2Fmappingbased-objects%2F2021.06.01%2Fmappingbased-objects_lang%3Den.ttl.bz2%3E+%0A++%7B+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FAung_Lwin%3E+%3Fp+%3Fo+%7D%0A%7D&contentTypeConstruct=text%2Fturtle&contentTypeSelect=application%2Fsparql-results%2Bjson&endpoint=http%3A%2F%2Fmclient.aksw.org%2Fsparql&requestMethod=POST&tabTitle=Query+1&headers=%7B%7D&outputFormat=table), so that would be worth further investigation.

mubashar1199 commented 3 years ago

> the rdfs:range of dbo:parent is not a literal but dbo:Person. So this is the correct / expected behaviour to never return a string literal for this property. […] So that would be worth for further investigation.

Yes, it should be filtered out in the post-processing, but I think there is also an issue in the extraction process. Even when there is a Person in the property value, it still trims the remainder of the value as soon as it finds a resource in it. Please see the Wikipedia and DBpedia profiles of Jigme Dorji Wangchuck: in DBpedia only Ashi is stored, which is the title of the parent and spouse, not the actual person.

JJ-Author commented 3 years ago

Hm indeed I think 2 triples should be extracted.

Just to make sure, I had a look at the post-processing with a SPARQL query (query link truncated in the original comment) to check whether one of the triples was pruned, which does not seem to be the case.

mubashar1199 commented 3 years ago

I made the following change in the DataParserConfig.scala file:

```scala
// Previous regex
val splitPropertyNodeRegexObject = Map(
  "en" -> """<br\s\/?>|\n| and | or | in |/|;|,"""
)

// New regex
val splitPropertyNodeRegexObject = Map(
  "en" -> """<br\s\/?>|\n| and | or | in |/|;|''|,"""
)
```

and now the required triples are also extracted for Jigme Dorji Wangchuck and other entities. (Result screenshot omitted.)
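The effect of the regex change can be replayed in Python (the two regexes are copied from the change above; the example value mimics the italicised title on Jigme Dorji Wangchuck's infobox):

```python
import re

# The two regexes from the DataParserConfig.scala change above, replayed in
# Python. With the added '' separator, an italicised title such as ''Ashi''
# no longer swallows the name that follows it.
old_regex = r"<br\s/?>|\n| and | or | in |/|;|,"
new_regex = r"<br\s/?>|\n| and | or | in |/|;|''|,"

value = "''Ashi'' Phuntsho Choden"

old_parts = [p.strip() for p in re.split(old_regex, value) if p.strip()]
new_parts = [p.strip() for p in re.split(new_regex, value) if p.strip()]

print(old_parts)  # ["''Ashi'' Phuntsho Choden"] -- a single property node
print(new_parts)  # ['Ashi', 'Phuntsho Choden'] -- title and name split apart
```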

JJ-Author commented 3 years ago

Nice, the example result looks good to me. I wonder why this is language-specific? Is there a need to fix this for other languages as well? However, this is a very critical change (it can affect a lot of resources). @Vehnem @jlareck @kurzum is this something we need a large-scale evaluation for, or is the new strategy to include it and decide after one monthly extraction?

mubashar1199 commented 3 years ago

> Nice, the example result looks good to me. I wonder why this is language specific? Is there need to fix this for other languages as well? […]

I only found instances in English where single quotation marks '' are used as separators when writing a title as a prefix before/after the actual name, therefore I think currently only the English chapter needs these changes.

mubashar1199 commented 3 years ago

I have checked the post-processing file typeconsistencycheck.scala. Currently, non-disjoint triples are also stored in the regular dataset, even when there is a domain/range violation. Due to this, many instances have spouse and other family relations whose objects are Agent and not Person. As this is clearly a range violation, these must be stored in a separate dataset and not the regular one. Please see the entity Benjamin_Bates_IV: it has spouse and parent Bates_family, which is an Agent and not a Person (the range of spouse and parent). There are a lot of examples like this. What do you think about it? Please correct me if I understood something wrong.
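The routing being proposed can be sketched as follows (a hedged illustration with made-up names and a toy class tree, not the actual TypeConsistencyCheck code): a triple whose object type is neither the rdfs:range class nor one of its subclasses goes to a separate violation dataset instead of the regular one.

```python
# Illustrative sketch only -- names and class tree are assumptions, not the
# actual post-processing code.
SUBCLASS_OF = {
    "dbo:Person": "dbo:Agent",
    "dbo:Organisation": "dbo:Agent",
    "dbo:Agent": "owl:Thing",
}

def is_subclass_of(cls, ancestor):
    """Walk up the (toy) class tree from cls looking for ancestor."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

def route_triple(object_type, range_class):
    """Decide which dataset a triple should land in."""
    if is_subclass_of(object_type, range_class):
        return "regular"
    return "range-violation"

# dbo:spouse has rdfs:range dbo:Person; Bates_family is typed dbo:Agent only,
# so its triples should not end up in the regular dataset:
print(route_triple("dbo:Agent", "dbo:Person"))   # 'range-violation'
print(route_triple("dbo:Person", "dbo:Person"))  # 'regular'
```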

mubashar1199 commented 3 years ago

Also please answer this https://forum.dbpedia.org/t/dbpedia-post-processing-for-adhoc-extraction/1412

JJ-Author commented 2 years ago

The relevant post-processing step is called here: https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/-/blob/master/functions.sh#L62. Yeah, I see the problem for Bates_family. I think the underlying problem which led to this potentially confusing decision is that sometimes types are not specific enough when extracted (e.g. sometimes Actors are only typed as Person instead of Actor). So filtering out these triples with an "exact match" for domain/range could have many false positives and lead to many useful triples being filtered out. But I think it makes sense to make progress here and create more fine-grained files, not only for the disjoint case. Maybe it makes sense to have one file per length in the class hierarchy tree (so length zero means it is an exact match, 1 means it is of the type of the parent class only). At the end we can still decide whether we load them, but at least they are separated. But potentially more important is to also cover the case that the entity does not actually have a type. That is the case for Burmese_name, which has only owl:Thing (query link truncated in the original comment). In the end no range will ever be disjoint with owl:Thing.

Any thoughts on that?
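The "one file per length in the class hierarchy tree" idea could be sketched like this (hypothetical helper and toy class tree, purely illustrative): distance 0 is an exact range match, 1 means the entity is typed only as the parent class, and so on up to owl:Thing.

```python
# Toy class tree for illustration -- not the real DBpedia ontology loader.
SUBCLASS_OF = {
    "dbo:Actor": "dbo:Person",
    "dbo:Person": "dbo:Agent",
    "dbo:Agent": "owl:Thing",
}

def hierarchy_distance(entity_type, range_class):
    """Count steps up from range_class until entity_type is reached.
    Returns None if entity_type is not an ancestor of range_class (e.g. when
    the entity's type is *more* specific than the range, which a full
    implementation would treat as a match)."""
    distance, cls = 0, range_class
    while cls is not None:
        if cls == entity_type:
            return distance
        cls = SUBCLASS_OF.get(cls)
        distance += 1
    return None

# For dbo:spouse (rdfs:range dbo:Person):
print(hierarchy_distance("dbo:Person", "dbo:Person"))  # 0 -> exact match
print(hierarchy_distance("dbo:Agent", "dbo:Person"))   # 1 -> parent class only
print(hierarchy_distance("owl:Thing", "dbo:Person"))   # 2 -> effectively untyped
```

Each distance bucket could then be written to its own file, so the decision about which buckets to load stays reversible.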

mubashar1199 commented 2 years ago

I have created a SPARQL query to get instances where spouse and parent are the same; it returned 44 results. (Query result screenshot omitted.) To correct this I made two changes:

  1. Changed the regex to include single quotation marks as separators:

```scala
// Previous regex
val splitPropertyNodeRegexObject = Map(
  "en" -> """<br\s/?>|\n| and | or | in |/|;|,"""
)

// New regex
val splitPropertyNodeRegexObject = Map(
  "en" -> """<br\s/?>|\n| and | or | in |/|;|''|,"""
)
```
  2. Untyped and non-disjoint triples are no longer stored in the regular dataset. For testing I stored them in the disjoint dataset, but later new datasets must be created. (Changes screenshot omitted.)

After making these two changes I tested the above query again, and the incorrect results were reduced to only 4: 3 of them are due to wrong classification of the entity by some extractor, and the reason for the remaining 1 is unknown.

Remaining 4 incorrect records (resource → parent and spouse):

  1. http://dbpedia.org/resource/Benjamin_Bates_IV → http://dbpedia.org/resource/Bates_family
  2. http://dbpedia.org/resource/Aung_Lwin → http://dbpedia.org/resource/Burmese_name
  3. http://dbpedia.org/resource/Gottfried_Graf_von_Bismarck-Schönhausen → http://dbpedia.org/resource/House_of_Hoyos
  4. http://dbpedia.org/resource/Frank_John_William_Goldsmith → http://dbpedia.org/resource/Née
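The query itself was only shared as a screenshot; a plausible reconstruction (an assumption, not the author's exact text) of a check for resources whose spouse and parent are the same entity might look like:

```python
# Reconstruction (assumed; the original query was shared only as a screenshot)
# of a plausibility check: no resource should have the same entity as both
# its spouse and its parent.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?s ?o WHERE {
  ?s dbo:spouse ?o .
  ?s dbo:parent ?o .
}
"""
print(QUERY)
```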

mubashar1199 commented 2 years ago

> The relevant post processing step is called here. https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/-/blob/master/functions.sh#L62 […] In the end no range will ever be disjoint to owl:Thing.
>
> Any thoughts on that?

If we filter out the entities which have no type, a lot of instances will be removed, as the Wikipedia-to-DBpedia mapping coverage is still very small. Most of the above-mentioned issues are due to type association, which leads to incorrect results when queried; therefore I propose to introduce an extraction_score property for each entity, based on how its type was inferred. The following could be a step-by-step approach:

  1. Categorize the type of each entity and assign an extraction_score per category. Category 1: type inferred by mappings (could have extraction_score value 1). Category 2: type inferred by ML predictors like the NIF extractor and by other methods (could have extraction_score value 0.5). Category 3: no type (could have extraction_score value 0).
  2. Add a new extraction_score triple to each entity.
  3. This extraction_score can be used in a SPARQL query by users, depending on how fine-grained they want the results. If they use extraction_score == 1, the result set will only include resources whose type was assigned via the mappings.

The score values can be adjusted by further analysis of the type-inference process. Over time, as the mapping coverage becomes complete, the extraction_score of all entities might reach 1, and the property can then be safely removed. The advantage here is that we can still get correct results via SPARQL without decreasing the size of the KG.
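Under this proposal, and assuming a hypothetical dbo:extractionScore property (it does not exist in the ontology today), a user-side query could look like:

```python
# Hypothetical sketch: dbo:extractionScore does NOT exist today; this only
# illustrates how the proposed property could restrict results to entities
# whose type was assigned via the mappings (score 1).
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?s ?spouse WHERE {
  ?s dbo:spouse ?spouse .
  ?spouse dbo:extractionScore ?score .
  FILTER (?score = 1.0)
}
"""
print(QUERY)
```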

What do you think of it?

JJ-Author commented 2 years ago

I think creating and populating these more fine-grained datasets that you posted in the image is a good first approach; then we can see what happens at large scale for an entire extraction.

The idea with a triple is interesting. But to be fair, I don't know about types inferred by the NIF extractor. So at the moment I think we would only have 1 and 0 as output (where zero means no type other than owl:Thing). So it should be possible to already query this at the moment, IIRC?

JJ-Author commented 2 years ago

@mubashar1199 I really like the query that nobody should be their own parent / own spouse. We should add this as a general test. @Vehnem do we have large-scale plausibility SHACL tests in place already?