dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Handle module = {{Infobox musical artist|embed=yes #500

Open DiegoMoussallem opened 7 years ago

DiegoMoussallem commented 7 years ago

Hi, I have been working on NLG (Natural Language Generation) and I have noticed some inconsistencies in the released version of DBpedia that is online.

Persons such as dbr:Will_Smith, dbr:Shakira, and dbr:Mariah_Carey do not contain their most specific class from the DBpedia Ontology. Although their Wikipedia pages provide this information, all of them are only dbo:Person, whereas dbr:Michael_Jackson is a dbo:MusicalArtist. I don't know where these mistakes come from.

Another point is related to rdfs:label and dbo:title. For the same resources, among others, the dbo:occupation values don't have an rdfs:label; they only have dbo:title as a labeling property. Is that right? Shouldn't they contain both?

mgns commented 7 years ago

Before you post issues in the GitHub issue tracker, make sure it's an actual problem with the extraction framework implementation.

(1) (a) Since all the named resources use the infobox template Infobox person, dbo:Person is the most specific type. There is no trivial (and no implemented) approach to know that dbr:Michael_Jackson is a dbo:MusicalArtist. (b) The triple dbr:Michael_Jackson a dbo:MusicalArtist . could be added by post-processing steps, such as SDTypes. If it is not, that's unfortunate and (from my experience) unexpected.

(2) The rdfs:label property could be added to the Infobox person mapping. Currently, only dbo:title is used for the intermediate PersonFunction node.

mgns commented 7 years ago

<http://dbpedia.org/resource/Michael_Jackson> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MusicalArtist> . is contained in the SDTypes dataset as expected ;)

DiegoMoussallem commented 7 years ago

I think you didn't get my point. Probably I haven't explained myself properly. Forget about dbr:Michael_Jackson; he was only an example to show that some other resources do have their most specific class from the DBpedia Ontology instead of only being a dbo:Person.

I tried to show that resources such as dbr:Will_Smith, dbr:Mariah_Carey and dbr:Shakira do not contain a given most specific class. You only got dbr:Michael_Jackson for answering me. But, pls have a look at the others.

Regarding "Before you post issues in the GitHub issue tracker make sure it's an actual problem with the extraction framework implementation.": I asked in the DBpedia Slack channel where I should post inconsistencies and whether there were different places for that (e.g., extraction framework and ontology). The answer I got was to post here.

mgns commented 7 years ago

Then read the answer the other way around.

(1) (a) The most specific type for dbr:Will_Smith, dbr:Mariah_Carey and dbr:Shakira is dbo:Person, because all the named resources use the infobox template Infobox person. There is no trivial (and no implemented) approach to know such resources are of any other type. (b) Any other type could be added by post-processing steps, such as SDTypes. If it is not, that's unfortunate.
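To illustrate (1)(a), here is a minimal Python sketch of typing a page by its outermost template only. The mapping table is a hypothetical, hand-written subset and `top_level_type` is an illustrative helper; neither is taken from the extraction framework, whose real mappings are more detailed.

```python
import re

# Hypothetical, hand-written subset of template-to-class mappings
# (an assumption for illustration, not copied from the mappings wiki).
TEMPLATE_TO_CLASS = {
    "Infobox person": "dbo:Person",
    "Infobox musical artist": "dbo:MusicalArtist",
}

# Simplified stand-in for a page's wikitext.
SAMPLE_PAGE = (
    "{{Infobox person\n"
    "| name = Mariah Carey\n"
    "| module = {{Infobox musical artist|embed=yes}}\n"
    "}}"
)


def top_level_type(wikitext):
    """Return the class mapped to the first (outermost) template, or None."""
    match = re.search(r"\{\{\s*([^|{}\n]+)", wikitext)
    if match is None:
        return None
    return TEMPLATE_TO_CLASS.get(match.group(1).strip())


print(top_level_type(SAMPLE_PAGE))
```

Because only the outermost template name is looked up, a page that starts with Infobox person is typed dbo:Person in this sketch, whatever its parameters contain.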

DiegoMoussallem commented 7 years ago

Hi Magnus, I think there is a misunderstanding between what I'm explaining here and your thoughts. Look, I know that all the named resources use the infobox template Infobox person. I'm not a layperson. When I said "pls have a look at the others", it was because I had already looked at their infoboxes beforehand. They share the same structure and have almost the same records, except for Will_Smith and Shakira, where the information in the infoboxes is sometimes separated by * instead of |.

Then again, have a look.

Mariah Carey has | occupation = {{hlist|Singer|songwriter|actress|record producer}} and | module = {{Infobox musical artist|embed=yes

Michael Jackson has | occupation = <!--Please do not add any more occupations to the list, it is long enough already-->{{hlist|Singer|songwriter|dancer|actor|record producer|businessman|philanthropist}} and module = {{Infobox musical artist|embed=yes

Then, the types should be extracted correctly for Mariah Carey and the others. But in our example only Michael Jackson's was. You explained above: "If it is not, that's unfortunate." It basically means that:

M. Jackson has its m.s. type => it was fortunate
M. Carey has no m.s. type => it was unfortunate

so, the whole issue is around this fact… shall we consider it as a bug?
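To make the pattern concrete, here is a small Python sketch that finds infoboxes embedded through a module parameter. The regular expression, the helper name `embedded_infoboxes`, and the assumptions that numbered variants such as module2 exist and that embed=yes is the first parameter are mine for illustration; this is not code from the extraction framework.

```python
import re

# Matches "| module = {{<template name>|embed=yes".
# Assumptions: numbered variants (module2, ...) may occur, and embed=yes
# is the first parameter of the embedded template.
EMBED_PATTERN = re.compile(
    r"\|\s*module\d*\s*=\s*\{\{\s*([^|{}\n]+?)\s*\|\s*embed\s*=\s*yes",
    re.IGNORECASE,
)

# Simplified wikitext fragment based on the Mariah Carey lines quoted above.
SAMPLE_INFOBOX = (
    "{{Infobox person\n"
    "| occupation = {{hlist|Singer|songwriter|actress|record producer}}\n"
    "| module = {{Infobox musical artist|embed=yes\n"
    "}}\n"
    "}}"
)


def embedded_infoboxes(wikitext):
    """Return the names of templates embedded via a module parameter."""
    return EMBED_PATTERN.findall(wikitext)


print(embedded_infoboxes(SAMPLE_INFOBOX))
```

An extractor could then apply the mapping of each embedded template in addition to the outer Infobox person mapping, which is the behaviour I expected.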

mgns commented 7 years ago

Alright, now it becomes clearer. Sorry, I did not recognize these embedded infoboxes. But they are obviously not regarded at all. There is also no information on genre and instrument for Mariah Carey in the dataset (though it is in this embedded infobox).

As said, <http://dbpedia.org/resource/Michael_Jackson> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MusicalArtist> . is determined by the probabilistic SDTypes algorithm, i.e. it was not extracted from the infobox originally, same as for the other named resources.
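If you want to check this yourself, a minimal Python sketch for listing the rdf:type values of a resource in an N-Triples dump could look like the following. The helper `types_of` is hypothetical and only handles lines made of three IRIs (no literals, no blank nodes), which is enough for type statements.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# The triple quoted above, as a single N-Triples line.
SAMPLE_LINES = [
    "<http://dbpedia.org/resource/Michael_Jackson> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://dbpedia.org/ontology/MusicalArtist> .",
]


def types_of(resource, ntriples_lines):
    """Collect rdf:type objects for one resource from IRI-only N-Triples lines."""
    found = set()
    for line in ntriples_lines:
        parts = line.strip().rstrip(".").split()
        if len(parts) != 3:
            continue  # skip comments, literals with spaces, malformed lines
        subject, predicate, obj = (part.strip("<>") for part in parts)
        if subject == resource and predicate == RDF_TYPE:
            found.add(obj)
    return found


print(types_of("http://dbpedia.org/resource/Michael_Jackson", SAMPLE_LINES))
```

Running it over both the mapping-based types file and the SDTypes file shows which dataset a given type comes from.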

I cannot say why these embedded infoboxes are not regarded, whether there is a reason, or what implications processing them would have on other resources.

The template allows so-called Modules. Possibly these need to be mapped somehow in order to be processed.

So far, I'd not consider it a bug, rather a missing feature if you want to have it ;)

jimkont commented 7 years ago

Hi Diego, can you try and reformulate the issue at stake here? For each case can you write what Wikipedia has, what DBpedia has and what DBpedia should have (based on what info from Wikipedia)?

This will make it easier for everyone to follow your arguments, and it is also future-proof, as a new release is coming in a few weeks that might change the facts. Thanks

jimkont commented 7 years ago

Also, when you copy wikitext from Wikipedia, make sure you use the version that DBpedia extracted and not the latest one. E.g., for M. Jackson you can find it from the prov:wasDerivedFrom (http://www.w3.org/ns/prov#wasDerivedFrom) triple.


DiegoMoussallem commented 7 years ago

Thanks @mgns I'm glad that you understood my point. =)

Hi Dimitris @jimkont, for sure I can do that. Regarding prov:wasDerivedFrom, look at Mariah Carey:

https://en.wikipedia.org/w/index.php?title=Mariah_Carey&diff=cur&oldid=706953826

This is the version that the information about Mariah Carey came from, and the information needed to build the type was there (see the diff). But since, as Magnus explained, it is a probabilistic model, I simply didn't get why this happens. I will read the paper he pointed to, but I guess it is a silly mistake that can be fixed quickly.

Finally, should I post here or send it by email to the mailing list? You can close this issue if it should go by email.
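For anyone who wants to compare revisions programmatically, the revision ID can be read from the oldid query parameter of the URL above. A minimal Python sketch using only the standard library; the helper name `revision_id` is hypothetical:

```python
from urllib.parse import parse_qs, urlparse

DIFF_URL = (
    "https://en.wikipedia.org/w/index.php"
    "?title=Mariah_Carey&diff=cur&oldid=706953826"
)


def revision_id(url):
    """Return the integer value of the oldid query parameter, or None."""
    values = parse_qs(urlparse(url).query).get("oldid")
    return int(values[0]) if values else None


print(revision_id(DIFF_URL))
```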

jimkont commented 7 years ago

Thanks Diego, better to continue here for now. It would be easier for everyone who reads this issue if you copied all the related info inline to avoid lookups.

VladimirAlexiev commented 7 years ago

@DiegoMoussallem I posted a separate issue about occupation: https://github.com/dbpedia/extraction-framework/issues/513. Please change the title to "Handle module = {{Infobox musical artist|embed=yes", and use that issue as an example how to provide info. Cheers!

DiegoMoussallem commented 7 years ago

Done. Also, I'm trying to understand the model mentioned by Magnus before reporting many issues here. It's not only about artists; other resources such as dbr:Brigitte_Bardot are also only a Person according to the DBpedia Ontology.

VladimirAlexiev commented 7 years ago

AFAIK, SDTypes uses machine learning techniques. E.g., if it sees many songs by Michael Jackson, it concludes he's a MusicalArtist (not because it has such a rule, but because it has seen many songs for others that are explicitly marked MusicalArtist).
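A toy Python sketch of that idea follows. It is a deliberately simplified illustration and not the actual SDTypes implementation: it averages per-property type frequencies without any property weighting, and the training data, property labels, threshold, and function names are all made up for the example.

```python
from collections import Counter, defaultdict

# Made-up training data: (properties seen on a resource, its known types).
TRAINING = [
    ({"is dbo:artist of"}, {"dbo:MusicalArtist"}),
    ({"is dbo:artist of", "is dbo:starring of"}, {"dbo:MusicalArtist"}),
    ({"is dbo:starring of"}, {"dbo:Actor"}),
]


def learn_type_distributions(typed_resources):
    """Estimate P(type | property) from resources whose types are known."""
    prop_count = Counter()
    prop_type_count = defaultdict(Counter)
    for properties, types in typed_resources:
        for prop in properties:
            prop_count[prop] += 1
            for rdf_type in types:
                prop_type_count[prop][rdf_type] += 1
    return {
        prop: {t: n / prop_count[prop] for t, n in counts.items()}
        for prop, counts in prop_type_count.items()
    }


def predict_types(properties, distributions, threshold=0.5):
    """Average the per-property distributions and keep types above threshold."""
    known = [p for p in properties if p in distributions]
    scores = Counter()
    for prop in known:
        for rdf_type, prob in distributions[prop].items():
            scores[rdf_type] += prob / len(known)
    return {t for t, score in scores.items() if score >= threshold}


distributions = learn_type_distributions(TRAINING)
print(predict_types({"is dbo:artist of"}, distributions))
```

In this toy data every resource that appears as the artist of something is a MusicalArtist, so an untyped resource with only that property gets the type purely from the statistics, without any explicit rule.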