Wikidata / soweego

Link Wikidata items to large catalogs
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
GNU General Public License v3.0

Feature engineering #214

Open marfox opened 5 years ago

tupini07 commented 5 years ago

I may have a suggestion, at least for the IMDb data: in most cases we have the top movies/TV series for which someone is known. Do we have similar data coming from Wikidata?

marfox commented 5 years ago

@tupini07, I think it's a good idea, but the implementation is not trivial. In Wikidata, the person-to-work relation doesn't seem to be there, while the inverse (work-to-person) exists.

For instance, given the director Alex de la Iglesia (Q250627) and the movie El día de la bestia (Q1312929), we would only find the director (P57) property in the movie item.

This also holds for the music domain; see the related #80.
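To illustrate the inverse lookup, here is a minimal sketch that builds a SPARQL query for the Wikidata Query Service, asking for every item that points to a given person through a given property. The function name `works_by_person_query` is hypothetical and not part of soweego; it only builds the query string and does not send it anywhere.

```python
def works_by_person_query(person_qid: str, property_pid: str = "P57") -> str:
    """Build a SPARQL query that finds works pointing to a person.

    The relation is stored on the work item (e.g. movie -> director),
    so we match triples where the person is the object, not the subject.
    The wd: and wdt: prefixes are predefined by the Wikidata Query Service.
    """
    return (
        "SELECT ?work WHERE { "
        f"?work wdt:{property_pid} wd:{person_qid} . "
        "}"
    )


# Hypothetical usage: movies directed by Alex de la Iglesia (Q250627)
query = works_by_person_query("Q250627")
print(query)
```

The same pattern should apply to other work-to-person properties by changing `property_pid`, at the cost of one extra query (or a larger batched query) per person.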

tupini07 commented 5 years ago

Yes, you're right. It might be a bit too complex, and specific to the IMDb dataset only.

Other possible ideas for generic features would be to match the gender and the place of birth/place of death fields. Gender is readily available in most datasets: a quick look at MusicBrainz and IMDb shows that 20% of people in MusicBrainz have a gender, and 100% of those in IMDb.

The occurrence of place of birth/death is much lower (none of the entries in IMDb have a place of birth/death, and 4.5% of those in MusicBrainz have one). However, it might be a powerful feature for those entries that do have it.
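A rough sketch of such a feature, assuming both sides are plain strings and that a missing value on either side should not count as a mismatch by itself. The function name `exact_match` and the `missing` default are assumptions for illustration, not existing soweego code.

```python
from typing import Optional


def exact_match(a: Optional[str], b: Optional[str], missing: float = 0.0) -> float:
    """Return 1.0 if both values are present and equal, 0.0 if they differ.

    When either value is absent, return `missing` instead, so that sparse
    fields (like place of birth/death) can be handled separately from
    real disagreements.
    """
    if a is None or b is None:
        return missing
    # Case-insensitive comparison, ignoring surrounding whitespace
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


# Hypothetical usage
print(exact_match("male", "Male"))            # both present and equal
print(exact_match("male", "female"))          # both present, different
print(exact_match(None, "male", missing=0.5)) # one side missing
```

Given how rarely places are filled in, the value chosen for `missing` probably matters more than the comparison itself, so it is worth treating as a tunable parameter.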

tupini07 commented 5 years ago

Another idea would be to leverage the information we currently have in the IMDb dataset about the main occupations of a person. During the import process these occupations are already transformed into their respective QIDs (as decided in #165), so for each person we basically have a list of QIDs representing their professions.

We could even compare them directly; the recordlinkage library probably provides some functionality to do this.
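As a library-independent sketch of the direct comparison, the two QID lists could be scored by set overlap (Jaccard similarity). The function name `occupation_overlap` is hypothetical, the QIDs below are only illustrative, and wiring this into recordlinkage as a custom comparison is left out on purpose.

```python
from typing import Iterable


def occupation_overlap(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity between two collections of occupation QIDs.

    Returns 0.0 when either side has no occupations, so that an empty
    list never looks like a perfect match.
    """
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical usage: one shared occupation out of three distinct ones
catalog_occupations = ["Q2526255", "Q28389"]
wikidata_occupations = ["Q2526255", "Q33999"]
print(occupation_overlap(catalog_occupations, wikidata_occupations))
```

Jaccard is just one choice here; a simpler binary "at least one shared occupation" feature might work equally well and would be less sensitive to how many occupations each source lists.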