To expand a bit on the problem: suppose we have 3 external catalog entities (`A`, `B`, and `C`) and we have one target entity `D`. Currently we can get how likely it is that each of these external entities is `== D`. Say that:

- `A == D` -> 40%
- `B == D` -> 20%
- `C == D` -> 89%

We can also get how likely it is that these external entities are actually the same entity:

- `A == B` -> 10%
- `A == C` -> 20%
- `B == C` -> 87%

In the example we can see that `C == D` is very likely, but `B == D` is not. However, `B == C` is also very likely. So the problem we want to solve is: how probable is it actually that `B == D`, given that (`B == C` AND `C == D`)?
**A note about scaling.** We can't compare all entities that may overlap (i.e., all `musician` entities), because that would be a huge number of comparisons, especially if we introduce more `musician` catalogs in the future. A nice way to cope with this problem is to keep using the current blocking mechanism: take one wikidata entity at a time, get all entities from the external catalogs that have a similar name, then apply the procedure above to that group. It is still a large number of comparisons, but it is much more manageable.
As for features and blocks of samples: we might need to compute these on the fly. For the procedure, we should train a new classifier for each pair of catalogs.
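A hypothetical sketch of how those two points could fit together (`extract_features` and the classifier registry are placeholders, not the actual soweego API): keep one trained classifier per catalog pair and build features lazily when a pair is scored.

```python
# Sketch only: one classifier per catalog pair, features computed on the fly.
classifiers = {}  # (catalog_a, catalog_b) -> fitted classifier

def extract_features(entity_a, entity_b):
    """Placeholder: build the comparison feature vector on the fly,
    e.g. name similarity, shared dates, shared links."""
    raise NotImplementedError

def score_pair(cat_a, entity_a, cat_b, entity_b):
    """Score one cross-catalog pair with the classifier trained for
    that specific pair of catalogs."""
    clf = classifiers[tuple(sorted((cat_a, cat_b)))]
    features = extract_features(entity_a, entity_b)
    # Assumes a scikit-learn-style classifier exposing predict_proba.
    return clf.predict_proba([features])[0][1]
```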
For the moment, exploiting the relations among entities is outside the scope of the project. For this reason I've added the *won't fix* label and I'll close the issue. In the future, this issue may be reconsidered.
We currently have overlapping entities (i.e., `imdb/musician`, `discogs/musician`, and `musicbrainz/musician`) and we're only linking them with the wikidata set. However, since these are overlapping, there may be entities which are actually the same underlying person. We could use this information to improve the current prediction accuracy.