To expand a bit on the problem: suppose we have 3 external catalog entities (`A`, `B`, and `C`) and we have one target entity `D`. Currently we can get how likely it is that each of these external entities is `== D`. Say that:

- `A == D` -> 40%
- `B == D` -> 20%
- `C == D` -> 89%

We can also get how likely it is that these external entities are actually the same entity:

- `A == B` -> 10%
- `A == C` -> 20%
- `B == C` -> 87%

In the example we can see that `C == D` is very likely, but `B == D` is not. However, `B == C` is also very likely. So the problem we want to solve is: how probable is it actually that `B == D`, given that (`B == C` AND `C == D`)?
**A note about scaling.** We can't compare all entities that may overlap (i.e., all `musician` entities), because that would be a huge number of comparisons, especially if we introduce more `musician` catalogs in the future. A nice way to cope with this problem is to keep using the current blocking mechanism: take one wikidata entity at a time, get all entities from the external catalogs that have a similar name, then apply the procedure above to that group. It is still a large number of comparisons, but it is much more manageable.
As for features and blocks of samples: we might need to compute these on the fly. For the procedure, we should train a new classifier for each pair of catalogs.
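A hypothetical sketch of how those two points could fit together (`extract_features` and the classifier registry are placeholders, not the actual soweego API): keep one trained classifier per catalog pair and build features lazily when a pair is scored.

```python
# Sketch only: one classifier per catalog pair, features computed on the fly.
classifiers = {}  # (catalog_a, catalog_b) -> fitted classifier

def extract_features(entity_a, entity_b):
    """Placeholder: build the comparison feature vector on the fly,
    e.g. name similarity, shared dates, shared links."""
    raise NotImplementedError

def score_pair(cat_a, entity_a, cat_b, entity_b):
    """Score one cross-catalog pair with the classifier trained for
    that specific pair of catalogs."""
    clf = classifiers[tuple(sorted((cat_a, cat_b)))]
    features = extract_features(entity_a, entity_b)
    # Assumes a scikit-learn-style classifier exposing predict_proba.
    return clf.predict_proba([features])[0][1]
```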
For the moment, exploiting the relations among entities is outside the scope of the project. For this reason I've added the *won't fix* label and I'll close the issue. In the future, this issue may be reconsidered.
We currently have overlapping entities (i.e., `imdb/musician`, `discogs/musician`, and `musicbrainz/musician`) and we're only linking them with the wikidata set. However, since these are overlapping, there may be entities which are actually the same underlying person. We could use this information to improve the current prediction accuracy.