Closed tupini07 closed 4 years ago
I've taken all the final predictions for imdb/musician, discogs/musician, and musicbrainz/musician given by gate_classifier and joined them. Below we can see the result of counting how many times a QID appears. The plot shows how frequent each count is. For instance, we can see that it is very common for a QID to appear 5 times in total.
The plot below shows the frequency of the counts of QIDs (like the one above), but only for those entries which are predicted as a match.
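The counting behind both plots can be sketched as follows. This is a minimal illustration, not the code used to produce the figures: the tuple layout, the sample rows, and the helper name `count_frequencies` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical joined predictions: (catalog, QID, catalog ID, is_match).
# The real data comes from the classifier output; these rows are made up.
predictions = [
    ("imdb", "Q1", "nm001", True),
    ("discogs", "Q1", "d17", False),
    ("musicbrainz", "Q1", "mb9", False),
    ("imdb", "Q2", "nm002", False),
    ("discogs", "Q2", "d18", False),
    ("musicbrainz", "Q3", "mb10", True),
]


def count_frequencies(rows):
    """Map 'times a QID appears' to 'number of QIDs with that count'."""
    per_qid = Counter(qid for _, qid, _, _ in rows)
    return Counter(per_qid.values())


# First plot: all predictions.
all_freq = count_frequencies(predictions)
# Second plot: only predictions marked as a match.
match_freq = count_frequencies([row for row in predictions if row[3]])
```

Plotting `all_freq` and `match_freq` as bar charts would give the two histograms described above.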
Information on the data (musician classifications) used:
1. 2_111_164
2. 33_357 (1.58% of total predictions)
3. 230_417
4. 29_644 (12.86% of unique QIDs)
5. 103_626 (44.97% of unique QIDs)
6. 172_128 (74.7% of unique QIDs)
7. 42 (0.01% of unique QIDs)
8. 2_121 (0.92% of unique QIDs)
9. 2_121
10. 29_644 (12.86% of unique QIDs)
11. 12_611 (5.47% of unique QIDs)
Of special interest for our problem are the following points: #3 tells us how many unique Wikidata entities we're working with. #5 and #6 tell us how many WD entities have been compared against entities in all catalogs (imdb, discogs, and musicbrainz) or in at least two catalogs, respectively.
If we were to implement the mix overlapping procedure, it would end up affecting 29_644 (point #10) WD entities, for two reasons: 1) we don't need to consider overlapping predictions that have all been marked as non-matches, since the only way to mix them is to leave them as non-matches; 2) it only makes sense to consider entities that have been compared with more than one catalog and matched with at least one of them.
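The two conditions can be expressed as a simple filter. This is only a sketch under assumptions: the `(catalog, QID, is_match)` row layout, the sample data, and the function name `affected_qids` are hypothetical and not taken from the project code.

```python
from collections import defaultdict


def affected_qids(rows):
    """Return QIDs compared with more than one catalog and matched at least once.

    `rows` is an iterable of (catalog, qid, is_match) tuples (assumed layout).
    """
    catalogs = defaultdict(set)  # QID -> catalogs it was compared against
    matched = set()              # QIDs with at least one predicted match
    for catalog, qid, is_match in rows:
        catalogs[qid].add(catalog)
        if is_match:
            matched.add(qid)
    return {qid for qid in matched if len(catalogs[qid]) > 1}


# Made-up example rows.
sample = [
    ("imdb", "Q1", True),          # matched, seen in 2 catalogs: affected
    ("discogs", "Q1", False),
    ("imdb", "Q2", False),         # never matched: not affected
    ("discogs", "Q2", False),
    ("musicbrainz", "Q3", True),   # matched, but only 1 catalog: not affected
]
affected = affected_qids(sample)
```

Here only `Q1` satisfies both conditions, so it is the only entity a mixing step would touch.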
For the moment, exploiting the relations among entities is outside the scope of the project. For this reason I've added the won't fix label and I'll close the issue. In the future, this issue may be reconsidered.
As said in #369, we might get overlapping predictions after classifying similar groups of entities (i.e., for imdb/musician, discogs/musician, and musicbrainz/musician there might be some predictions which map to the same Wikidata entity). This task is to check how common this actually is.
musician
dataset