Wikidata / soweego

Link Wikidata items to large catalogs
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
GNU General Public License v3.0

Check how worthwhile it is to mix overlapping predictions #371

tupini07 closed this issue 4 years ago

tupini07 commented 5 years ago

As said in #369, we might get overlapping predictions after classifying similar groups of entities (i.e., for imdb/musician, discogs/musician, and musicbrainz/musician there might be some predictions which map to the same Wikidata entity).

This task is to check how common this actually is.

tupini07 commented 5 years ago

I've taken all the final predictions for imdb/musician, discogs/musician, and musicbrainz/musician given by gate_classifier and joined them. Below we can see the result of counting how many times each QID appears. The plot shows how frequent each count is. For instance, we can see that it is very common for a QID to appear 5 times in total.

[Plot: qid total count]

The plot below shows the frequency of the counts of QIDs (like the one above), but only for those entries which are predicted as a match.

[Plot: qid matches count]
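For reference, the counting behind the two plots can be sketched roughly as follows. This is a minimal illustration and not the code actually used: the `predictions` rows and the `count_frequencies` helper are hypothetical, and the real gate_classifier output may be shaped differently.

```python
from collections import Counter

# Hypothetical joined predictions, one row per (QID, catalog) comparison:
# (qid, catalog, is_match). The real prediction files may look different.
predictions = [
    ("Q1", "imdb", True),
    ("Q1", "discogs", False),
    ("Q1", "musicbrainz", True),
    ("Q2", "imdb", False),
    ("Q2", "discogs", False),
    ("Q3", "musicbrainz", True),
]


def count_frequencies(rows):
    """Map 'times a QID appears' to 'number of QIDs appearing that often'."""
    per_qid = Counter(qid for qid, _catalog, _is_match in rows)
    return Counter(per_qid.values())


# First plot: all predictions. Second plot: only rows predicted as a match.
total_freq = count_frequencies(predictions)
match_freq = count_frequencies([row for row in predictions if row[2]])

print(dict(total_freq))
print(dict(match_freq))
```

Each resulting dictionary holds the data for one bar chart: keys are the counts on the x axis, values are the bar heights.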

Information on the data (musician classifications) used:

  1. Total predictions: 2_111_164
  2. Total number of matches: 33_357 (1.58% of total predictions)
  3. Total number of unique QIDs: 230_417
  4. Unique QIDs among entries predicted as matches: 29_644 (12.86% of unique QIDs)
  5. Unique QIDs which have been compared with all catalogs (matches or not): 103_626 (44.97% of unique QIDs)
  6. Unique QIDs which have been compared with more than one catalog (matches or not): 172_128 (74.7% of unique QIDs)
  7. Unique QIDs which appear as matches in all catalogs: 42 (0.01% of unique QIDs)
  8. Unique QIDs which appear as matches in more than one catalog: 2_121 (0.92% of unique QIDs)
  9. Unique QIDs which have been compared with all catalogs and have been matched with at least one: 2_121
  10. Unique QIDs which have been compared with more than one catalog and appear as a match in at least one of them: 29_644 (12.86% of unique QIDs)
  11. Unique QIDs which have been compared with all catalogs and appear as a match in at least one of them: 12_611 (5.47% of unique QIDs)
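Figures like the ones in this list can be derived by grouping catalogs per QID. The sketch below is only an assumption about how that might be done: the sample rows, the `CATALOGS` set, and the key names in `stats` are made up for illustration and cover only a few of the points.

```python
from collections import defaultdict

# Assumed catalog names for the musician classification.
CATALOGS = {"imdb", "discogs", "musicbrainz"}

# Hypothetical joined predictions: (qid, catalog, is_match).
predictions = [
    ("Q1", "imdb", True),
    ("Q1", "discogs", False),
    ("Q1", "musicbrainz", True),
    ("Q2", "imdb", False),
    ("Q2", "discogs", False),
    ("Q3", "musicbrainz", True),
]

compared = defaultdict(set)  # qid -> catalogs it was compared with
matched = defaultdict(set)   # qid -> catalogs where it was predicted a match
for qid, catalog, is_match in predictions:
    compared[qid].add(catalog)
    if is_match:
        matched[qid].add(catalog)

stats = {
    "unique_qids": len(compared),
    "unique_matched_qids": len(matched),
    "compared_with_all": sum(1 for c in compared.values() if c == CATALOGS),
    "compared_with_many": sum(1 for c in compared.values() if len(c) > 1),
    "matched_in_all": sum(1 for c in matched.values() if c == CATALOGS),
    "matched_in_many": sum(1 for c in matched.values() if len(c) > 1),
}

# Express every figure as a percentage of the unique QIDs.
for name, value in stats.items():
    print(f"{name}: {value} ({100 * value / stats['unique_qids']:.2f}% of unique QIDs)")
```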

Of special interest for our problem is the following point:

If we were to implement the procedure for mixing overlapping predictions, it would end up affecting 29_644 Wikidata entities (point 10). There are two reasons for this: 1) we don't need to consider overlapping predictions which have all been marked as non-matches, since the only way to mix them is to leave them as non-matches; 2) it only makes sense to consider entities which have been compared with more than one catalog and matched with at least one of them.
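The two conditions above amount to a simple filter over the joined predictions. The following is a rough sketch under assumed data; `mixing_candidates` is a hypothetical helper and not part of soweego.

```python
from collections import defaultdict

# Hypothetical joined predictions: (qid, catalog, is_match).
predictions = [
    ("Q1", "imdb", True),
    ("Q1", "discogs", False),
    ("Q1", "musicbrainz", True),
    ("Q2", "imdb", False),
    ("Q2", "discogs", False),       # compared twice, never matched
    ("Q3", "musicbrainz", True),    # matched, but only one catalog
    ("Q4", "imdb", False),
    ("Q4", "discogs", True),
]


def mixing_candidates(rows):
    """Return QIDs compared with more than one catalog and matched at least once."""
    compared = defaultdict(set)
    has_match = set()
    for qid, catalog, is_match in rows:
        compared[qid].add(catalog)
        if is_match:
            has_match.add(qid)
    return {qid for qid, cats in compared.items() if len(cats) > 1 and qid in has_match}


print(sorted(mixing_candidates(predictions)))
```

In this toy data, Q2 is dropped because all of its predictions are non-matches, and Q3 is dropped because there is nothing to mix with a single catalog.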

tupini07 commented 4 years ago

For the moment, exploiting the relations among entities is outside the scope of the project. For this reason I've added the won't fix label and I'll close the issue. In the future, this issue may be reconsidered.