Wikidata / soweego

Link Wikidata items to large catalogs
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
GNU General Public License v3.0

Link: Super-confident predictions #350

Closed tupini07 closed 5 years ago

tupini07 commented 5 years ago

Implement the super confident predictions presented in #305 for the link module.

NOTE: The results presented in this issue are no longer relevant, since the super-confident functionality has been replaced with sklearn's VotingClassifier, whose results vary slightly (they can be seen in #359). I'm leaving these results here for documentation purposes, but they shouldn't be referenced.

tupini07 commented 5 years ago

The table below was generated by running this command for all catalogs, entities, join methods, and merge methods.

python -m soweego linker link all $catalog $entity -jm $join_method $merge_method

Join methods define how the predictions given by different classifiers are "joined". The possible options are union and intersection.

Merge methods define how duplicate predictions are dealt with. The possible options are vote and average.
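To make the join step concrete, here is a minimal sketch, not the soweego implementation. It assumes each classifier's output is a plain dict mapping a hypothetical (QID, catalog ID) pair to a score; the function name `join_predictions` and the sample data are made up for illustration.

```python
def join_predictions(per_classifier, method="union"):
    """Group scores by candidate pair across classifiers.

    per_classifier: list of dicts {pair: score}, one dict per classifier
    (an assumed data layout, not soweego's actual one).
    Returns {pair: [scores from every classifier that predicted it]}.
    """
    if not per_classifier:
        return {}
    key_sets = [set(preds) for preds in per_classifier]
    if method == "union":
        # Keep a pair if at least one classifier predicted it
        keys = set().union(*key_sets)
    elif method == "intersection":
        # Keep a pair only if every classifier predicted it
        keys = set.intersection(*key_sets)
    else:
        raise ValueError(f"unknown join method: {method}")
    return {k: [p[k] for p in per_classifier if k in p] for k in keys}


# Hypothetical predictions from two classifiers
svc = {("Q1", "a1"): 1.0, ("Q2", "a2"): 0.0}
nb = {("Q1", "a1"): 0.8, ("Q3", "a3"): 0.6}

union = join_predictions([svc, nb], "union")
inter = join_predictions([svc, nb], "intersection")
```

Pairs that end up with more than one score (here `("Q1", "a1")`) are the duplicates that the merge method then has to resolve.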

tupini07 commented 5 years ago

| Join Method | Merge Method | Catalog | Entity | # of Predictions | Mean | STD |
|---|---|---|---|---|---|---|
| union | average | discogs | musician | 41962 | 0.844120 | 0.113590 |
| union | average | discogs | band | 22915 | 0.845280 | 0.108071 |
| union | average | musicbrainz | musician | 29276 | 0.763107 | 0.178739 |
| union | average | musicbrainz | band | 8038 | 0.786354 | 0.160570 |
| union | average | imdb | director | 4535 | 0.892913 | 0.076288 |
| union | average | imdb | actor | 44748 | 0.785855 | 0.131458 |
| union | average | imdb | producer | 1368 | 0.798094 | 0.135311 |
| union | average | imdb | musician | 43165 | 0.788824 | 0.143173 |
| union | average | imdb | writer | 6174 | 0.815480 | 0.145471 |
| union | vote | discogs | musician | 40778 | 0.999780 | 0.005602 |
| union | vote | discogs | band | 22759 | 0.999630 | 0.010452 |
| union | vote | musicbrainz | musician | 29284 | 0.999399 | 0.007995 |
| union | vote | musicbrainz | band | 8046 | 0.993113 | 0.024518 |
| union | vote | imdb | director | 4569 | 0.993607 | 0.036986 |
| union | vote | imdb | actor | 44198 | 0.999783 | 0.007831 |
| union | vote | imdb | producer | 1251 | 1.0 | 0.0 |
| union | vote | imdb | musician | 40708 | 0.996481 | 0.026852 |
| union | vote | imdb | writer | 5716 | 0.999750 | 0.006734 |
| intersection | average | discogs | musician | 41962 | 0.844120 | 0.113590 |
| intersection | average | discogs | band | 22915 | 0.845280 | 0.108071 |
| intersection | average | musicbrainz | musician | 29276 | 0.763107 | 0.178739 |
| intersection | average | musicbrainz | band | 8038 | 0.786354 | 0.160570 |
| intersection | average | imdb | director | 4535 | 0.892913 | 0.076288 |
| intersection | average | imdb | actor | 44748 | 0.785855 | 0.131458 |
| intersection | average | imdb | producer | 1368 | 0.798094 | 0.135311 |
| intersection | average | imdb | musician | 43165 | 0.788824 | 0.143173 |
| intersection | average | imdb | writer | 6174 | 0.815480 | 0.145471 |
| intersection | vote | discogs | musician | 40778 | 0.999780 | 0.005602 |
| intersection | vote | discogs | band | 22759 | 0.999630 | 0.010452 |
| intersection | vote | musicbrainz | musician | 29284 | 0.999399 | 0.007995 |
| intersection | vote | musicbrainz | band | 8046 | 0.993113 | 0.024518 |
| intersection | vote | imdb | director | 4569 | 0.993607 | 0.036986 |
| intersection | vote | imdb | actor | 44198 | 0.999783 | 0.007831 |
| intersection | vote | imdb | producer | 1251 | 1.0 | 0.0 |
| intersection | vote | imdb | musician | 40708 | 0.996481 | 0.026852 |
| intersection | vote | imdb | writer | 5716 | 0.999750 | 0.006734 |

tupini07 commented 5 years ago

Figure_1

I've normalized the counts within each group (each catalog/entity pair separately) and then taken the log of the result, so that small differences are easier to see.
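The transformation can be sketched roughly as follows. The comment does not say which normalization was used, so this sketch assumes each count is divided by the total of its catalog/entity group and that the natural log is taken; the function name `log_normalized_counts` is hypothetical. The counts are taken from the table above.

```python
import math
from collections import defaultdict

# (join method, merge method, catalog, entity, # of predictions)
rows = [
    ("union", "average", "discogs", "musician", 41962),
    ("union", "vote", "discogs", "musician", 40778),
    ("union", "average", "imdb", "producer", 1368),
    ("union", "vote", "imdb", "producer", 1251),
]


def log_normalized_counts(rows):
    """Log of each count's share within its catalog/entity group.

    Assumption: 'normalized' means divided by the group total.
    """
    totals = defaultdict(int)
    for _, _, catalog, entity, count in rows:
        totals[(catalog, entity)] += count

    result = {}
    for join, merge, catalog, entity, count in rows:
        share = count / totals[(catalog, entity)]
        result[(join, merge, catalog, entity)] = math.log(share)
    return result


for key, value in log_normalized_counts(rows).items():
    print(key, round(value, 4))
```

Because every share lies between 0 and 1, all resulting values are negative; the method with more predictions in a group gets the value closer to zero.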

tupini07 commented 5 years ago

Figure_1

Figure_1

The first figure shows the mean prediction obtained by each method. The second shows the standard deviations.

One explanation for why the mean is so high for the methods using vote: after the votes are counted, if the prediction is kept, the highest score among the votes is assigned as the final prediction. The LinearSVC model always predicts either 0 or 1, so whenever it predicts 1, that is the highest score and the one that is finally used.

As for the STDs, the methods using average may be noisier because they treat the predictions from different models as equivalent, even though each model's predictions come from a different distribution (for example, LinearSVC always predicts either 0 or 1).
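The effect described above can be reproduced on toy data. This is only a sketch under assumptions: the 0.5 vote threshold, the strict-majority rule, and the sample scores are all hypothetical, and `merge_vote` / `merge_average` are illustrative names rather than functions from the linker module.

```python
from statistics import mean, pstdev


def merge_vote(scores, threshold=0.5):
    """Keep the pair if a strict majority of scores reach the threshold.

    If kept, the highest score becomes the final prediction;
    otherwise None is returned and the pair is dropped.
    """
    positive = sum(s >= threshold for s in scores)
    if positive * 2 > len(scores):
        return max(scores)
    return None


def merge_average(scores):
    """Treat all classifiers equally and average their scores."""
    return mean(scores)


# Hypothetical scores per pair: [0/1 classifier, probabilistic, probabilistic]
joined = {
    "pair_a": [1.0, 0.9, 0.7],
    "pair_b": [1.0, 0.6, 0.4],
    "pair_c": [0.0, 0.8, 0.3],
}

voted = [v for v in (merge_vote(s) for s in joined.values()) if v is not None]
averaged = [merge_average(s) for s in joined.values()]

print("vote:   ", mean(voted), pstdev(voted))
print("average:", mean(averaged), pstdev(averaged))
```

Whenever the 0/1 classifier votes positive on a kept pair, the maximum is exactly 1, which pulls the vote mean toward 1 and the spread toward 0. Averaging mixes the 0/1 outputs with the probabilistic scores, so the merged values spread out more.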

tupini07 commented 5 years ago

Necessary changes have been made to the linker module and results have been posted to this issue.

The changes are in the super-confident-predictions branch, and will be merged into master once #349 is complete.

tupini07 commented 5 years ago

Reopening issue: when the union or intersection join yields duplicate predictions, we should use a majority vote to decide whether a prediction ends up in the final "linking" set.

Currently we take the "optimistic" (best) prediction as the final one and discard the duplicates.

tupini07 commented 5 years ago

As an idea: we could separate the join method (union or intersection) from the combination method (average or majority vote), and let the user decide which combination to use.