Link: Super-confident predictions

tupini07 commented 5 years ago

Implement the super confident predictions presented in #305 for the link module.

NOTE: The results presented in this issue are no longer relevant: the super-confident functionality has been replaced with sklearn's VotingClassifier, whose results differ slightly (see #359). They are kept here for documentation purposes only and should not be referenced.

tupini07 commented 5 years ago

The table below was generated by running the following command on all catalogs, entities, join methods, and merge methods.

python -m soweego linker link all $catalog $entity -jm $join_method $merge_method

Join methods define how the predictions given by different classifiers are "joined". The options are union and intersection.

Merge methods define how duplicate predictions are dealt with. The options are vote and average.

With vote, the predictions above and below the threshold (0.5) are counted. If 50% or more of the votes are above the threshold, the prediction stays.
With average, all repeated predictions for a pair of entities are averaged to obtain the final prediction. If this is above the threshold, the prediction stays; otherwise it is dropped.
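
A minimal sketch of the two merge rules described above, in plain Python. The function names and the example scores are hypothetical and are not taken from the soweego code base; whether a score exactly equal to the threshold counts as "above" is also an assumption (here it does not).

```python
from statistics import mean

THRESHOLD = 0.5  # threshold value stated in this thread


def keep_by_vote(scores, threshold=THRESHOLD):
    """Keep the pair if at least half of its scores are above the threshold."""
    above = sum(1 for s in scores if s > threshold)
    return above * 2 >= len(scores)


def keep_by_average(scores, threshold=THRESHOLD):
    """Keep the pair if the mean of its scores is above the threshold."""
    return mean(scores) > threshold


# Hypothetical scores given by three classifiers to the same pair of entities
scores = [0.9, 0.4, 0.3]
vote_result = keep_by_vote(scores)        # only 1 of 3 votes is above 0.5
average_result = keep_by_average(scores)  # the mean is roughly 0.53
```

With these scores the two rules disagree: the single high score is outvoted, but it pulls the mean just above the threshold.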

tupini07 commented 5 years ago

Join Method	Merge Method	Catalog	Entity	# of Predictions	Mean	STD
union	average	discogs	musician	41962	0.844120	0.113590
union	average	discogs	band	22915	0.845280	0.108071
union	average	musicbrainz	musician	29276	0.763107	0.178739
union	average	musicbrainz	band	8038	0.786354	0.160570
union	average	imdb	director	4535	0.892913	0.076288
union	average	imdb	actor	44748	0.785855	0.131458
union	average	imdb	producer	1368	0.798094	0.135311
union	average	imdb	musician	43165	0.788824	0.143173
union	average	imdb	writer	6174	0.815480	0.145471
union	vote	discogs	musician	40778	0.999780	0.005602
union	vote	discogs	band	22759	0.999630	0.010452
union	vote	musicbrainz	musician	29284	0.999399	0.007995
union	vote	musicbrainz	band	8046	0.993113	0.024518
union	vote	imdb	director	4569	0.993607	0.036986
union	vote	imdb	actor	44198	0.999783	0.007831
union	vote	imdb	producer	1251	1.0	0.0
union	vote	imdb	musician	40708	0.996481	0.026852
union	vote	imdb	writer	5716	0.999750	0.006734
intersection	average	discogs	musician	41962	0.844120	0.113590
intersection	average	discogs	band	22915	0.845280	0.108071
intersection	average	musicbrainz	musician	29276	0.763107	0.178739
intersection	average	musicbrainz	band	8038	0.786354	0.160570
intersection	average	imdb	director	4535	0.892913	0.076288
intersection	average	imdb	actor	44748	0.785855	0.131458
intersection	average	imdb	producer	1368	0.798094	0.135311
intersection	average	imdb	musician	43165	0.788824	0.143173
intersection	average	imdb	writer	6174	0.815480	0.145471
intersection	vote	discogs	musician	40778	0.999780	0.005602
intersection	vote	discogs	band	22759	0.999630	0.010452
intersection	vote	musicbrainz	musician	29284	0.999399	0.007995
intersection	vote	musicbrainz	band	8046	0.993113	0.024518
intersection	vote	imdb	director	4569	0.993607	0.036986
intersection	vote	imdb	actor	44198	0.999783	0.007831
intersection	vote	imdb	producer	1251	1.0	0.0
intersection	vote	imdb	musician	40708	0.996481	0.026852
intersection	vote	imdb	writer	5716	0.999750	0.006734

tupini07 commented 5 years ago

Figure_1

I've normalized the counts within each group (each catalog/entity pair separately), and then taken the logarithm of the result so that small differences are easier to see.
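
The transformation can be sketched as follows. This is only an illustration: the comment does not say how the counts were normalized or which logarithm base was used, so dividing by the group total and using the natural logarithm are both assumptions. The counts are the discogs/musician rows of the table in this thread.

```python
import math

# Number of predictions for discogs/musician, per merge method (from the table)
counts = {"average": 41962, "vote": 40778}


def log_normalized(group_counts):
    """Divide each count by the group total, then take the natural log."""
    total = sum(group_counts.values())
    return {method: math.log(n / total) for method, n in group_counts.items()}


log_shares = log_normalized(counts)
```

Because every share is below 1, all resulting values are negative; the method with more predictions gets the value closer to zero.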

tupini07 commented 5 years ago

Figure_1

The first figure shows the mean prediction obtained by each method. The second shows the standard deviations.

One explanation for why the mean of the vote methods is so high: once the votes are counted and the prediction is kept, the highest vote is assigned as the final prediction. The LinearSVC model always predicts either 0 or 1, so whenever it predicts 1, that is the highest prediction and therefore the one finally used.

As for the standard deviations, the average methods may be noisier because they treat predictions from different models as equivalent, even though each model's predictions come from a different distribution (for example, LinearSVC always predicts either 0 or 1).
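
A small worked example of this effect, assuming one classifier that outputs only 0 or 1 alongside two probabilistic ones. The helper names and the scores are hypothetical.

```python
from statistics import mean


def final_score_vote(scores, threshold=0.5):
    """If the vote passes, the highest score becomes the final prediction."""
    above = sum(1 for s in scores if s > threshold)
    if above * 2 >= len(scores):
        return max(scores)
    return None  # prediction dropped


def final_score_average(scores, threshold=0.5):
    """The mean of the scores is the final prediction, if above the threshold."""
    avg = mean(scores)
    return avg if avg > threshold else None


# The first score stands in for a 0/1 output such as LinearSVC's
scores = [1.0, 0.7, 0.6]
by_vote = final_score_vote(scores)        # the 0/1 classifier's 1.0 wins
by_average = final_score_average(scores)  # roughly 0.77
```

Whenever the 0/1 classifier votes for a kept pair, the vote method reports exactly 1.0, which would push the mean toward 1 and the standard deviation toward 0.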

tupini07 commented 5 years ago

Necessary changes have been made to the linker module and results have been posted to this issue.

The changes are in the super-confident-predictions branch, and will be merged into master once #349 is complete.

tupini07 commented 5 years ago

Reopening issue: when the union or intersection join produces duplicate predictions, we should use a majority vote to decide whether a prediction ends up in the final "linking" set.

Currently we take the "optimistic", or best, prediction as the final one and discard the duplicates.

tupini07 commented 5 years ago

As an idea: we could separate the join method (union or intersection) from the combination method (average or majority vote), and let the user decide which combination to use.
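
A rough sketch of the join step as an independent, user-selectable option. It assumes that union keeps every pair predicted by at least one classifier and that intersection keeps only pairs predicted by all of them; the thread does not spell this out. `join_predictions` and the pair identifiers are hypothetical.

```python
def join_predictions(predictions, method):
    """Group scores by pair across classifiers.

    predictions: list of dicts (one per classifier) mapping pair -> score.
    method: "union" or "intersection".
    Returns a dict mapping each retained pair to its list of scores.
    """
    if not predictions:
        return {}
    key_sets = [set(p) for p in predictions]
    if method == "union":
        pairs = set().union(*key_sets)
    elif method == "intersection":
        pairs = set.intersection(*key_sets)
    else:
        raise ValueError(f"unknown join method: {method}")
    return {pair: [p[pair] for p in predictions if pair in p] for pair in pairs}


# Hypothetical output of two classifiers
clf_a = {("item_1", "id_1"): 0.9, ("item_2", "id_2"): 0.6}
clf_b = {("item_1", "id_1"): 1.0}

joined_union = join_predictions([clf_a, clf_b], "union")
joined_intersection = join_predictions([clf_a, clf_b], "intersection")
```

The combination method chosen by the user (average or majority vote) would then be applied to each list of scores, so the two options stay independent of each other.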

Wikidata / soweego

Link: Super-confident predictions #350