NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Single Concept Classifier for handling label imbalance #538

Open mo-fu opened 3 years ago

mo-fu commented 3 years ago

In automated subject indexing (and multi-label classification in general), the distribution of assigned concepts often follows Zipf's law. In our experience this leads to algorithms having low precision on the most frequently assigned concepts. The largest complaint of the subject indexers at ZBW in our last review was how frequently our top two concepts were predicted. As a remedy, we evaluated assigning the most frequent concepts individually. While this led to a minor decrease in F1 ("samples" avg.), it provided benefits in precision and F1 (both "binary" avg.) for single concepts. Here are some results for our top concepts, Theory and USA.

| Classifier | Concept(s) used for metric evaluation | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| SVC trained for Theory | Theory | 0.618 | 0.546 | 0.712 |
| Omikuji trained on all concepts | Theory | 0.587 | 0.458 | 0.814 |
| SVC trained for USA | USA | 0.589 | 0.559 | 0.628 |
| Omikuji trained on all concepts | USA | 0.537 | 0.429 | 0.717 |
| Omikuji trained on all concepts | All concepts | 0.4580 | 0.509 | 0.477 |
| Omikuji trained on all concepts except USA and Theory, combined with an individual SVC for each of the two concepts | All concepts | 0.4579 | 0.518 | 0.472 |

Note that the last setup still uses individual thresholds for its three classifiers. I think using a (neural network) ensemble to combine the results would probably allow use of a single threshold.
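
To make the two averaging modes used in the table concrete, here is a minimal sketch in plain Python. The toy documents and the helper names (`f1`, `binary_f1`, `samples_f1`) are made up for illustration and are not Annif code. The "binary" average scores one concept across all documents, while the "samples" average computes F1 per document and then takes the mean.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def binary_f1(gold, predicted, concept):
    """F1 for one concept, counted over all documents ("binary" average)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if concept in g and concept in p)
    fp = sum(1 for g, p in pairs if concept not in g and concept in p)
    fn = sum(1 for g, p in pairs if concept in g and concept not in p)
    return f1(tp, fp, fn)


def samples_f1(gold, predicted):
    """F1 computed per document, then averaged ("samples" average)."""
    scores = [f1(len(g & p), len(p - g), len(g - p)) for g, p in zip(gold, predicted)]
    return sum(scores) / len(scores)


# Toy data (hypothetical): gold and predicted concept sets for three documents.
gold = [{"theory", "usa"}, {"usa"}, {"trade"}]
predicted = [{"theory"}, {"usa", "theory"}, {"trade", "theory"}]

theory_f1 = binary_f1(gold, predicted, "theory")  # over-predicted concept
overall_f1 = samples_f1(gold, predicted)
```

In this toy data "theory" is predicted for every document, so its binary precision is low even though the per-document scores look acceptable, which is the pattern described above.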

I am opening this issue to discuss whether there is interest in bringing this functionality to Annif, and to discuss some implementation details. Adding or modifying existing classifiers (FastText or SVC) to support single classes is straightforward. But there are some details regarding overall architecture that are not straightforward to handle:

Looking forward to hearing your thoughts on the topic.

osma commented 3 years ago

Thanks for the suggestion. I think it makes sense to try to support this in Annif, although one would hope that individual algorithms could deal with this kind of imbalance better than they seem to do.

Suppose you want to have a setup of the kind you describe, with Omikuji handling almost every concept but the two most frequent ones using SVC - what would the Annif configuration look like then? Would this involve some special kind of ensemble project delegating to specialized Omikuji and SVC projects? I think that thinking about the configuration aspect would clarify questions about metrics and concept exclusion.

mo-fu commented 3 years ago

My original plan was to do something modular. So the individual classifiers could also be of fasttext type. Config would be like the following, excluding some stuff for brevity.

[svc_stw_en_theory]
single_concept=http://zbw.eu/stw/descriptor/19073-6
vocab=stw_9_10
backend=svc

[svc_stw_en_usa]
single_concept=http://zbw.eu/stw/descriptor/17829-1
vocab=stw_9_10
backend=svc

[omikuji_stw_en]
vocab=stw_9_10
exclude_concepts=http://zbw.eu/stw/descriptor/19073-6,http://zbw.eu/stw/descriptor/17829-1
backend=omikuji

[combined_stw_en]
backend=nn_ensemble
vocab=stw_9_10
sources=omikuji_stw_en,svc_stw_en_usa,svc_stw_en_theory

Other sources could possibly be added to the ensemble as well. The exclusion of concepts is optional: you could see the results of the SVC as help for correcting the omikuji backend. If the exclusion is implemented, it would probably be general functionality of the backend base class, and it would also affect the eval command.
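
As a rough illustration of what the two proposed options could do to the training targets, here is a sketch in plain Python. The function names and the `http://example.org/other` URI are hypothetical; neither `single_concept` nor `exclude_concepts` exists in Annif at the time of this discussion, and a real implementation would live in the backend base class rather than in free functions.

```python
THEORY = "http://zbw.eu/stw/descriptor/19073-6"
USA = "http://zbw.eu/stw/descriptor/17829-1"
OTHER = "http://example.org/other"  # placeholder URI, not a real STW concept


def exclude_concepts(subject_sets, excluded):
    """Drop the excluded URIs from each document's subject set."""
    excluded = set(excluded)
    return [set(subjects) - excluded for subjects in subject_sets]


def single_concept_labels(subject_sets, concept):
    """Turn subject sets into 0/1 labels for one concept."""
    return [1 if concept in subjects else 0 for subjects in subject_sets]


# Hypothetical gold subjects for three training documents.
docs = [{THEORY, OTHER}, {USA}, {OTHER}]

# Targets for the project with exclude_concepts set.
omikuji_targets = exclude_concepts(docs, [THEORY, USA])

# Targets for the project with single_concept set to Theory.
theory_targets = single_concept_labels(docs, THEORY)
```

The same filtering would have to be applied to the gold standard in the eval command, otherwise the excluded concepts would count as misses.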

osma commented 3 years ago

That looks very reasonable!

Instead of a config option like single_concept (and exclude_concept), would it make sense to think of this as whitelist and blacklist? I.e. both of these settings would be used to filter the set of subjects given to the project/backend (affecting both training examples and evaluation).

Though I guess single_concept operation is a bit special, as the algorithm then would have to make a binary decision (relevant vs. non-relevant). If a whitelist option were used instead, then it could be given more than one concept (say A, B, C), and what would the distinction then be? A vs B vs C? Or A vs B vs C vs none-of-them?

Just thinking aloud here...maybe your suggestion is better anyway, just trying to think this through and come up with a generic mechanism that might be useful perhaps even more broadly than the scenario that you describe.

mo-fu commented 3 years ago

whitelist/blacklist would definitely be more general. I think the single concept case would probably require some additional work (mostly regarding metrics). Also, single concept would allow for easy sampling, i.e. 50% of the data having the concept assigned and 50% without it.
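
The 50/50 sampling idea could look roughly like the sketch below. It is plain Python with a hypothetical helper name (`balanced_sample`) and toy data; it downsamples the larger group, which is one possible reading of the idea and not necessarily what was used in the evaluation above.

```python
import random


def balanced_sample(subject_sets, concept, seed=0):
    """Return document indices with equally many positives and negatives."""
    positives = [i for i, s in enumerate(subject_sets) if concept in s]
    negatives = [i for i, s in enumerate(subject_sets) if concept not in s]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(chosen)
    return chosen


# Toy corpus (hypothetical): two documents with "usa", four without.
docs = [{"usa"}, {"theory"}, {"usa", "theory"}, {"trade"}, {"trade"}, {"labour"}]
indices = balanced_sample(docs, "usa")
```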

Edit: a quick example of the logic required to handle the single concept case: the scikit-learn SVC requires different input for a single class (1-D array) vs. multiple classes (2-D array with each row holding the labels for one sample). Nothing bad, just something to keep in mind.
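
A small sketch of that shape difference, assuming scikit-learn's `LinearSVC`. The comment does not say how the 2-D case is set up, so wrapping the estimator in `OneVsRestClassifier` is an assumption made here for illustration, as are the toy feature values.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy feature matrix: four documents, two features (made-up values).
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# Single concept: y is a 1-D array, one 0/1 label per document.
y_single = np.array([1, 1, 0, 0])
single = LinearSVC().fit(X, y_single)

# Several concepts: y is a 2-D indicator matrix, one row per document
# and one column per concept. LinearSVC alone does not accept this shape,
# so it is wrapped in a one-vs-rest classifier (an assumption, see above).
y_multi = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
multi = OneVsRestClassifier(LinearSVC()).fit(X, y_multi)

single_pred = single.predict(X)  # 1-D: one label per document
multi_pred = multi.predict(X)    # 2-D: one row of labels per document
```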

osma commented 3 years ago

blacklist is basically the same as exclude_concept, just a different name.

The problem I see with whitelist is that it's not clear what should happen if you give it many concepts instead of just one. As you mention, the single concept case is different also from the point of view of the algorithms and needs separate logic.

So in terms of configuration, we could certainly do this (your example above, just renamed the setting):

[svc_stw_en_theory]
whitelist=http://zbw.eu/stw/descriptor/19073-6
vocab=stw_9_10
backend=svc

but what happens if we do this:

[svc_stw_en_usa_theory]
whitelist=http://zbw.eu/stw/descriptor/17829-1,http://zbw.eu/stw/descriptor/19073-6
vocab=stw_9_10
backend=svc

What would the SVC try to distinguish then? USA vs. Theory vs. nothing?

What if we change the backend to Omikuji (which supports multi-label, unlike SVC), would it then be different?

In short, whitelist is appealing because it seems more generic than single_concept, but it's not clear how it should be interpreted and implemented in the case of more than one value. Is there a use case for e.g. an Omikuji project which is only trained on a small subset of concepts? Would it help in your imbalanced case to have one Omikuji project that only handles USA and Theory, and another Omikuji that takes care of all the other concepts? Or is using e.g. SVC projects trained on single concepts the only useful solution in that situation?

mo-fu commented 3 years ago

I think for more than one concept the algorithms should just behave as they would normally, just on that subset of concepts. Actually it may be a good starting point to implement whitelist for multiple concepts and raise an exception when there is only one. And then tackle the single concept case afterwards.
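
A first step along those lines could be a validation like the following sketch. The function name `parse_whitelist` is hypothetical and `ValueError` is used only as a stand-in; Annif would presumably raise one of its own configuration exceptions instead.

```python
def parse_whitelist(value):
    """Split a comma-separated option value into a list of concept URIs.

    Rejects fewer than two concepts, because the single-concept case
    needs separate logic that is not covered by this first step.
    """
    uris = [uri.strip() for uri in value.split(",") if uri.strip()]
    if len(uris) < 2:
        raise ValueError(
            f"whitelist needs at least two concepts, got {len(uris)}; "
            "the single-concept case is not supported yet"
        )
    return uris


both = parse_whitelist(
    "http://zbw.eu/stw/descriptor/17829-1,http://zbw.eu/stw/descriptor/19073-6"
)
```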

I haven't evaluated it, but I could imagine Omikuji for a subset of concepts. At one point we had the idea of using separate classifiers for a sub-thesaurus (e.g., geographic concepts).

osma commented 3 years ago

> I think for more than one concept the algorithms should just behave as they would normally, just on that subset of concepts. Actually it may be a good starting point to implement whitelist for multiple concepts and raise an exception when there is only one. And then tackle the single concept case afterwards.

That sounds like a plan!

> I haven't evaluated it, but I could imagine Omikuji for a subset of concepts. At one point we had the idea of using separate classifiers for a sub-thesaurus (e.g., geographic concepts).

I can imagine that there could be whitelisting/blacklisting not just by individual concept URIs, but also by e.g. rdf:type or SKOS concept scheme, or maybe even SKOS collection. Something like:

# Omikuji with only YSO places
[yso-omikuji-places]
backend=omikuji
vocab=yso-fi
whitelist_scheme=http://www.yso.fi/onto/yso/places

# Omikuji with only general YSO concepts
[yso-omikuji-general]
backend=omikuji
vocab=yso-fi
whitelist_type=http://www.yso.fi/onto/yso-meta/Concept

# Omikuji that excludes YSO hierarchical concepts
[yso-omikuji-no-hierarchy]
backend=omikuji
vocab=yso-fi
blacklist_type=http://www.yso.fi/onto/yso-meta/Hierarchy

# Omikuji with only YSO concepts in the Archaeology group
[yso-omikuji-archaeology]
backend=omikuji
vocab=yso-fi
whitelist_collection=http://www.yso.fi/onto/yso/p26593

I'm not saying that these should be supported in the first iteration of blacklist/whitelist features, just that it would be possible to expand that functionality in these directions if desired.

As for combining whitelist and blacklist rules, I think each of these rules should be applied separately on the whole vocabulary to create a subset and the final set of concepts/subjects should be the intersection of those subsets.
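
The intersection rule described here can be sketched in a few lines of plain Python. The function name `select_concepts` and the one-letter concept names are placeholders for illustration only.

```python
def select_concepts(vocabulary, whitelist=None, blacklist=None):
    """Apply each rule to the whole vocabulary, then intersect the subsets."""
    vocabulary = set(vocabulary)
    subsets = []
    if whitelist is not None:
        # A whitelist rule keeps only the listed concepts.
        subsets.append(vocabulary & set(whitelist))
    if blacklist is not None:
        # A blacklist rule keeps everything except the listed concepts.
        subsets.append(vocabulary - set(blacklist))
    selected = vocabulary
    for subset in subsets:
        selected = selected & subset
    return selected


vocab = {"A", "B", "C", "D"}
result = select_concepts(vocab, whitelist={"A", "B", "C"}, blacklist={"C"})
```

With this rule a concept that appears in both lists is dropped, since it is missing from the blacklist subset.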

osma commented 3 years ago

> As for combining whitelist and blacklist rules, I think each of these rules should be applied separately on the whole vocabulary to create a subset and the final set of concepts/subjects should be the intersection of those subsets.

On second thought, maybe it would be better to give precedence to whitelist rules, when both whitelists and blacklists are specified. Also, it seems that those terms are going out of favor (maybe) because of possible connotations - the Linux kernel has switched to allowlist/denylist.

I think we need a new issue for this discussion, which is a bit separate from the original idea of a Single Concept Classifier.

osma commented 1 year ago

> I think we need a new issue for this discussion, which is a bit separate from the original idea of a Single Concept Classifier.

Nearly two years later, we now have that issue: #735 :tada: