NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Dealing with overrepresented concepts / blacklisting #735

Open annakasprzik opened 10 months ago

annakasprzik commented 10 months ago

Several institutions have observed that some models / ensembles struggle with concepts that are overrepresented in the training data, so that they are suggested far too often. One fix for that is to identify rules that limit the contexts in which those concepts can be suggested. Could we implement something in Annif that allows specifying those rules?

CC @schlawiner @lakshmi-bashyam

related: https://github.com/NatLibFi/Annif/issues/538 ; https://github.com/NatLibFi/Annif/issues/596

osma commented 10 months ago

Thank you for the suggestion. Indeed this seems like a recurring problem, so a generic mechanism could be useful.

This was one of the ideas discussed in issue #538, especially in https://github.com/NatLibFi/Annif/issues/538#issuecomment-979109206 . But there were maybe too many ideas thrown around, and so far nothing has been implemented. So let's keep this issue focused only on the problem of overrepresented concepts and on one possible solution, a way to block problematic concepts, since it seems that both ZBW and ZPID have already decided to use such a mechanism implemented outside Annif.

I think this configuration example from https://github.com/NatLibFi/Annif/issues/538#issuecomment-976532224 is still valid:

[omikuji_stw_en]
vocab=stw_9_10
exclude_concepts=http://zbw.eu/stw/descriptor/19073-6,http://zbw.eu/stw/descriptor/17829-1
backend=omikuji

and the meaning of this would be that the two concepts (USA and Theory) listed in exclude_concepts are ignored both when reading/processing training data and when generating suggestions, but only for this particular project. There could still be other projects using the same vocabulary and the setting would of course not affect those. So in an ensemble, it would be possible to block specific concepts on the level of a particular backend project, if it has a tendency to suggest certain concepts too often without good reason.
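To make the intended semantics concrete, here is a minimal sketch of how such a setting could be applied, assuming the value is a comma-separated list of URIs as in the example above. The function names are hypothetical and not part of Annif; the point is only that the same exclusion set is applied twice, once to the subjects of training documents and once to the suggestions.

```python
from typing import FrozenSet, Iterable, List, Tuple


def parse_exclude_concepts(value: str) -> FrozenSet[str]:
    """Split a comma-separated exclude_concepts value into a set of URIs."""
    return frozenset(uri.strip() for uri in value.split(",") if uri.strip())


def filter_training_subjects(
    subject_uris: Iterable[str], excluded: FrozenSet[str]
) -> List[str]:
    """Drop excluded concepts from the subjects of one training document."""
    return [uri for uri in subject_uris if uri not in excluded]


def filter_suggestions(
    suggestions: Iterable[Tuple[str, float]], excluded: FrozenSet[str]
) -> List[Tuple[str, float]]:
    """Drop excluded concepts from a list of (uri, score) suggestions."""
    return [(uri, score) for uri, score in suggestions if uri not in excluded]
```

Because the set is built per project, two projects sharing the same vocabulary would simply hold different exclusion sets (or none at all).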

As noted in #538, it would make sense to avoid the term "blacklisting" due to connotations. I think "exclude", "block" or "deny" are all valid alternatives.

osma commented 10 months ago

I've thought about the best way to implement something like this in Annif code.

I think this should be a general mechanism and ideally no changes to individual backend implementations should be necessary. This means that the setting should be handled on the level of AnnifProject. One possibility is that SubjectIndex would be made aware of the blocked/excluded concepts, similar to how it already handles deprecated concepts.

For the configuration, this could be implemented as an extra option to the vocab setting. There is already a mechanism to set the vocabulary language using a setting such as vocab=lcsh(en). We could extend that to take another parameter, like this:

vocab=stw(en,exclude=http://zbw.eu/stw/descriptor/19073-6 http://zbw.eu/stw/descriptor/17829-1)

When there's no need to set the language, this could work as well:

vocab=stw(exclude=http://zbw.eu/stw/descriptor/19073-6 http://zbw.eu/stw/descriptor/17829-1)

One minor syntax consideration here is that commas are already used to separate different parameters, so it's not possible to use commas as a separator between concept URIs. Above I've used spaces instead, but other symbols such as | (pipe) could work as well - as long as they are not used in URIs.
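As a rough illustration of the extended syntax, the sketch below parses a vocab value with an optional language and an optional exclude parameter whose URIs are separated by whitespace. This is not the existing Annif parser; the function name, the regular expression and the returned tuple are all assumptions made for the example, and it would break on URIs that themselves contain commas.

```python
import re
from typing import Optional, Set, Tuple

# Hypothetical pattern: a vocabulary id, optionally followed by "(params)".
VOCAB_SPEC = re.compile(r"^(?P<vocab_id>[\w-]+)(?:\((?P<params>.*)\))?$")


def parse_vocab_spec(spec: str) -> Tuple[str, Optional[str], Set[str]]:
    """Return (vocab_id, language, excluded_uris) for a vocab setting value."""
    match = VOCAB_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"invalid vocabulary specification: {spec!r}")
    language: Optional[str] = None
    excluded: Set[str] = set()
    params = match.group("params") or ""
    # Commas separate parameters, so URIs inside exclude= are split on spaces.
    for param in params.split(","):
        param = param.strip()
        if not param:
            continue
        if param.startswith("exclude="):
            excluded.update(param[len("exclude="):].split())
        else:
            language = param
    return match.group("vocab_id"), language, excluded
```

With this sketch, vocab=lcsh(en) yields the language "en" and an empty exclusion set, while both exclude examples above yield the two STW URIs.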

juhoinkinen commented 3 months ago

Just throwing in the idea: could the denylisting be (also) "dynamic", in the sense that the suggest request could include a parameter containing the concepts that are not wanted at that particular time? I think there could be some users of Annif API that could benefit from this.

This could be useful for e.g. university repositories, as many theses and dissertations get the "final projects (education)" concept as an unwanted and redundant suggestion. I assume they now exclude that concept in their own system(?) so that it is not shown to the student.
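A per-request exclusion could be sketched roughly like this. It assumes each suggestion is a dict with "uri" and "score" keys and that the caller passes a list of unwanted URIs; the function name, the exclude parameter and the example.org URIs are hypothetical. The exclusion is applied before the limit, so the caller still gets up to limit results.

```python
from typing import Dict, Iterable, List, Union

Suggestion = Dict[str, Union[str, float]]


def apply_request_exclusions(
    suggestions: Iterable[Suggestion], exclude: Iterable[str], limit: int
) -> List[Suggestion]:
    """Remove the concepts named in this request, then keep the top results."""
    excluded = set(exclude)
    kept = [s for s in suggestions if s["uri"] not in excluded]
    # Sort by descending score before cutting, so removed concepts
    # do not use up slots in the final list.
    kept.sort(key=lambda s: s["score"], reverse=True)
    return kept[:limit]


candidates = [
    {"uri": "http://example.org/final-projects", "score": 0.95},
    {"uri": "http://example.org/a", "score": 0.6},
    {"uri": "http://example.org/b", "score": 0.7},
]
top = apply_request_exclusions(
    candidates, exclude=["http://example.org/final-projects"], limit=2
)
```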

Another use case would be to restrict the suggestions using the ontology hierarchy, e.g. to allow only physical objects or certain other groups of concepts. There could even be a UI component where a user could select the allowed or denied concepts in the hierarchy tree. That would be cool, but maybe not so useful.
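For the hierarchy idea, a minimal sketch could look like the following, assuming the narrower relations of the vocabulary are available as a plain dict from a concept URI to its child URIs. That data structure and both function names are assumptions for illustration, not something Annif provides in this form.

```python
from typing import Dict, Iterable, List, Set


def descendants(root: str, narrower: Dict[str, List[str]]) -> Set[str]:
    """Collect root and every concept reachable through narrower links."""
    seen = {root}
    stack = [root]
    while stack:
        current = stack.pop()
        for child in narrower.get(current, []):
            if child not in seen:  # also guards against cycles
                seen.add(child)
                stack.append(child)
    return seen


def restrict_to_branch(
    uris: Iterable[str], root: str, narrower: Dict[str, List[str]]
) -> List[str]:
    """Keep only the concepts that fall under the given branch."""
    allowed = descendants(root, narrower)
    return [uri for uri in uris if uri in allowed]
```

A denied branch would be the mirror image: compute the same descendant set and drop the concepts that are in it.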