NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org
Other
195 stars 41 forks source link

Refactor: use sparse matrices instead of ListSuggestionResult and VectorSuggestionResult #678

Closed osma closed 1 year ago

osma commented 1 year ago

Currently the annif.suggestion module defines two main classes for representing suggestion results, ListSuggestionResult and VectorSuggestionResult. Backends generally use either of these (usually not both) when returning results.

But now that we have started implementing batched suggestions in backends, these representations have become a bit awkward, as they are specific to one document and perhaps also needlessly complicated (often suggestions have to be converted back and forth between the two representations).

I think it would make sense to try to replace both of these classes with a single class, maybe called SuggestionBatch, that can represent the suggestion results for a whole batch of documents (up to 32). It could be implemented using a SciPy sparse matrix or perhaps a sparse array, as arrays are nowadays recommended by SciPy.

This is a rather intrusive refactoring:

Regarding how to tackle this: I think it would make sense to implement this first on the "outside", that is, changing the return type of AnnifProject.suggest to the new representation, since it already returns a batch, though currently it's a list of {List,Vector}SuggestionResult objects. The eval, CLI and REST code that relies on AnnifProject.suggest must of course be changed too. After this is working, the last step is to change the backends to return the new representation.