NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Support for topic hierarchies #316

Open wetneb opened 5 years ago

wetneb commented 5 years ago

Many topic classification systems (such as the Dewey Decimal Classification or HAL's topic hierarchy) are organized into trees of classes rather than flat lists.

Are you aware of any subject prediction models which take into account this hierarchical structure? Do you have any plans to add support for them?

We can use models for flat classifications by taking only the leaves of the hierarchical classification, for instance. But as a user I would like the system to also be able to predict coarser classes (that is, internal nodes in the classification tree) when it is not confident enough to pick a precise leaf.
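To make the request concrete, here is a minimal sketch of one possible back-off rule on top of a flat leaf classifier. It is not an existing Annif feature. The toy hierarchy, the function names and the `threshold` value are all assumptions made for illustration. Leaf probabilities are summed up the tree, and the deepest node whose accumulated probability reaches the threshold is returned.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "science": None,
    "physics": "science",
    "optics": "physics",
    "acoustics": "physics",
    "biology": "science",
}


def depth(node, parent=PARENT):
    """Number of edges between a node and the root."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def aggregate(leaf_probs, parent=PARENT):
    """Add each leaf probability to the leaf itself and to all its ancestors."""
    totals = defaultdict(float)
    for leaf, prob in leaf_probs.items():
        node = leaf
        while node is not None:
            totals[node] += prob
            node = parent[node]
    return dict(totals)


def predict_with_backoff(leaf_probs, threshold=0.6, parent=PARENT):
    """Return the deepest node whose accumulated probability reaches the
    threshold, or None if no node qualifies (assumed rule, for illustration)."""
    totals = aggregate(leaf_probs, parent)
    confident = [node for node, prob in totals.items() if prob >= threshold]
    if not confident:
        return None
    # Prefer deeper (more specific) nodes; break ties by accumulated probability.
    return max(confident, key=lambda node: (depth(node, parent), totals[node]))
```

With leaf scores split between `optics` and `acoustics`, neither leaf reaches the threshold on its own, so the rule falls back to their shared parent `physics`. When one leaf clearly dominates, that leaf is returned.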

osma commented 5 years ago

You are absolutely right. Currently Annif stores vocabularies as flat lists, so the hierarchy (e.g. from a SKOS file) is lost. That could be fixed, but the larger issue is that most algorithms for subject indexing and document classification only consider a flat list of categories/subjects/classes. I'm sure there are some that can make use of a hierarchical structure, but I haven't come across anything that would be suitable for integration with Annif. If you have some specific models in mind, please add a comment here.

Maui has some support for hierarchies, but only at a very rudimentary level. It takes into account broader/narrower and related links between concepts when deciding which subjects are most relevant for a particular document. Subject candidates that are related to other candidates (through any type of relationship) may be scored higher, though this depends on how well the heuristic worked in the model building phase.
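As a rough illustration of this kind of heuristic (not Maui's actual implementation, which learns how much weight such a feature deserves during training), the sketch below raises the score of each candidate for every other candidate it is linked to. The function name `boost_related`, the fixed `bonus` factor and the example concepts are hypothetical.

```python
def boost_related(scores, related, bonus=0.1):
    """Multiply each candidate's score by (1 + bonus * n), where n is the
    number of other candidates it is linked to by any relation type.

    scores:  dict mapping concept -> base relevance score
    related: set of frozenset({a, b}) pairs for broader/narrower/related links
    """
    boosted = {}
    for concept, score in scores.items():
        links = sum(
            1
            for other in scores
            if other != concept and frozenset((concept, other)) in related
        )
        boosted[concept] = score * (1 + bonus * links)
    return boosted


def ranking(scores):
    """Concepts ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical example: "lasers" is linked to "optics" but is not a candidate,
# so it contributes nothing to the boost.
links = {frozenset(("optics", "physics")), frozenset(("optics", "lasers"))}
candidates = {"optics": 0.5, "physics": 0.4, "cooking": 0.42}
reranked = ranking(boost_related(candidates, links))
```

In this example `physics` starts below `cooking` but moves above it after the boost, because it is linked to another candidate (`optics`) while `cooking` is not.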

wetneb commented 5 years ago

I am not aware of any model that does that. Thanks for the pointer to the Maui tools, it is interesting!

osma commented 5 years ago

Maui is a separate project, but there is MauiService which can be used from Annif: https://github.com/NatLibFi/Annif/wiki/Backend%3A-Maui

wetneb commented 5 years ago

If I had time I would be interested to review the literature to see if there is any nice probabilistic model for this sort of setting.

It might just be that there is no real benefit in using a hierarchy in this sort of model: the simplicity of assuming a flat list of topics might outweigh the benefits of handling the hierarchy.

osma commented 5 years ago

Based on a quick scan of the paper, this looks like a relevant and sensible survey of approaches to hierarchical classification:

Silla, C. N., & Freitas, A. A. (2011). A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1-2), 31-72. https://doi.org/10.1007/s10618-010-0175-9

I'd be interested in hearing about practical implementations, especially open source software projects, preferably in Python (so they're easy to integrate with Annif).

Of course it would be possible to implement one or more of the methods described in the above paper using e.g. sklearn, but it's a lot more work that way.

osma commented 5 years ago

The sklearn-hierarchical-classification project seems to be exactly what is needed here. It's an open source (Apache license) Python module, implemented with sklearn and based on the above-mentioned paper by Silla & Freitas.

Would you like to give it a spin @wetneb, using your own data sets? It would be good to know if it works for you, and then we could consider integrating it with Annif.

wetneb commented 5 years ago

@osma many thanks for the pointer, it looks perfect indeed! I would be very interested to give it a go (but for my own curiosity mainly, so it might not happen very soon).