NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Tfidf backend should ignore subjects that are not part of the training data #531

Open thomaslow opened 3 years ago

thomaslow commented 3 years ago

Hi, I'm currently looking into various subject classification algorithms supporting subject hierarchies and did some initial tests with Annif and its backends. I discovered a minor problem with the tfidf vectorization implementation.

I first observed the issue when comparing evaluation results of the tfidf backend after loading the subject vocabulary either from a TSV file or from a SKOS Turtle file. The evaluation results were not exactly the same, even though the training and test data were identical in both cases.

It seems that unused subjects (subjects that are not part of the training data but are present in the vocabulary, e.g. because of their broader, narrower, or related relationships to other subjects) are still added as empty buffers to the scikit-learn TfidfVectorizer. The resulting tfidf vector is a zero vector, so all predictions (cosine similarities) for such a subject are 0. However, the inverse document frequency of each term is calculated using the higher total number of subjects, including the unused ones.

To my knowledge, this will (slightly) reduce the effectiveness of the tfidf backend in distinguishing rare terms from frequent terms, and, in the case of very large SKOS files with many thousands of unused subjects, might even negatively impact its predictive performance.

A possible solution would be to filter out empty subjects before calling fit_transform. However, an additional subject index needs to be kept in order to remember which score (vector index) belongs to which subject.
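A rough sketch of that idea, with invented subject IDs and texts (not Annif's actual data structures):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented mapping: subject ID -> concatenated training text (may be empty)
subject_texts = {0: "cat feline pet", 1: "", 2: "dog canine pet", 3: ""}

# Filter out empty subjects, but remember the original subject IDs so that
# matrix rows can be mapped back to subjects later.
used_ids = [sid for sid, text in subject_texts.items() if text.strip()]
corpus = [subject_texts[sid] for sid in used_ids]

vectorizer = TfidfVectorizer()
subject_matrix = vectorizer.fit_transform(corpus)

# At prediction time, score i maps back to subject used_ids[i]
query = vectorizer.transform(["a text about a pet cat"])
scores = cosine_similarity(query, subject_matrix)[0]
results = {used_ids[i]: score for i, score in enumerate(scores)}
```

Only the used subjects (0 and 2 here) get scores, and the IDF statistics are computed over the used subjects alone.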

Cheers, Thomas

osma commented 3 years ago

Hi @thomaslow , thank you for the issue report. You're right that the tfidf backend builds a model with all the subjects, even those not referenced in training data.

The tfidf backend is really quite simple and intended to be a first stepping stone towards more advanced backends. It's easy to set up and fast to train, but not really expected to give very good results in terms of quality.

Would you by any chance be interested in implementing a change to the tfidf backend with your proposed solution to the problem (filtering out empty subjects and maintaining a mapping between index IDs and subject IDs)? We're always very happy to accept pull requests.

thomaslow commented 3 years ago

Hi @osma, I'm sorry, I don't think I will have the time. As you said, tfidf is just a first step. I only mentioned it because I was surprised that the tfidf backend did not produce exactly the same results for the same training and test data.

At the moment I'm mostly experimenting with different algorithms and approaches that can learn from a hierarchy of subjects. Annif helped a lot to get an overview of the different backends and to run some first experiments. I even wrote a small Python script and a custom AnnifProject class to evaluate and compare multiple Annif backends with other approaches.

Unfortunately, many features that are important for my use case are still missing in Annif (cross validation, metrics that take the subject hierarchy into account, document metadata, etc.). So, at the moment, I'm working on putting these pieces together in a separate Python module.

osma commented 3 years ago

Thanks, I understand. Good to hear that you're also experimenting with other backends. I recommend taking a close look at Omikuji, since at least for us it has consistently achieved good results in very different scenarios (multiclass or multilabel, small or large vocabularies, and so on).

I'd be curious to hear more about the features you are missing in Annif. Would it be possible for you to open new issues requesting them to be added? I can't promise we will implement them (and PRs are very welcome, as I said above!), but just defining the feature would be an important first step in that process. There may be others in the community who have similar needs and could also chime in and perhaps help out.

For cross validation, I've thought that a CLI command like annif xval my-project --folds 5 path/to/corpus could be possible to implement. I seem to remember that Maui had a command like this.
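As a sketch of what such a command might do internally (this is not how Annif currently works; the corpus below is just a list of placeholder strings), scikit-learn's KFold could drive the splitting:

```python
from sklearn.model_selection import KFold

# Placeholder corpus standing in for a real document corpus
docs = [f"document-{i}" for i in range(10)]

# Split into 5 folds; each document lands in the held-out set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_results = []
for fold, (train_idx, test_idx) in enumerate(kf.split(docs)):
    train = [docs[i] for i in train_idx]
    held_out = [docs[i] for i in test_idx]
    # Here one would train a temporary project on `train` and collect
    # eval-style metrics on `held_out`, then average over folds.
    fold_results.append((fold, len(train), len(held_out)))
```

The command would then report the metrics averaged over the five folds.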

Regarding metrics for hierarchies, there is an open issue #466 "Implement hierarchical precision, recall and F1 scores" - would that be relevant to you? Perhaps you could comment on the issue itself?
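For reference, one common formulation along those lines extends both the gold and predicted subject sets with all their ancestors before computing precision and recall; the tiny vocabulary below is invented for illustration:

```python
# Invented ancestor table: subject -> set of all its broader subjects
ancestors = {
    "siamese": {"cat", "animal"},
    "cat": {"animal"},
    "dog": {"animal"},
    "animal": set(),
}

def extend_with_ancestors(subjects):
    """Add every ancestor of every subject to the set."""
    extended = set(subjects)
    for s in subjects:
        extended |= ancestors.get(s, set())
    return extended

def hierarchical_prf(gold, predicted):
    """Hierarchical precision, recall and F1 over ancestor-extended sets."""
    g = extend_with_ancestors(gold)
    p = extend_with_ancestors(predicted)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Predicting "cat" for a document whose gold subject is "siamese" then gets partial credit (precision 1.0, recall 2/3) instead of counting as a complete miss.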

What do you mean by document meta data?

Also, if you discover algorithms that work well in your use case (learning from a hierarchy of subjects), it would be great if you could tell us more about your findings and perhaps suggest including them as Annif backends.

thomaslow commented 3 years ago

Ok, I'll add a few issues about said features.