LSI backend

osma commented 6 years ago

We are currently using Gensim only for the basic TF-IDF backend. It should be almost trivial to create an LSI backend: it needs just one extra LsiModel layer and a single parameter (the number of dimensions).

LDA would be possible too, but I'll leave that for another issue.

osma commented 5 years ago

Evaluation results with the code in #219 were so bad that I don't think it makes sense to continue in this direction. LSI makes more sense when there are no predefined subjects. It might still be useful for small classifications though.

osma commented 5 years ago

Here are the evaluation results:

2018-11-27 LSI model for Annif

Created a first implementation of an LSI model. Set up four projects with num_topics = (100, 200, 400, 800). Loaded the yso-fi vocabulary and trained each model (in parallel, on 4 CPU cores) using the yso-finna-fi corpus. Had to kill the 800-topic one because the system started swapping.

lsi-fi-100 model built in ~35 min CPU time (with some parallel processing)
lsi-fi-200 model built in ~41 min CPU time
lsi-fi-400 model built in ~60 min CPU time, peak memory usage ~6.8GB but usually ~5.4GB

Evaluated on kirjastonhoitaja (tfidf F1@5 = 0.22):
lsi-fi-100 F1@5 0.05287335527720144
lsi-fi-200 F1@5 0.07323910064294681
lsi-fi-400 F1@5 0.09448403253996848 peak mem ~2.5GB

Not very promising…

- Results improve with more topics, but not that much.
- LSI models with >400 topics are probably not realistic.
- Could be tested on classifications instead of YSO.
- Could explore how limiting the vocabulary affects resource usage & results.
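For readers unfamiliar with the metric, here is a small sketch of how an F1@5 score like those reported above can be computed for one document. It uses one common definition (precision and recall over the top 5 suggestions); the exact computation used in the evaluation may differ, and the function name and sample data are hypothetical.

```python
def f1_at_k(ranked_subjects, gold_subjects, k=5):
    """F1 score of the top-k ranked suggestions against the gold subjects."""
    top_k = ranked_subjects[:k]
    hits = sum(1 for subject in top_k if subject in gold_subjects)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold_subjects)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of the top 5 suggestions are correct, 3 gold subjects.
score = f1_at_k(["a", "b", "c", "d", "e", "f"], {"a", "c", "z"})
```

Averaging this per-document score over an evaluation set gives a single figure comparable across models.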

NatLibFi / Annif

LSI backend #201