Open osma opened 6 years ago
Evaluation results with the code in #219 were so bad that I don't think it makes sense to continue in this direction. LSI makes more sense when there are no predefined subjects. It might still be useful for small classifications though.
Here are the evaluation results:
2018-11-27 LSI model for Annif
Created a first implementation of the LSI model. Set up four projects with num_topics = (100, 200, 400, 800). Loaded the yso-fi vocab and trained each model (in parallel, on 4 CPU cores) using the yso-finna-fi corpus. Had to kill the 800-topic one because the system started swapping.
lsi-fi-100 model built in ~35 min CPU time (with some parallel processing)
lsi-fi-200 model built in ~41 min CPU time
lsi-fi-400 model built in ~60 min CPU time, peak memory usage ~6.8GB but usually ~5.4GB
Evaluated on kirjastonhoitaja (tfidf f1@5=0.22):
lsi-fi-100 F1@5 0.05287335527720144
lsi-fi-200 F1@5 0.07323910064294681
lsi-fi-400 F1@5 0.09448403253996848, peak mem ~2.5GB
Not very promising…
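For anyone reading these numbers without context, here is a rough sketch of what an F1@5 score measures: the F1 of the top 5 suggested subjects against the gold-standard subjects, averaged over documents. The function names `f1_at_k` and `mean_f1_at_k` are made up for illustration, and the per-document averaging is an assumption; the exact way the evaluation code aggregates scores is not stated in this thread.

```python
def f1_at_k(ranked, gold, k=5):
    """F1 of the top-k ranked suggestions against a set of gold subjects.

    Assumes `ranked` contains no duplicate subjects.
    """
    top = ranked[:k]
    gold = set(gold)
    hits = sum(1 for subject in top if subject in gold)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1_at_k(docs, k=5):
    """Average F1@k over (ranked_suggestions, gold_subjects) pairs.

    Assumption: scores are averaged per document.
    """
    if not docs:
        return 0.0
    return sum(f1_at_k(ranked, gold, k) for ranked, gold in docs) / len(docs)


# Hypothetical toy data, not taken from the kirjastonhoitaja test set
docs = [
    (["a", "b", "c", "d", "e"], {"a", "x"}),   # 1 hit out of 5 suggestions
    (["p", "q", "r", "s", "t"], {"y", "z"}),   # no hits
]
print(mean_f1_at_k(docs))
```

With a metric like this, a score below 0.1 means that on average well under one of the five suggestions matches a gold subject.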
We are currently using Gensim only for the basic TF-IDF backend. It should be almost trivial to create an LSI backend: it is just one extra LsiModel layer and a single parameter (the number of dimensions).
LDA would be possible too, but I'll leave that for another issue.