Topic labeling

Currently, the score of each word per cluster is computed like this:

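\mathrm{score}_{\mathrm{cluster}}(\mathrm{word}) = \sum_{\mathrm{doc} \in \mathrm{cluster}} \text{TF-IDF}(\mathrm{word}, \mathrm{doc}) \cdot \mathrm{distance}(\mathrm{centroid}, \mathrm{doc})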

Where

- IDF is computed over the entire corpus
- distance is the cosine distance between the cluster centroid and the document embedding
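
For discussion, here is a minimal pure-Python sketch of the score as described above. It is not the project's actual implementation: the function names, the token-list document format, and the toy data are all made up for illustration, and the smoothed IDF formula (the one scikit-learn uses by default) is an assumption. The distance is taken literally as 1 minus cosine similarity, and embeddings are assumed to be non-zero vectors.

```python
import math
from collections import Counter


def idf_over_corpus(corpus_tokens):
    """Smoothed IDF per word, computed over the entire corpus (assumed formula)."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(word for doc in corpus_tokens for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def cosine_distance(u, v):
    """Cosine distance, i.e. 1 - cosine similarity (assumes non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def cluster_word_scores(cluster_tokens, cluster_embeddings, centroid, idf):
    """Sum TF-IDF(word, doc) * distance(centroid, doc) over the cluster's documents."""
    scores = Counter()
    for tokens, embedding in zip(cluster_tokens, cluster_embeddings):
        weight = cosine_distance(centroid, embedding)
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)  # relative term frequency within the document
            scores[word] += tf * idf[word] * weight
    return scores


# Toy data: three tokenized documents, the first two form one cluster.
corpus = [["bug", "crash", "login"], ["bug", "fix", "login"], ["docs", "typo"]]
cluster = corpus[:2]
embeddings = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-d document embeddings
centroid = [0.5, 0.5]                  # mean of the two embeddings

idf = idf_over_corpus(corpus)
scores = cluster_word_scores(cluster, embeddings, centroid, idf)
print(scores.most_common(3))
```

Note that with a distance (rather than a similarity) as the weight, documents further from the centroid contribute more to the score.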

Alternative options to investigate:

- compute TF-IDF from the documents in the cluster instead of the entire corpus
- try different methods for computing the distance
- assign different weights to TF-IDF and to distance
- LLM-generated topic descriptions, as used by BERTopic; also see this Colab notebook
[...]
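
To make the first and third options more concrete, here is a rough sketch. Everything in it is hypothetical: `smoothed_idf`, `weighted_term`, and the exponents `alpha` and `beta` are just one possible way to weight the two factors, and none of it comes from the current code.

```python
import math
from collections import Counter


def smoothed_idf(docs):
    """Smoothed IDF over whatever document set is passed in (corpus or cluster)."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def weighted_term(tfidf, distance, alpha=1.0, beta=1.0):
    """Hypothetical weighting: exponents set the influence of each factor.

    alpha=1, beta=1 reproduces the plain product; beta=0 ignores the distance.
    """
    return (tfidf ** alpha) * (distance ** beta)


corpus = [["bug", "crash", "login"], ["bug", "fix", "login"], ["docs", "typo"]]
cluster = corpus[:2]

idf_corpus = smoothed_idf(corpus)    # current behaviour: IDF from the entire corpus
idf_cluster = smoothed_idf(cluster)  # alternative: IDF from the cluster only

for word in ("bug", "crash"):
    print(word, round(idf_corpus[word], 3), round(idf_cluster[word], 3))
```

One thing to watch with cluster-local IDF: a word that appears in every document of the cluster gets the lowest possible IDF, although such words may be exactly the ones that describe the cluster.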

Design idea:

- extract a KeywordExtractor class that holds a TfIdfVectorizer and a Corpus object
  - fit(): fit the vectorizer to the documents in the corpus
  - top_words(corpus): extract the top words for a corpus. The vectorizer is fitted on KeywordExtractor.corpus, whereas the corpus argument can be any (sub-)corpus.
- remove the Corpus._vectorizer attribute
  - introduce a Corpus.top_words attribute to store the top words
  - re-consider how the Corpus.label attribute is used and set
  - introduce Corpus.collection and/or more details about how the corpus was retrieved
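
A rough skeleton of what this could look like, to make the idea easier to discuss. The `Corpus` dataclass and `SimpleTfIdfVectorizer` below are simplified stand-ins, not the project's real classes, and the `n` parameter and the ranking by summed TF-IDF are assumptions. The distance weighting is left out to keep the sketch short.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Corpus:
    """Hypothetical stand-in for the project's Corpus class (no _vectorizer)."""
    documents: list                                # each document is a list of tokens
    label: str = ""
    top_words: list = field(default_factory=list)  # proposed new attribute


class SimpleTfIdfVectorizer:
    """Toy stand-in for the real vectorizer; only learns smoothed IDF values."""

    def fit(self, documents):
        n_docs = len(documents)
        doc_freq = Counter(w for doc in documents for w in set(doc))
        self.idf_ = {w: math.log((1 + n_docs) / (1 + df)) + 1
                     for w, df in doc_freq.items()}
        return self

    def transform(self, document):
        """Return {word: tf * idf}; words unseen during fit are ignored."""
        counts = Counter(document)
        return {w: c / len(document) * self.idf_[w]
                for w, c in counts.items() if w in self.idf_}


class KeywordExtractor:
    def __init__(self, corpus, vectorizer=None):
        self.corpus = corpus
        self.vectorizer = vectorizer or SimpleTfIdfVectorizer()

    def fit(self):
        """Fit the vectorizer on the documents of self.corpus."""
        self.vectorizer.fit(self.corpus.documents)
        return self

    def top_words(self, corpus, n=3):
        """Rank the words of any (sub-)corpus by summed TF-IDF."""
        totals = Counter()
        for doc in corpus.documents:
            for word, value in self.vectorizer.transform(doc).items():
                totals[word] += value
        # Sort by descending score, then alphabetically to break ties.
        ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
        return [word for word, _ in ranked[:n]]


full = Corpus(documents=[["bug", "crash", "login"],
                         ["bug", "fix", "login"],
                         ["docs", "typo"]])
sub = Corpus(documents=full.documents[:2])  # e.g. the documents of one cluster

extractor = KeywordExtractor(full).fit()    # IDF comes from the full corpus
sub.top_words = extractor.top_words(sub, n=2)
print(sub.top_words)
```

With this split, the Corpus only stores the result in top_words, and the fitted vectorizer lives in the KeywordExtractor, so the same fitted extractor can be reused for every sub-corpus.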