Closed carschno closed 1 month ago
Currently, the score of each word per cluster is computed like this:
$$\mathrm{score}_{\mathrm{cluster}}(\mathrm{word}) = \sum_{\mathrm{doc} \in \mathrm{cluster}} \text{TF-IDF}(\mathrm{word}) \cdot \mathrm{distance}(\mathrm{centroid}, \mathrm{doc})$$
Where `distance` is the cosine distance between the centroid and the document embedding.

Alternative options to investigate:
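As a baseline for comparing alternatives, the current formula can be sketched in plain Python. This is a minimal sketch, not the project's code. The function names `cosine_distance` and `cluster_word_scores` are hypothetical, the TF-IDF weights are assumed to be precomputed per document, and the centroid is assumed to be the mean of the cluster's document embeddings.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_word_scores(tfidf_rows, embeddings):
    """Score each word of one cluster as the sum over documents of TF-IDF * distance.

    tfidf_rows: one dict per document, mapping word -> TF-IDF weight (assumed precomputed).
    embeddings: one embedding vector per document, in the same order as tfidf_rows.
    """
    dim = len(embeddings[0])
    # Assumption: the centroid is the mean of the document embeddings in the cluster.
    centroid = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    scores = {}
    for weights, embedding in zip(tfidf_rows, embeddings):
        distance = cosine_distance(centroid, embedding)
        for word, weight in weights.items():
            scores[word] = scores.get(word, 0.0) + weight * distance
    return scores


# Toy input: two documents with 2-dimensional embeddings.
example_scores = cluster_word_scores(
    [{"a": 1.0, "b": 2.0}, {"a": 3.0}],
    [[1.0, 0.0], [0.0, 1.0]],
)
```

Because the weight is a distance rather than a similarity, documents farther from the centroid contribute more to a word's score in this formulation.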
Design idea:

- `KeywordExtractor` class that holds a `TfIdfVectorizer` and `Corpus` object
  - `fit()`: fit vectorizer to documents in corpus
  - `top_words(corpus)`: extract top words for a `corpus`. The vectorizer is fitted on `KeywordExtractor.corpus`, whereas the `corpus` can be any (sub-)corpus.
- `Corpus._vectorizer` attribute
- `Corpus.top_words` attribute to store top words
- `Corpus.label` attribute is used and set
- `Corpus.collection` and/or more details about how corpus was retrieved
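The design idea could look roughly like the following sketch. It is an illustration under stated assumptions, not the project's API. The `Corpus` class here is a hypothetical stand-in that holds only tokenized documents, a label, and `top_words`. A hand-rolled smoothed IDF table replaces the `TfIdfVectorizer` so the example stays self-contained. Words are ranked by summed term frequency times IDF, without the distance weighting from the current formula. The `n` parameter and the `_idf` attribute are also assumptions.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Corpus:
    """Hypothetical stand-in for the project's Corpus: tokenized documents plus metadata."""

    documents: list  # each document is a list of tokens
    label: str = ""
    top_words: list = field(default_factory=list)  # filled by KeywordExtractor.top_words()


class KeywordExtractor:
    """Holds a corpus and the IDF statistics fitted on it."""

    def __init__(self, corpus):
        self.corpus = corpus
        self._idf = {}

    def fit(self):
        """Fit IDF weights on the documents of self.corpus."""
        n_docs = len(self.corpus.documents)
        doc_freq = Counter(w for doc in self.corpus.documents for w in set(doc))
        # Smoothed IDF; the smoothing variant is an arbitrary choice for this sketch.
        self._idf = {
            w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()
        }
        return self

    def top_words(self, corpus, n=3):
        """Rank the words of any (sub-)corpus using the IDF fitted on self.corpus."""
        term_freq = Counter(w for doc in corpus.documents for w in doc)
        scores = {w: tf * self._idf.get(w, 0.0) for w, tf in term_freq.items()}
        # Sort by descending score, breaking ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        corpus.top_words = ranked[:n]
        return corpus.top_words


# Fit on the full corpus, then extract top words for a sub-corpus.
full = Corpus([["bank", "river", "water"], ["bank", "money", "loan"], ["bank", "loan", "rate"]])
cluster = Corpus(full.documents[1:], label="finance")
extractor = KeywordExtractor(full).fit()
cluster_top = extractor.top_words(cluster, n=2)
```

Keeping the fitted statistics on the extractor rather than on each `Corpus` means that sub-corpora share one vocabulary and one set of IDF weights, which is what makes their top words comparable.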