EDIT: Until this gets resolved I'm not sure whether I want this implemented at the moment. There seem to be some strange things going on with the evaluation, and although this implementation is not geared towards RAG, it doesn't feel right to continue working on it.
## What does this PR do?
The hybrid nature of BERTopic (Bag-of-Words and semantic representations) can be generalized even to the topic representations it creates, here using a modified version of BM42. It works as follows:
First, we extract the top n representative documents per topic. To do so, we randomly sample a number of candidate documents per cluster, controlled by the `nr_samples` parameter, and then select the top n representative documents by calculating the c-TF-IDF representation of those candidates.
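For illustration, here is a minimal sketch of that selection step, assuming a bag-of-words matrix `bow` for all documents and the topic's c-TF-IDF vector `ctfidf_topic`; the function name, arguments, and the cosine-similarity criterion are assumptions for this sketch, not the PR's actual internals:

```python
# Illustrative sketch of representative-document selection: sample candidates,
# then keep the documents whose representation is most similar to the topic's
# c-TF-IDF vector.
import random

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def top_representative_docs(docs, bow, ctfidf_topic, nr_samples=500, top_n=3):
    """Sample `nr_samples` candidates and keep the `top_n` most similar ones."""
    indices = random.sample(range(len(docs)), k=min(nr_samples, len(docs)))
    # Similarity between each candidate's bag-of-words and the topic vector
    sims = cosine_similarity(bow[indices], ctfidf_topic).ravel()
    best = np.argsort(sims)[::-1][:top_n]
    return [docs[indices[i]] for i in best]
```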
For all representative documents per topic, their attention matrix is calculated and the attention weights are summed per token. The summed weights are then multiplied by the IDF values of BERTopic's c-TF-IDF algorithm to get the final BM42 representation. These IDF values are either computed by fitting a new c-TF-IDF on the representative documents (`recalculate_idf=True`) or taken from the c-TF-IDF model that was trained on the entire corpus (`recalculate_idf=False`).
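To make the weighting concrete, here is a minimal sketch of computing BM42-style weights for a single document, assuming a Hugging Face transformer and a precomputed `idf` dictionary; the choice of the last layer's [CLS] attention and the token handling are assumptions for illustration, not necessarily what this implementation does:

```python
# Illustrative sketch of the BM42-inspired weighting: sum the attention that
# the [CLS] token pays to each token, then multiply by the c-TF-IDF IDF values.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")


def bm42_weights(doc: str, idf: dict[str, float]) -> dict[str, float]:
    inputs = tokenizer(doc, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    # Last layer's attention has shape (batch, heads, seq_len, seq_len);
    # take the [CLS] row (position 0), averaged over heads, as token importance.
    attention = outputs.attentions[-1][0, :, 0].mean(dim=0)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    weights: dict[str, float] = {}
    for token, weight in zip(tokens, attention.tolist()):
        if token in tokenizer.all_special_tokens:
            continue  # skip [CLS], [SEP], padding, etc.
        weights[token] = weights.get(token, 0.0) + weight
    # Multiply the summed attention weights by the IDF values
    return {token: w * idf.get(token, 0.0) for token, w in weights.items()}
```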
Thus, the algorithm follows some of the principles of BM42 but applies a few optimizations to speed up inference, and it uses the IDF values of c-TF-IDF. Usage is straightforward:
```python
from bertopic.representation import BM42Inspired
from bertopic import BERTopic

# Create your representation model
representation_model = BM42Inspired(
    "sentence-transformers/all-MiniLM-L6-v2",
    recalculate_idf=True
)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
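Fitting then proceeds as with any other BERTopic representation model; given a list of documents `docs`:

```python
# Standard BERTopic workflow; the BM42-inspired keywords replace the default
# c-TF-IDF keywords in the topic representations.
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic(0)  # inspect the keywords of the first topic
```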
## Before submitting

- [ ] This PR fixes a typo or improves the docs (if yes, ignore all other checks!).