ianrandman opened this issue 2 months ago
A very interesting approach to the problem! Thanks for sharing such an extensive description of the process.
> Keeping in mind my original goal, are there any apparent flaws in this approach or suggestions for improvement? I understand there are some [other methods](https://github.com/MaartenGr/BERTopic/issues/814) out there related to multiple topics per document, such as Topic Distributions or using the probabilities that are returned by `transform`ing my documents after all my steps, but I have not had much luck getting any sort of useful distribution, and a probability matrix is only returned on fit.
Interestingly, I think this might be solved a little more easily than your implementation. If I am not mistaken, what you are essentially doing is running cosine similarity between the documents and the zero-shot topics and assigning a single document to multiple zero-shot topics if the similarity exceeds a certain threshold.
Although your approach seems valid, it might be a bit easier if you look at BERTopic's `.fit` and `.transform` as two separate processes:

`.fit` is mainly used to derive the topic representations. It is meant to create reasonable (whatever that means) representations of the topics, and its main outputs are those representations, such as labels and words.

`.transform`, in contrast, is used to create the topic assignments, where the documents are actually assigned to their respective topics. As long as you are happy with the topic representations during `.fit`, regardless of whether the documents are correctly assigned to one or more topics, there is no need to go through your process. Instead, you would need to focus on the topic assignment, primarily by using a method like `.approximate_distribution`, which you mentioned does not give you a useful distribution.
My first question would of course be: why? Why isn't it useful to you, since it does return a probability matrix of sorts that you can use with a user-specified threshold?
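For reference, thresholding that matrix to get multiple topics per document might look like the sketch below. The distribution is mocked here; in practice it would come from `topic_model.approximate_distribution(docs)`, which returns the document-topic matrix as its first value.

```python
import numpy as np

# In practice this matrix would come from BERTopic:
#   topic_distr, _ = topic_model.approximate_distribution(docs)
# Here a (n_docs, n_topics) distribution is mocked for illustration.
topic_distr = np.array([
    [0.70, 0.25, 0.05],
    [0.40, 0.45, 0.15],
    [0.10, 0.10, 0.80],
])

threshold = 0.30  # user-specified cutoff
multi_topic_assignments = [
    np.flatnonzero(row >= threshold).tolist() for row in topic_distr
]
print(multi_topic_assignments)  # [[0], [0, 1], [2]] -- doc 1 gets two topics
```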
Having said that, you could also simply save the model using `safetensors` serialization and then load the model. What happens is that the underlying dimensionality reduction and cluster models are removed. Now, whenever you run `.transform`, it will use the cosine similarity between topic and document embeddings to generate the exact same similarity matrix that you created manually. You can use that output to assign a single document to multiple topics using the same threshold you specified for zero-shot topic modeling.
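A sketch of what that reload buys you: the save/load calls are shown as comments since they need a fitted model, and the `cosine_similarity` helper below is my own illustration of what the embedding-based `.transform` effectively computes, not BERTopic internals.

```python
import numpy as np

# The save/load trick (not executed here; requires a fitted BERTopic model):
#   topic_model.save("path/to/dir", serialization="safetensors")
#   loaded_model = BERTopic.load("path/to/dir")   # UMAP/HDBSCAN are dropped
#   topics, probs = loaded_model.transform(docs)  # now embedding-based

# What the embedding-based .transform boils down to: cosine similarity
# between document embeddings and topic embeddings.
def cosine_similarity(doc_embeddings, topic_embeddings):
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    t = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    return d @ t.T

doc_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
sims = cosine_similarity(doc_embeddings, topic_embeddings)

# Multi-topic assignment with the same zero-shot threshold:
assignments = [np.flatnonzero(row >= 0.7).tolist() for row in sims]
print(assignments)  # [[0], [1], [0, 1]] -- the last doc matches both topics
```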
The above is a bit of a hidden trick that I would like to make more visible. I hope in the coming months to have some time to create a variable in `.transform` that will allow you to select the method of prediction, for instance:

```python
topics, probs = topic_model.transform(documents, method="embeddings")
```
Hope this helps!
Goal
I am interested in fitting a BERTopic model using zero-shot topic modeling. I want it to be possible for documents to be assigned to more than one of my suggested topics. I have patched several BERTopic functions to enable this but wanted to get the author's opinion on correctness or alternatives.
The current implementation assigns documents to at most one suggested topic based on a specified cosine similarity threshold during model construction. If the threshold is met for a specific document, it is assigned to the topic with which it has the highest similarity.
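For concreteness, the current single-assignment rule can be sketched as follows. The similarity matrix is mocked; in BERTopic itself the threshold corresponds to the `zeroshot_min_similarity` parameter and the topics come from `zeroshot_topic_list`.

```python
import numpy as np

# Mocked (n_docs, n_zeroshot_topics) cosine similarities; in BERTopic these
# would be computed between document embeddings and the embeddings of the
# topics passed via BERTopic(zeroshot_topic_list=[...]).
similarities = np.array([
    [0.90, 0.88, 0.10],   # exceeds the threshold for two topics...
    [0.20, 0.30, 0.15],   # ...no topic reaches the threshold
])
threshold = 0.85  # zeroshot_min_similarity

# Each document gets at most ONE suggested topic: the argmax, and only if
# it passes the threshold (-1 stands in for "falls through to clustering").
best = similarities.argmax(axis=1)
assignments = np.where(similarities.max(axis=1) >= threshold, best, -1)
print(assignments.tolist())  # [0, -1] -- the second topic match is discarded
```

Note that document 0 is similar enough to two topics, but only the single highest-similarity topic survives, which is exactly the limitation the patches below work around.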
My Approach
My first change is to the `_zeroshot_topic_modeling` function. I calculate which topics each document matches with a similarity exceeding the specified threshold. Next, if a document has more than one match, additional copies of that document (and its embedding) are made as necessary (keeping the copies adjacent in the list of documents). Because this function does not have access to my documents outside of BERTopic, I set an instance variable that provides enough information to make copies of my documents and embeddings as needed.

In the
`_combine_zeroshot_topics` step, there is an occasional issue where the merged model's topics are set as an `np.ndarray` rather than a `list`, which causes problems later on. Fixing this is my second patch.

The next step is to
`reduce_topics`. Here, multiple zero-shot topics may essentially be merged, causing the new topics to sometimes contain duplicated documents. I have patched `_reduce_to_n_topics` just before `documents.Topic = new_topics` to remove duplicate documents within a topic, based on external document IDs I provide via an instance variable: I keep only unique `(topic_id, external_document_ID)` pairs. That instance variable with external document IDs is updated to a (possibly) reduced list of IDs, which I use outside of BERTopic to update my list of documents and embeddings.

The next step is
`reduce_outliers`, where `documents` is my expanded list of documents (with potential duplicates) after fitting. I do not believe there is any risk here from duplicated documents, because any outliers that get reclassified only had one copy anyway.

The last step is
`update_topics`, using the updated documents list and topic IDs from the `reduce_outliers` step. Because there is no reorganization of topics, I believe there is no risk here from duplicated documents.

After all this, I have postprocessing to determine the list of topics for each of my original documents.
Questions
Keeping in mind my original goal, are there any apparent flaws in this approach or suggestions for improvement? I understand there are some other methods out there related to multiple topics per document, such as Topic Distributions or using the probabilities that are returned by `transform`ing my documents after all my steps, but I have not had much luck getting any sort of useful distribution, and a probability matrix is only returned on fit.

One alternative I thought of is to update `topic_model.topics_` after fitting, based on a threshold and the probabilities, update my documents and embeddings accordingly, and then keep the `reduce_topics` patch to avoid duplicates. This would have the benefit of allowing multiple topics per document not just for the suggested topics but also for the ones that came from clustering. A downside is an additional threshold to specify.

Thoughts?
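That alternative might look roughly like the following; the probability matrix is mocked (in practice it would be the one returned on fit), and the second threshold is the extra knob mentioned above:

```python
import numpy as np

# Mocked stand-in for the (n_docs, n_topics) probability matrix from fit.
probs = np.array([
    [0.55, 0.40, 0.05],
    [0.10, 0.85, 0.05],
])
prob_threshold = 0.35  # the additional threshold this alternative requires

# Multi-topic assignments over ALL topics (suggested and clustered alike):
multi_topics = [np.flatnonzero(row >= prob_threshold).tolist() for row in probs]
print(multi_topics)  # [[0, 1], [1]] -- doc 0 now carries two topics

# Documents with more than one entry would then be duplicated, and the
# reduce_topics de-duplication patch keeps (topic, document) pairs unique.
```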