After we fit and transform the model, it produces the main topic assigned to each document. If we find that there's a document (e.g. document 1) whose main topic, as allocated by the model, is topic 10, but we find this document is better suited to another topic (e.g. topic 1), is it possible to reassign the main topic and train the model again to produce the subjectively better category? Is this what semi-supervised topic modelling does?
You can update the internal topics so that they match the topics you feel are better suited. However, this will not change the model itself. The underlying clustering model is HDBSCAN which, like many others, does not allow for manual selection of clusters.
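To illustrate, a minimal sketch of what that could look like, assuming a recent BERTopic version in which update_topics accepts custom topic assignments (the dataset, document index, and topic numbers are purely illustrative):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Illustrative dataset; replace with your own documents
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Fit the model as usual
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Manually reassign document 1 from its predicted topic to topic 1
new_topics = list(topics)
new_topics[1] = 1

# Recalculate the c-TF-IDF topic representations with the updated assignments;
# the underlying UMAP/HDBSCAN models remain unchanged
topic_model.update_topics(docs, topics=new_topics)
```

Note that this only updates the topic representations and assignments; the clustering itself is not re-trained.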
Is this what semi-supervised topic modelling does?
Not exactly. Semi-supervised topic modeling uses UMAP to nudge certain documents closer together during dimensionality reduction. However, this does not mean that it will force those labels.
Would the resulting topic predictions in the tutorial only have topics with keywords related to the pre-defined topic labels - i.e. computers - and ignore the other 15 categories?
It depends on which pre-defined topic labels you mean. More specifically, the keywords in a certain topic depend fully on the documents that occupy that topic. It does make a comparison between all other topics in order to decide the extent to which certain keywords are more or less important across topics.
What happens to the documents assigned the -1 label? Will these just be assigned topic -1, or could they end up in newly created topics as well? Example outputs for the tutorial would be great to compare between unsupervised and semi-supervised.
The -1 labels are completely ignored during the dimensionality reduction of UMAP. They essentially mean that we are not sure which possible label they can have. In other words, they can end up in any topic, including the outlier topic. What is important to realize is that this process is done before the actual clustering, so it is merely a step of reducing the dimensionality in such a way that certain documents become a bit closer to one another.
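For some intuition, a minimal sketch of how those partial labels are typically passed, loosely following the tutorial's 20 newsgroups setup (the choice of keeping only the computer-related categories is illustrative):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
docs = data["data"]

# Keep the labels of the computer-related categories only; every other
# document gets -1, meaning "no label available"
comp_categories = {i for i, name in enumerate(data["target_names"]) if name.startswith("comp")}
y = [label if label in comp_categories else -1 for label in data["target"]]

# The partial labels guide UMAP; documents labeled -1 are ignored during that
# nudging and are free to end up in any topic, including the outlier topic
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, y=y)
```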
Thank you very much for your answers so far.
Would the resulting topic predictions in the tutorial only have topics with keywords related to the pre-defined topic labels - i.e. computers - and ignore the other 15 categories?
It depends on which pre-defined topic labels you mean. More specifically, the keywords in a certain topic depend fully on the documents that occupy that topic. It does make a comparison between all other topics in order to decide the extent to which certain keywords are more or less important across topics.
Rephrasing my question above - in the semi-supervised topic modelling tutorial here, what would the resulting topics and their corresponding keywords after semi-supervised topic modelling look like compared to performing unsupervised topic modelling, a supervised topic modelling approach and a guided modelling approach on the same data?
Just another question! After training the model on a dataset about people's opinions on a new building project and saving the model, do you think it is appropriate for the loaded model to predict on a new dataset, e.g. a dataset about people's opinions on climate change, or do you recommend the loaded model be further trained and then used to predict on a new dataset each time?
Rephrasing my question above - in the semi-supervised topic modelling tutorial here, what would the resulting topics and their corresponding keywords after semi-supervised topic modelling look like compared to performing unsupervised topic modelling, a supervised topic modelling approach and a guided modelling approach on the same data?
If you have the labels, then I would highly advise testing it out on your own to get some intuition on what is happening! Having said that, it is difficult to say that a specific and stable set of things will be different in the output between the approaches. Instead, it might be worthwhile to start from how the technique works and then build your way up to what can potentially be different. So let's start! Remember that BERTopic uses the following process: documents are first embedded, the embeddings are reduced in dimensionality with UMAP, the reduced embeddings are clustered with HDBSCAN, and the topic representations are created with c-TF-IDF.
At the moment (v0.13 will change some things), semi-supervised topic modeling is rather similar to supervised topic modeling in that both use UMAP to guide the dimensionality reduction process with the labels you provide. As a result, points with the same labels are nudged towards the same space in lower dimensionality, and a clustering algorithm then tends to find those points as clusters. This process can be described as nudging or guiding. Although we provide labels to guide the dimensionality reduction process, there is no direct influence on the clustering. So if the clustering algorithm finds more fine-grained topics within a specific label, then it will return those instead of the labels that were defined. The difference between semi-supervised and supervised is that this process is run for either some documents or all documents, respectively. In v0.13, both processes will be referred to as semi-supervised topic modeling, whereas supervised topic modeling will become a classification task instead of a clustering task. Thus, the process of semi-supervised topic modeling is as follows: documents are embedded, UMAP reduces the dimensionality whilst using the provided labels (with -1 for unlabeled documents) to nudge same-labeled documents together, HDBSCAN clusters the reduced embeddings, and c-TF-IDF creates the topic representations.
With guided topic modeling we are not looking at labels for individual documents, but instead describe sets of terms that we want to assign to certain documents. The process is a bit more complex, but in essence it tries to find the documents that describe those terms best and nudges them toward each other. Instead of giving them labels, we nudge these documents by averaging the embeddings of those documents with the embeddings of the terms that describe them best, as in the sketch below.
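A minimal sketch of that setup, using BERTopic's seed_topic_list parameter (the seed terms and dataset below are illustrative only):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Each inner list is a set of seed terms describing a topic we want to nudge
# documents towards; the terms are illustrative only
seed_topic_list = [
    ["drug", "cancer", "drugs", "doctor"],
    ["windows", "drive", "dos", "file"],
    ["space", "launch", "orbit", "lunar"],
]

# Documents most similar to a set of seed terms are nudged towards each other
# by averaging their embeddings with the seed term embeddings
topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)
```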
How exactly they differ in their output depends on many things, including how you initialized the labels and how you initialized the seeded topics. In practice, you might expect that a (semi-)supervised approach tends to nudge the topic creation towards those labels but has quite some freedom in how micro-clusters are created. In contrast, with guided topic modeling you have a more direct influence on the training data, namely the embeddings, as well as on the keywords that you end up with.
After training the model on a dataset about people's opinions on a new building project and saving the model, do you think it is appropriate for the loaded model to predict on a new dataset, e.g. a dataset about people's opinions on climate change, or do you recommend the loaded model be further trained and then used to predict on a new dataset each time?
If you expect that the new dataset contains topics that were not found in the original dataset, then it would be worthwhile to either include that data if possible or look towards online topic modeling as an alternative.
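Keep in mind that simply calling .transform() on new documents only maps them to topics that were already found, so genuinely new topics will not appear. If online topic modeling is the route you take, a minimal sketch of what it could look like, with illustrative component choices and parameter values (any models supporting partial_fit can be plugged in):

```python
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import IncrementalPCA

# Illustrative dataset; replace with your own documents
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Swap the default UMAP/HDBSCAN for components that support partial_fit so the
# model can keep learning as new datasets (e.g. climate-change opinions) arrive
topic_model = BERTopic(
    umap_model=IncrementalPCA(n_components=5),
    hdbscan_model=MiniBatchKMeans(n_clusters=50, random_state=0),
    vectorizer_model=OnlineCountVectorizer(stop_words="english", decay=0.01),
)

# Feed the documents in chunks; a later dataset is simply passed as more chunks
for i in range(0, len(docs), 1000):
    topic_model.partial_fit(docs[i:i + 1000])
```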
Thank you so much for the detailed answers, I really appreciate the effort! It's very useful.
Hi Maarten,
I just have a couple of questions:
On the topic of semi-supervised, you mentioned in your tutorial: "In semi-supervised topic modeling, we only have some labels for our documents. The documents for which we do have labels are used to somewhat guide BERTopic to the extraction of topics for those labels. The documents for which we do not have labels are assigned a -1."
Thanks so much!