MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Using the model for document predictions #543

Closed drob-xx closed 2 years ago

drob-xx commented 2 years ago

Hi Maarten,

I know that this question has come up innumerable times, and I've been scanning through the issues, but I just want to make sure I'm not missing anything. If we want to get a rating of the dominant topic for each document, do we just use the matrix produced when calculate_probabilities is set to True? If I got that wrong, could you point out issue threads where this is addressed?

Also, somewhat related--In the last couple of months I remember reading a post (Medium??) where the author used BERTopic as part of a larger process to develop topic keywords and then used those words to produce document probabilities (represented by TfIDF scores?) to categorize individual documents - does any of that ring a bell? I can't find the link anywhere.

Thanks in advance!

MaartenGr commented 2 years ago

If we want to get a rating of the dominant topic for each document, do we just use the matrix produced when calculate_probabilities is set to True?

Yes, that is definitely a common way of approaching this specific use case. By setting calculate_probabilities to True, we calculate the probability of each topic for each document. That way, we have a rough indication of which topics are most likely to dominate a document.
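As a rough sketch of reading the dominant topic off such a matrix: the values below are made-up toy data standing in for the probabilities, and dominant_topics is a hypothetical helper, not part of BERTopic. It assumes one row per document and one column per topic.

```python
from typing import List, Sequence, Tuple


def dominant_topics(probs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (column index, probability) of the highest-scoring topic per document."""
    result = []
    for row in probs:
        best = max(range(len(row)), key=lambda i: row[i])
        result.append((best, row[best]))
    return result


# Toy stand-in for a document-by-topic probability matrix (values are invented).
toy_probs = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.45, 0.25],
]

for doc_id, (topic, p) in enumerate(dominant_topics(toy_probs)):
    print(f"doc {doc_id}: dominant topic {topic} (p={p:.2f})")
```

The returned probability is worth keeping alongside the topic, since a low maximum (as in the third toy row) signals that the "dominant" topic is only weakly dominant.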

Having said that, I believe that the most accurate way of doing this is by splitting up your documents into sentences. Although this does not hold for all sentences, a sentence typically holds a single topic. Thus, by splitting up the documents into sentences and passing those to BERTopic, we can simply count how often certain topics appear in the documents by counting the related sentences.
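A minimal sketch of that counting step, under a few assumptions: the regex splitter is deliberately naive, the per-sentence topic labels are hard-coded stand-ins for what a fitted topic model would return, and both helper functions are hypothetical names used only for illustration.

```python
import re
from collections import Counter
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def topic_counts_per_document(doc_ids: List[int], sentence_topics: List[int]) -> Dict[int, Counter]:
    """Count how often each topic occurs among the sentences of each document."""
    counts: Dict[int, Counter] = {}
    for doc_id, topic in zip(doc_ids, sentence_topics):
        counts.setdefault(doc_id, Counter())[topic] += 1
    return counts


docs = [
    "Taxes went up. Roads need repair! Taxes again.",
    "Schools are closing. Taxes too.",
]

# Flatten documents into sentences while remembering which document each came from.
sentences, doc_ids = [], []
for doc_id, doc in enumerate(docs):
    for sentence in split_sentences(doc):
        sentences.append(sentence)
        doc_ids.append(doc_id)

# Assumption: invented topic labels, one per sentence, in place of real model output.
sentence_topics = [0, 1, 0, 2, 0]

counts = topic_counts_per_document(doc_ids, sentence_topics)
for doc_id, counter in counts.items():
    print(doc_id, counter.most_common())
```

Keeping doc_ids parallel to sentences is the important part: it lets sentence-level topic assignments be rolled back up to the original documents.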

Also, somewhat related--In the last couple of months I remember reading a post (Medium??) where the author used BERTopic as part of a larger process to develop topic keywords and then used those words to produce document probabilities (represented by TfIDF scores?) to categorize individual documents - does any of that ring a bell? I can't find the link anywhere.

Hmmm, that does not ring a bell, unfortunately. Do you have a use case in mind that you want to use it for?

drob-xx commented 2 years ago

Thanks for confirming that!

I believe that the most accurate way of doing this is by splitting up your documents into sentences.

Interesting. I suppose I should be able to call transform on the sentences? I thought that BERT tends not to work well on short text?

In terms of the Tf-IDF thing - I'm playing around with different ways of scoring text. However at this point I think I've convinced myself that extracting the vocabulary built with BERTopic and then using Tf-IDF to do the scoring doesn't make much sense. I'm pretty sure I saw an article asserting that this was a viable strategy.

MaartenGr commented 2 years ago

Interesting. I suppose I should be able to call transform on the sentences? I thought that BERT tends not to work well on short text?

Actually, the base model that is being used is based on sentence-transformers and, as its name suggests, works extremely well on sentences. This also means that short text, such as sentences and paragraphs, is actually preferred in BERTopic.

In terms of the Tf-IDF thing - I'm playing around with different ways of scoring text. However at this point I think I've convinced myself that extracting the vocabulary built with BERTopic and then using Tf-IDF to do the scoring doesn't make much sense. I'm pretty sure I saw an article asserting that this was a viable strategy.

One thing that might be interesting is to apply the fitted c-TF-IDF model to documents instead of the traditional TF-IDF model. That way, you can score individual documents whilst retaining some information about the topics. I have not tried it out extensively myself, apart from calculating covariates. It goes something like this:

X = topic_model.vectorizer_model.transform(documents)
c_tf_idf = topic_model.transformer.transform(X)

Here, documents is a list of strings, one per document.
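To make the two steps above more concrete, here is a pure-Python sketch of the kind of weighting involved. It assumes the c-TF-IDF form described in the BERTopic documentation (L1-normalised term frequency multiplied by log(1 + A / f_t), with A the average number of words per topic and f_t the frequency of term t across all topics). The function names and the toy tokens are hypothetical, and this is not the library's actual implementation.

```python
import math
from collections import Counter
from typing import Dict, List


def fit_idf(class_docs: List[List[str]]) -> Dict[str, float]:
    """Fit a weight log(1 + A / f_t) per term from one token list per topic."""
    term_freq: Counter = Counter()
    for tokens in class_docs:
        term_freq.update(tokens)
    avg_words_per_class = sum(term_freq.values()) / len(class_docs)
    return {t: math.log(1 + avg_words_per_class / f) for t, f in term_freq.items()}


def score_document(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Score one document: L1-normalised term frequency times the fitted weight."""
    counts = Counter(t for t in tokens if t in idf)  # out-of-vocabulary terms are dropped
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


# Toy "topics": all tokens of the documents in a topic, concatenated.
topic_tokens = [
    ["tax", "tax", "budget", "vote"],
    ["road", "bridge", "budget", "vote"],
]
idf = fit_idf(topic_tokens)

scores = score_document("tax road road unknown".split(), idf)
print(scores)
```

The weights are fitted once on topic-level counts and then reused on individual documents, which is what gives the document scores some awareness of the topics.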

drob-xx commented 2 years ago

As always this is really interesting information. I don't think I've seen you refer to using the c_tf_idf model like this. I will take a look and post here (hopefully within the week) with more questions / results.

drob-xx commented 2 years ago

I converted my docs to sentences and ran transform on them. It returns a very wide matrix. I was at something of a loss what to do next. I checked the covariates code you supplied earlier but didn't see any hints there. In the BERTopic code itself cosine_similarity is run against the matrix. However, when I ran that (on a colab+ with additional memory) it crashed. What should my next steps be? Should I just run it in batches? I'm flying blind here, help appreciated!

MaartenGr commented 2 years ago

I was at something of a loss what to do next.

That depends on what you want to do with the resulting feature matrix (c_tf_idf). Although it can be used as input for a classification algorithm, I am not entirely sure it would outperform the original TF-IDF metric, let alone improve upon the sentence embeddings as features. You mention scoring text: how do you intend to score the text, and what would those scores mean? In other words, what would the end goal be?

drob-xx commented 2 years ago

Might be more straightforward if I ask the question differently:

You suggested:

X = topic_model.vectorizer_model.transform(documents)
c_tf_idf = topic_model.transformer.transform(X)

I broke all my documents into sentences and then ran vectorizer_model.transform on them. My 30K documents result in 1.1M sentences. When I run the above transform I get a very large matrix back (1102527, 15640116). Is that what is expected? I'm not sure what to do with this result at the next stage.

MaartenGr commented 2 years ago

When I run the above transform I get a very large matrix back (1102527, 15640116). Is that what is expected? I'm not sure what to do with this result at the next stage.

Yes, that is what you can expect if you run a TF-IDF-like model. It generates a sparse matrix of size n x m, where n is the number of documents (in your case, 1.1M sentences) and m is the size of the vocabulary, which here is about 15.6M words. Having said that, it might be worthwhile to use a custom vectorizer and set the min_df parameter to a value larger than 1, as that removes words that appear in only a few documents and should shrink the vocabulary.
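To illustrate what a document-frequency cutoff does to the vocabulary, here is a small stand-alone sketch. The helper vocabulary_with_min_df is hypothetical and only mimics the effect of an integer min_df; in practice you would set min_df on a scikit-learn CountVectorizer and pass that to BERTopic through its vectorizer_model argument.

```python
from collections import Counter
from typing import List, Set


def vocabulary_with_min_df(tokenized_docs: List[List[str]], min_df: int) -> Set[str]:
    """Keep only terms that occur in at least `min_df` documents."""
    doc_freq: Counter = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term at most once per document
    return {term for term, df in doc_freq.items() if df >= min_df}


# Toy sentences; whitespace tokenisation is a simplification.
toy_sentences = [
    "the senator voted on the bill",
    "the bill passed",
    "a filibuster delayed the vote",
]
tokenized = [s.split() for s in toy_sentences]

full_vocab = vocabulary_with_min_df(tokenized, min_df=1)
pruned_vocab = vocabulary_with_min_df(tokenized, min_df=2)
print(len(full_vocab), "terms without a cutoff;", len(pruned_vocab), "terms with min_df=2")
```

Because the number of columns in the sparse matrix equals the vocabulary size, pruning rare terms reduces m directly.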

The c_tf_idf sparse matrix can then be used for different purposes. Essentially, you can see it as a feature matrix for your sentences which then can be used for, for example, classification. What are you hoping to achieve with the resulting sparse matrix?

drob-xx commented 2 years ago

What are you hoping to achieve with the resulting sparse matrix?

I was following your suggestion above to use the c_tf_idf matrix to do document classification. As I've written elsewhere the project I'm working on is to classify Congressional press releases by partisan lean. I am finally going to go back to #360 and calculate p values as a possible solution.

What I'm curious about at this point is whether or not the LDA approach of categorizing by dominant topic has a parallel with BERTopic? The obvious issue is that BERTopic tends to "classify" a relatively high percentage of the docs as -1. Of course with LDA each document has a probability for each topic, but from what I've seen you still get a large number of documents with a very suspect relationship to the dominant topic. The method I first came up with, as I've mentioned previously, is to develop a vocabulary and then create TF-IDF scores based on that vocabulary. But I'm less than convinced that this is a robust way of dealing with the issue.

MaartenGr commented 2 years ago

I was following your suggestion above to use the c_tf_idf matrix to do document classification. As I've written elsewhere the project I'm working on is to classify Congressional press releases by partisan lean.

You could use the sparse matrix directly as input for a supervised classification algorithm. Support vector machines in particular have worked well with sparse data (TF-IDF-like matrices), at least in my experience.

The obvious issue is that BERTopic tends to "classify" a relatively high percentage of the docs as -1.

There are several ways of reducing the number of outlier documents, but the most effective are either to reassign outliers by making use of calculate_probabilities, as mentioned here, or to use k-Means instead of HDBSCAN, which removes all outliers and for which you can find a guide here. The latter would fix the issue you are having with the outliers as well as the computation time that calculate_probabilities incurs on the large amount of data that you have.
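For the first option, a minimal sketch of reassigning outliers from the probability matrix could look like the following. The threshold of 0.3, the toy values, and the helper name reassign_outliers are all assumptions for illustration; the sketch also assumes that column i of the matrix corresponds to topic i.

```python
from typing import List, Sequence


def reassign_outliers(
    topics: Sequence[int],
    probs: Sequence[Sequence[float]],
    threshold: float = 0.3,  # assumption: tune this on your own data
) -> List[int]:
    """Give each outlier (-1) its most probable topic if that probability is high enough."""
    new_topics = []
    for topic, row in zip(topics, probs):
        if topic == -1:
            best = max(range(len(row)), key=lambda i: row[i])
            topic = best if row[best] >= threshold else -1
        new_topics.append(topic)
    return new_topics


# Toy topic assignments and probabilities (values are invented).
toy_topics = [0, -1, -1, 2]
toy_probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.10],
    [0.10, 0.15, 0.10],
    [0.10, 0.10, 0.70],
]

print(reassign_outliers(toy_topics, toy_probs))
```

Documents whose best probability stays below the threshold remain outliers, so the threshold controls the trade-off between coverage and confidence.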

MaartenGr commented 2 years ago

Due to inactivity, this issue will be closed. Feel free to ping me if you want to re-open the issue!