MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

topic classification #1292

Closed Amiri-Yossri closed 1 year ago

Amiri-Yossri commented 1 year ago

In order to assign a given distribution of words to a topic, I want to calculate the similarity between a new set of words and the topic words. How can I do that?

MaartenGr commented 1 year ago

You can access the topic-term matrix with topic_model.c_tf_idf_. If you want to calculate the similarity of a set of words with topic words, you will have to embed the words and apply cosine similarity between them.
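
For reference, a small sketch of accessing that matrix (assuming topic_model is an already fitted BERTopic model; get_feature_names_out requires a recent scikit-learn, older versions expose get_feature_names instead):

# Topic-term matrix: one row per topic, one column per term in the vocabulary
ctfidf_matrix = topic_model.c_tf_idf_

# The terms that the columns correspond to
terms = topic_model.vectorizer_model.get_feature_names_out()

print(ctfidf_matrix.shape, len(terms))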

Amiri-Yossri commented 1 year ago

Thank you for your answer. Can you tell me how this instance, .topic_embeddings_, calculates the topic embeddings? I want to apply the same procedure to my new words and then apply cosine_similarity.

Amiri-Yossri commented 1 year ago

When training the BERTopic model and obtaining the topics, I can measure their similarity using cosine_similarity(model.topic_embeddings_). I am wondering if I can add a new topic with my words and then apply cosine_similarity?

MaartenGr commented 1 year ago

Thank you for your answer. Can you tell me how this instance, .topic_embeddings_, calculates the topic embeddings? I want to apply the same procedure to my new words and then apply cosine_similarity.

You can use the underlying embedding model for that. Typically, you can use it like topic_model.embedding_model.embed.

When training the BERTopic model and obtaining the topics, I can measure their similarity using cosine_similarity(model.topic_embeddings_). I am wondering if I can add a new topic with my words and then apply cosine_similarity?

It would not have a c-TF-IDF representation though if you add it through embeddings only. Moreover, the topic embeddings in the updated version will be created primarily by the centroid of a cluster. Having said that, you can always try it out.

Amiri-Yossri commented 1 year ago

Thank you, but it doesn't work well for me; I still get stuck in the code. My problem: given 4 categories or labels and their associated words, how can I use a trained BERTopic model so that, when I input a document, it returns the probability of each of the 4 categories along with the associated words from the document's corpus?

MaartenGr commented 1 year ago

Ah, in that case, you can follow this procedure:

from bertopic import BERTopic

# We need calculate_probabilities=True to get the probability
# of each document belonging to each of the found topics
topic_model = BERTopic(calculate_probabilities=True).fit(docs)

# Then, we use `transform` on an input document
topic, prob = topic_model.transform(input_document)

# If you also want the associated words, then run the following
topic_model.get_topic(topic)

# If you want to see which words in the input document are associated with which topics,
# you can approximate the topic distributions, which is not an exact process.
# Calculate the topic distributions on a token level
topic_distr, topic_token_distr = topic_model.approximate_distribution([input_document], calculate_tokens=True)
df = topic_model.visualize_approximate_distribution(input_document, topic_token_distr[0])
df

All the above steps are defined in the documentation which has many tutorials and examples for you to go through.

Amiri-Yossri commented 1 year ago

I want the model to give me the probabilities that the test_document is associated with each of the 4 categories (i.e., the model outputs 4 probabilities, along with the associated words from the test_document).

MaartenGr commented 1 year ago

I am not sure if I understand you correctly; why is the example I gave above not sufficient for your use case? The .approximate_distribution checks the probabilities associated with words from the test document.

Amiri-Yossri commented 1 year ago

test_document is the document we want to test. We want to classify our document into 4 classes, classes = ['travel', 'work', 'politics', 'finance'], so if we run the code, the output I want is { 'labels': ['travel', 'work', 'politics', 'finance'], 'scores': [0.9938651919364929, 0.0032737930305302143, 0.0028610294684767723, 0.0048610294684767723], 'distribution of words': L1, L2, L3, L4 }, where L1, L2, L3, L4 are respectively the travel-, work-, politics-, and finance-related words. In other words, L1 should contain the words in test_document related to travel, and so on.

MaartenGr commented 1 year ago

The labels in your example are the same as topic_model.topic_labels_, and the scores in your example are the same as prob in my example. The words that you refer to can then be modeled with .approximate_distribution, or taken from the distribution of words in topic_model.c_tf_idf_.

In other words: labels corresponds to topic_model.topic_labels_, scores corresponds to prob, and the words come from .approximate_distribution or topic_model.c_tf_idf_.
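
A rough sketch of how that could look in code (assuming topic_model was fit with calculate_probabilities=True and test_document is the document to classify; note that the outlier topic -1 has an entry in topic_labels_ but no column in the probability matrix):

# Probabilities of the test document belonging to the found topics
topics, prob = topic_model.transform([test_document])

# "labels" in your example
labels = list(topic_model.topic_labels_.values())

# "scores" in your example
scores = prob[0]

# Token-level distribution for the words of the test document
topic_distr, topic_token_distr = topic_model.approximate_distribution([test_document], calculate_tokens=True)

result = {"labels": labels, "scores": scores, "distribution of words": topic_token_distr[0]}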

Please try it out and create a minimal example if that does not work.

Amiri-Yossri commented 1 year ago

Thank you, but there are some bugs in the code. I tried zero-shot classification, but instead of labeling on only these 4 topics it creates 156 topics. Where is the problem? Here is the code:

from bertopic.representation import ZeroShotClassification

candidate_topics = ["merger_acquisition", "money_laundering", "HR_Management", "financial_performance"]
representation_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")
topic_model3 = BERTopic(representation_model=representation_model)
topic, prob = topic_model3.fit_transform(data)
topic_model3.topic_labels_

MaartenGr commented 1 year ago

Ah, I misunderstood. So you want to perform zero-shot classification with BERTopic using the four candidate labels?

In that case, I would advise doing something like this:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

# Documents
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
doc_embeddings = sentence_model.encode(docs)

# Topics
candidate_topics = ["merger_acquisition", "money_laundering", "HR_Management", "financial_performance"]
topic_embeddings = sentence_model.encode(candidate_topics)

# Find topic assignment
sim_matrix = cosine_similarity(doc_embeddings, topic_embeddings)
topics = np.argmax(sim_matrix, axis=1)

You can then use the resulting topics in BERTopic through manual topic modeling.
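
For completeness, a sketch of that manual topic modeling step, following the approach described in the documentation (assuming the docs and topics variables from the snippet above; this is a sketch, so verify against the current docs since the exact sub-model names may differ across versions):

from bertopic import BERTopic
from bertopic.backend import BaseEmbedder
from bertopic.cluster import BaseCluster
from bertopic.dimensionality import BaseDimensionalityReduction

# Empty sub-models that skip the embedding, dimensionality reduction, and clustering steps
empty_embedding_model = BaseEmbedder()
empty_dimensionality_model = BaseDimensionalityReduction()
empty_cluster_model = BaseCluster()

topic_model = BERTopic(
    embedding_model=empty_embedding_model,
    umap_model=empty_dimensionality_model,
    hdbscan_model=empty_cluster_model,
)

# Pass the pre-computed topic assignments through `y`
new_topics, probs = topic_model.fit_transform(docs, y=topics)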

Amiri-Yossri commented 1 year ago

Thank you, can you give me an idea of how I can calculate the c-TF-IDF of a new article?

MaartenGr commented 1 year ago

Sure, you can run the following:

bow = topic_model.vectorizer_model.transform(new_article)
ctfidf = topic_model.ctfidf_model.transform(bow)

Amiri-Yossri commented 1 year ago

I got this error:

ValueError                                Traceback (most recent call last)
in <cell line: 1>()
----> 1 bow = model2.vectorizer_model.transform(article1)
      2 ctfidf = model2.ctfidf_model.transform(bow)

/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
   1425         """
   1426         if isinstance(raw_documents, str):
-> 1427             raise ValueError(
   1428                 "Iterable over raw text documents expected, string object received."
   1429             )

ValueError: Iterable over raw text documents expected, string object received.

MaartenGr commented 1 year ago

Do this instead:

bow = model2.vectorizer_model.transform([article1])
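
As a side note, once that works, the resulting c-TF-IDF vector could be compared against the fitted topics. A minimal sketch (assuming model2 and article1 from the snippets above):

from sklearn.metrics.pairwise import cosine_similarity

bow = model2.vectorizer_model.transform([article1])
ctfidf = model2.ctfidf_model.transform(bow)

# Similarity between the article's c-TF-IDF vector and each topic's c-TF-IDF vector
similarities = cosine_similarity(ctfidf, model2.c_tf_idf_)
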
Amiri-Yossri commented 1 year ago

Thank you. In the first step, are we using the same embedding model for embedding documents and words, or are we using different embedding methods? And if so, is the word embedding a contextual word embedding?

MaartenGr commented 1 year ago

Yes, it's the same model, and the model is a sentence-transformers model that has an excellent documentation page with more information.

Amiri-Yossri commented 1 year ago

Thank you. Please, can you tell me how I can calculate the similarity between a list of given words and the topics? I know how to get the topic embeddings of the model using topic_model.topic_embeddings_, but I can't determine how to calculate the embeddings of a list of words.

MaartenGr commented 1 year ago

You can pass each word individually and then average them or you can join them as a single string and then pass them to the embedding model. To do so, you would have to use the topic_model.embedding_model.embed function.
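
Both options in a minimal sketch (assuming topic_model is a fitted BERTopic model; the variable names are just for illustration):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

words = ["travel", "flight", "hotel"]  # hypothetical list of words

# Option 1: embed each word individually and average the embeddings
word_embeddings = topic_model.embedding_model.embed(words)
avg_embedding = np.mean(word_embeddings, axis=0).reshape(1, -1)

# Option 2: join the words into a single string and embed that
joined_embedding = topic_model.embedding_model.embed([" ".join(words)])

# Either way, compare the result against the topic embeddings
similarities = cosine_similarity(avg_embedding, topic_model.topic_embeddings_)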

Amiri-Yossri commented 1 year ago

Thank you. After training BERTopic on a dataset, I want to display the words from the test_document that are associated with each topic. I used the command BerTopic_model.get_topic(4), but it displays the words associated with topic 4 from the whole corpus of the dataset. Now, I want to display the words from the test_document that are associated with topic 4.

MaartenGr commented 1 year ago

You can use .approximate_distribution for that. It shows which words are related to which topics in a document.

Amiri-Yossri commented 1 year ago

Thank you, but it didn't work well because it returns a matrix, and I want it to output the words in test_document related to each topic.

MaartenGr commented 1 year ago

The matrix that it gives you back is a token-topic matrix, which essentially describes how well a certain word in test_document is related to each topic. It is then up to you to extract which word in test_document is best related to each topic by taking the maximum value and index.
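
For example (assuming topic_token_distr comes from .approximate_distribution(..., calculate_tokens=True) as in the snippet earlier in this thread):

import numpy as np

token_topic_matrix = topic_token_distr[0]

# For each token (row), the topic (column) with the highest value and that value
best_topics = np.argmax(token_topic_matrix, axis=1)
best_values = np.max(token_topic_matrix, axis=1)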

Amiri-Yossri commented 1 year ago

The problem is that I applied CountVectorizer to remove stopwords and add bigrams. So is there a command that enables me to return the tokens from my test_document, so that when I output the topic-token matrix it will be easy to extract which words in test_document are best related to each topic by taking the maximum values and indexes?

MaartenGr commented 1 year ago

If you used a CountVectorizer model that used an n-gram range that includes unigrams, then you could still run .approximate_distribution and only extract the unigram tokens instead of the bigram tokens.

Is there a command that enables me to return the tokens from my test_document, so that when I output the topic-token matrix it will be easy to extract which words in test_document are best related to each topic by taking the maximum values and indexes?

Well, that is exactly what .approximate_distribution does. It returns the tokens from your test-document and shows which topics are best related to those tokens.

Have you tried the following?

# Calculate the topic distributions on a token-level
topic_distr, topic_token_distr = topic_model.approximate_distribution(docs, calculate_tokens=True)

# Visualize the token-level distributions
df = topic_model.visualize_approximate_distribution(docs[0], topic_token_distr[0])
df

Amiri-Yossri commented 1 year ago

Yes, I tried this:

topic_distr, topic_token_distr = model2.approximate_distribution(article1, calculate_tokens=True)

Then I used the command topic_token_distr[0].shape, which returned (359, 100), meaning that I have 359 tokens and 100 topics. Then I used this code:

l1 = []
for i in range(topic_token_distr[0].shape[0]):
    for j in range(topic_token_distr[0].shape[1]):
        if topic_token_distr[0][i][j] != 0:
            l1.append((topic_token_distr[0][i][j], i, j))
l1

which returns a list of tuples, each composed of 3 elements, indicating the probability (> 0) that the token in row i belongs to the topic in column j. But now I want to get these tokens themselves; how can I do that?

MaartenGr commented 1 year ago

Have you tried accessing the df that I showed before? It contains the probabilities that a token belongs to a specific topic, as well as the tokens in the document.
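
If you prefer to work with the raw matrix instead of the visualization, a hedged sketch of one way to recover the tokens (assuming model2 and article1 from your snippet, and assuming the rows of the token-topic matrix were produced with the vectorizer's tokenizer, which you should verify by checking that the lengths match):

import numpy as np

topic_distr, topic_token_distr = model2.approximate_distribution([article1], calculate_tokens=True)

# Tokenize the document the same way the CountVectorizer would
tokens = model2.vectorizer_model.build_tokenizer()(article1)

distr = topic_token_distr[0]
for i, token in enumerate(tokens[: distr.shape[0]]):
    best_topic = int(np.argmax(distr[i]))
    if distr[i, best_topic] > 0:
        print(token, "->", best_topic, round(float(distr[i, best_topic]), 3))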

MaartenGr commented 1 year ago

Closing this due to inactivity. Let me know if I need to re-open the issue!