You can access the topic-term matrix with topic_model.c_tf_idf_. If you want to calculate the similarity of a set of words with the topic words, you will have to embed the words and apply cosine similarity between them.
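For reference, a minimal sketch of inspecting that matrix (this assumes a fitted topic_model and an sklearn version that provides get_feature_names_out; older versions use get_feature_names instead):

# The topic-term (c-TF-IDF) matrix: one row per topic, one column per vocabulary term
ctfidf_matrix = topic_model.c_tf_idf_

# The vocabulary that indexes the columns of that matrix
vocab = topic_model.vectorizer_model.get_feature_names_out()

# Top-scoring words for one row of the matrix (row 0 is usually the outlier topic -1)
scores = ctfidf_matrix[0].toarray().flatten()
top_words = sorted(zip(vocab, scores), key=lambda item: item[1], reverse=True)[:10]
print(top_words)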
Thank you for your answer. Can you tell me how the .topic_embeddings_ attribute calculates the topic embeddings? I want to apply the same procedure to my new words and then apply cosine_similarity.
When training the BERTopic model and obtaining the topics, I can measure their similarity using cosine_similarity(model.topic_embeddings_). I am wondering if I can add a new topic with my words and then apply cosine_similarity?
> Thank you for your answer. Can you tell me how the .topic_embeddings_ attribute calculates the topic embeddings? I want to apply the same procedure to my new words and then apply cosine_similarity.
You can use the underlying embedding model for that. Typically, you can use it like topic_model.embedding_model.embed.
> When training the BERTopic model and obtaining the topics, I can measure their similarity using cosine_similarity(model.topic_embeddings_). I am wondering if I can add a new topic with my words and then apply cosine_similarity?
It would not have a c-TF-IDF representation, though, if you add it through embeddings only. Moreover, the topic embeddings in the updated version are created primarily from the centroid of each cluster. Having said that, you can always try it out.
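If you do want to try it, a rough sketch could look like this (the word list is a placeholder and this assumes a fitted topic_model; note that such an added "topic" only lives in embedding space, not in the c-TF-IDF representation):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Represent the new "topic" by embedding its words as a single joined string
new_topic_words = ["refund", "invoice", "payment"]
new_topic_embedding = topic_model.embedding_model.embed([" ".join(new_topic_words)])

# Stack it below the existing topic embeddings and compute all pairwise similarities
all_embeddings = np.vstack([topic_model.topic_embeddings_, new_topic_embedding])
sim_matrix = cosine_similarity(all_embeddings)

# The last row holds the similarity of the new "topic" to every existing topic
print(sim_matrix[-1])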
Thank you, but it doesn't work well for me; I still get stuck in the code. My problem: given 4 categories or labels and their associated words, how can I use a trained BERTopic model so that, when I input a document, it returns the probabilities of the 4 categories along with the associated words from the document corpus?
Ah, in that case, you can follow this procedure:
from bertopic import BERTopic

# Setting calculate_probabilities=True gives us the probability of each
# document belonging to each of the found topics
topic_model = BERTopic(calculate_probabilities=True).fit(docs)

# Then, we use `transform` on an input document
topic, prob = topic_model.transform([input_document])

# If you also want the associated words, run the following
topic_model.get_topic(topic[0])

# If you want to see which words in the input document are associated with which topics,
# you can approximate the topic distributions (this is not an exact process)

# Calculate the topic distributions on a token level
topic_distr, topic_token_distr = topic_model.approximate_distribution([input_document], calculate_tokens=True)
df = topic_model.visualize_approximate_distribution(input_document, topic_token_distr[0])
df
All of the above steps are described in the documentation, which has many tutorials and examples for you to go through.
I want the model to give me the probabilities that the test_document is associated with each of the 4 categories (i.e., the model outputs 4 probabilities, together with the associated words from the test_document).
I am not sure if I understand you correctly; why is the example I gave above not sufficient for your use case? The .approximate_distribution function checks the probabilities associated with the words from the test document.
test_document is the document we want to test; we want to classify our document into 4 classes, classes = ['travel', 'work', 'politics', 'finance']. If we run the code, the output I want is { 'labels': ['travel', 'work', 'politics', 'finance'], 'scores': [0.9938651919364929, 0.0032737930305302143, 0.0028610294684767723, 0.0048610294684767723], 'distribution of words': L1, L2, L3, L4 }, where L1, L2, L3, L4 are respectively travel-, work-, politics-, and finance-related words. In other words, L1 should contain the words in test_document related to travel, and so on.
The labels in your example are the same as topic_model.topic_labels_, and the scores in your example are the same as prob in my example. The words that you refer to can then be modeled with .approximate_distribution, or the distribution of words can be taken from topic_model.c_tf_idf_.
In other words:
labels = topic_model.topic_labels_
scores = prob
distribution of words = topic_model.c_tf_idf_ together with topic_model.vectorizer_model.get_feature_names_out()
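For instance, a rough sketch of assembling that output (assuming the model was fitted with calculate_probabilities=True as in the earlier example and that input_document holds your test document; the dictionary keys simply mirror your example):

# Probabilities of the input document over the found topics
topic, prob = topic_model.transform([input_document])
topic_distr, topic_token_distr = topic_model.approximate_distribution([input_document], calculate_tokens=True)

result = {
    "labels": list(topic_model.topic_labels_.values()),
    "scores": prob[0],
    # Token-level distribution: which words in the document relate to which topic
    "distribution of words": topic_token_distr[0],
}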
Please try it out and create a minimal example if that does not work.
Thank you, but there are some bugs in the code. I tried zero-shot classification, but instead of labeling with only these 4 topics it creates 156 topics. Where is the problem? Here is the code:
from bertopic.representation import ZeroShotClassification

candidate_topics = ["merger_acquisition", "money_laundering", "HR_Management", "financial_performance"]
representation_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")

topic_model3 = BERTopic(representation_model=representation_model)
topic, prob = topic_model3.fit_transform(data)
topic_model3.topic_labels_
Ah, I misunderstood. So you want to perform zero-shot classification with BERTopic using the four candidate labels?
In that case, I would advise doing something like this:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
# Documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
doc_embeddings = sentence_model.encode(docs)
# Topics
candidate_topics = ["merger_acquisition", "money_laundering", "HR_Management", "financial_performance"]
topic_embeddings = sentence_model.encode(candidate_topics)
# Find topic assignment
sim_matrix = cosine_similarity(doc_embeddings, topic_embeddings)
topics = np.argmax(sim_matrix, axis=1)
You can then use the resulting topics in BERTopic through manual topic modeling.
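For the manual topic modeling step, a sketch along the lines of the example in the BERTopic documentation could look like this (it reuses docs, doc_embeddings, and topics from the snippet above; details may vary between versions):

from bertopic import BERTopic
from bertopic.backend import BaseEmbedder
from bertopic.cluster import BaseCluster
from bertopic.dimensionality import BaseDimensionalityReduction

# Empty sub-models so BERTopic skips embedding, dimensionality reduction, and clustering
empty_embedding_model = BaseEmbedder()
empty_dimensionality_model = BaseDimensionalityReduction()
empty_cluster_model = BaseCluster()

topic_model = BERTopic(
    embedding_model=empty_embedding_model,
    umap_model=empty_dimensionality_model,
    hdbscan_model=empty_cluster_model,
)

# Pass the pre-computed embeddings and topic assignments through `embeddings` and `y`
topics, probs = topic_model.fit_transform(docs, embeddings=doc_embeddings, y=topics)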
Thank you. Can you give me an idea of how I can calculate the c-TF-IDF of a new article?
Sure, you can run the following:
bow = topic_model.vectorizer_model.transform(new_article)
ctfidf = topic_model.ctfidf_model.transform(bow)
I got this error:
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
   1425         """
   1426         if isinstance(raw_documents, str):
-> 1427             raise ValueError(
   1428                 "Iterable over raw text documents expected, string object received."
   1429             )

ValueError: Iterable over raw text documents expected, string object received.
Do this instead:
bow = model2.vectorizer_model.transform([article1])
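For completeness, a rough sketch that also compares the resulting vector to the existing topics (the cosine-similarity step is an assumption about what you want to do with it; model2 and article1 follow your naming):

from sklearn.metrics.pairwise import cosine_similarity

# Bag-of-words and c-TF-IDF representation of the new article
bow = model2.vectorizer_model.transform([article1])
ctfidf = model2.ctfidf_model.transform(bow)

# Similarity of the article's c-TF-IDF vector to every topic's c-TF-IDF vector
# (note that the rows of c_tf_idf_ may include the outlier topic -1)
sims = cosine_similarity(ctfidf, model2.c_tf_idf_)
best_row = sims.argmax()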
Thank you. In the first step, are we using the same embedding model for embedding documents and words, or are we using different embedding methods? And if so, is the word embedding a contextual word embedding?
Yes, it's the same model. The model is a sentence-transformers model, which has an excellent documentation page with more information.
Thank you. Please can you tell me how I can calculate the similarity between a list of given words and the topics? I know how to encode the topics of the model using topic_model.topic_embeddings_, but I can't determine how to calculate the embeddings of a list of words.
You can pass each word individually and then average the embeddings, or you can join the words into a single string and pass that to the embedding model. To do so, you would use the topic_model.embedding_model.embed function.
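A minimal sketch of both options (the word list is a placeholder and this assumes a fitted topic_model):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

words = ["tax", "profit", "revenue"]

# Option 1: embed each word individually and average the embeddings
word_embeddings = topic_model.embedding_model.embed(words)
avg_embedding = np.asarray(word_embeddings).mean(axis=0).reshape(1, -1)

# Option 2: join the words into a single string and embed that
joint_embedding = topic_model.embedding_model.embed([" ".join(words)])

# Similarity of either representation to the existing topic embeddings
sims = cosine_similarity(avg_embedding, topic_model.topic_embeddings_)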
Thank you. After training BERTopic on a dataset, I want to display the words from the test_document that are associated with each topic. I used the command BERTopic_model.get_topic(4), but it displays the words associated with topic 4 from the whole corpus of the dataset. Now, I want to display only the words from the test_document that are associated with topic 4.
You can use .approximate_distribution for that. It shows which words are related to which topics in a document.
Thank you, but it didn't work well, because it returns a matrix and I want it to output the words in test_document related to each topic.
The matrix that it gives you back is a token-topic matrix, which essentially describes how well a certain word in test_document is related to each topic. It is then up to you to extract which word in test_document is best related to each topic by taking the maximum value and its index.
The problem is that I applied a CountVectorizer to remove stopwords and add bigrams. Is there a command that enables me to return the tokens from my test_document, so that when I output the topic-token matrix it is easy to extract which words in test_document are best related to each topic by taking the maximum values and indices?
If you used a CountVectorizer model with an n-gram range that includes unigrams, then you could still run .approximate_distribution and only extract the unigram tokens instead of the bigram tokens.
> Is there a command that enables me to return the tokens from my test_document, so that when I output the topic-token matrix it is easy to extract which words in test_document are best related to each topic by taking the maximum values and indices?
Well, that is exactly what .approximate_distribution does. It returns the tokens from your test_document and shows which topics are best related to those tokens.
Have you tried the following?
# Calculate the topic distributions on a token-level
topic_distr, topic_token_distr = topic_model.approximate_distribution(docs, calculate_tokens=True)
# Visualize the token-level distributions
df = topic_model.visualize_approximate_distribution(docs[0], topic_token_distr[0])
df
Yes, I tried this: topic_distr, topic_token_distr = model2.approximate_distribution(article1, calculate_tokens=True). Then I used topic_token_distr[0].shape, which returned (359, 100), meaning I have 359 tokens and 100 topics. Then I used this code:
l1 = []
for i in range(topic_token_distr[0].shape[0]):
    for j in range(topic_token_distr[0].shape[1]):
        if topic_token_distr[0][i][j] != 0:
            l1.append((topic_token_distr[0][i][j], i, j))
l1
It returns a list of tuples, each composed of 3 elements, indicating the non-zero probability that the token in row i belongs to the topic in column j. But now I want to get these tokens; how can I do that?
Have you tried accessing the df that I showed before? It contains the probabilities that a token belongs to a specific topic, as well as the tokens in the document.
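Alternatively, a sketch of one way to get at those tokens programmatically (this assumes the CountVectorizer's tokenizer matches the tokenization used internally by .approximate_distribution, which is an assumption on my part, hence the sanity check):

# Re-create the tokens of the test document with the vectorizer's tokenizer
tokenizer = model2.vectorizer_model.build_tokenizer()
tokens = tokenizer(article1)

# Sanity check: this should match the first dimension of the token-topic matrix
assert len(tokens) == topic_token_distr[0].shape[0]

# For every topic, pick the token with the highest value
token_topic = topic_token_distr[0]
for topic_idx in range(token_topic.shape[1]):
    best_token_idx = token_topic[:, topic_idx].argmax()
    if token_topic[best_token_idx, topic_idx] > 0:
        print(topic_idx, tokens[best_token_idx])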
Closing this due to inactivity. Let me know if I need to re-open the issue!
In order to assign a given distribution of words to a topic, I want to calculate the similarity between a new set of words and the topic words. How can I do that?