MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Online topic modeling_Partial_fit #1185

Closed nfsedaghat closed 1 year ago

nfsedaghat commented 1 year ago

Hi, I am going to use online topic modeling in my project on a dataset with 360 documents. I am using the code below, which I copied from the BERTopic documentation:

```python
# Incrementally fit the topic model by training on 100 documents at a time
# and track the topics in each iteration
chunkSize = 100
doc_chunks = [df.loc[i:i+chunkSize-1, 'Description'] for i in range(0, len(df), chunkSize)]

topics = []
ll = 0
for docs in doc_chunks:
    ll += len(docs)
    topic_model.partial_fit(docs)
    print('Number of docs as input to the model: ', len(docs), '\n',
          'Number of returned topics: ', len(topic_model.topics_), '\n \n')
    print('**********')
    topics.extend(topic_model.topics_)
```

After running this snippet, this is what I get:

```
Number of docs as input to the model:  100 
 Number of returned topics:  100 

**********
Number of docs as input to the model:  100 
 Number of returned topics:  95 

**********
Number of docs as input to the model:  100 
 Number of returned topics:  100 

**********
Number of docs as input to the model:  60 
 Number of returned topics:  60 
```

As you can see, in the second chunk the number of documents is 100 but the number of returned topics is 95. Why does this happen? I should mention that sometimes it runs perfectly, but other times this case occurs.

MaartenGr commented 1 year ago

That is strange indeed! That might be a result of the clustering model but I cannot be sure. Could you share your full code?
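One way to check that hypothesis would be to call the clustering model directly, outside of BERTopic, and verify that it returns exactly one label per input vector. A rough, hypothetical sketch (here `cluster_model` stands for whatever model you pass to BERTopic, assumed to expose the `partial_fit`/`labels_` interface BERTopic expects from a custom clustering model):

```python
import numpy as np

# Hypothetical sanity check: feed the clustering model a batch of random
# vectors and confirm it produces one label per input.
dummy_embeddings = np.random.rand(100, 5)
cluster_model.partial_fit(dummy_embeddings)
print(len(cluster_model.labels_))  # should print 100 if labels align 1:1 with inputs
```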

nfsedaghat commented 1 year ago

This is the whole code:

```python
import tensorflow_hub
from river import stream
from river import cluster
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer, ClassTfidfTransformer


class River:
    def __init__(self, model):
        self.model = model

    def partial_fit(self, umap_embeddings):
        # Incrementally train the river clustering model on the new embeddings
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            self.model = self.model.learn_one(umap_embedding)

        # Predict a cluster label for each embedding in the batch
        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            label = self.model.predict_one(umap_embedding)
            labels.append(label)

        self.labels_ = labels
        return self


cluster_model = River(cluster.DBSTREAM())
vectorizer_model = OnlineCountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

topic_model = BERTopic(embedding_model=embedding_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model,
                       ctfidf_model=ctfidf_model,
                       verbose=False, top_n_words=30, low_memory=True)

# df is a pandas DataFrame with a 'Description' column holding the documents
chunkSize = 100
doc_chunks = [df.loc[i:i+chunkSize-1, 'Description'] for i in range(0, len(df), chunkSize)]

topics = []
ll = 0
for docs in doc_chunks:
    ll += len(docs)
    topic_model.partial_fit(docs)
    print('Number of docs as input to the model: ', len(docs), '\n',
          'Number of returned topics: ', len(topic_model.topics_), '\n \n')
    print('**********')
    topics.extend(topic_model.topics_)
```

MaartenGr commented 1 year ago

I am not entirely sure what is happening here. Could you try it with a different clustering algorithm instead?
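For illustration, swapping in a different incremental clustering algorithm from river only means changing the model handed to the wrapper. A minimal sketch using CluStream (just one of several options, shown with its default parameters and reusing the `River` wrapper and other models from your snippet):

```python
from river import cluster

# Swap DBSTREAM for another river clustering algorithm, e.g. CluStream
cluster_model = River(cluster.CluStream())

topic_model = BERTopic(embedding_model=embedding_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model,
                       ctfidf_model=ctfidf_model,
                       verbose=False, top_n_words=30, low_memory=True)
```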

nfsedaghat commented 1 year ago

Thanks for your suggestion. I replaced DBSTREAM with CluStream and it worked! Just one more question: with online topic modeling, can we plot the hierarchical structure of topics, as well as topic evolution over time, every time we feed the model new data? How can we interpret them, given that the topics can change with each model update?

Thanks again for your very kind help.

MaartenGr commented 1 year ago

You can do those things if you make sure that you also track the .topics_. With respect to interpretation, that depends on your use case. The topic representations might change over time, so tracking them might be worthwhile if that is important to you.
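A minimal sketch of what that tracking could look like (reusing `doc_chunks` from above and assuming a `timestamps` list with one timestamp per document; this is an illustration, not code from this thread):

```python
all_docs = []
topics = []

for docs in doc_chunks:
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)  # keep the labels assigned in every batch
    all_docs.extend(docs)

# Make the tracked topics available on the model before visualizing
topic_model.topics_ = topics

# Hierarchical structure of the topics found so far
hierarchical_topics = topic_model.hierarchical_topics(all_docs)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# Topic evolution over time (requires one timestamp per document)
topics_over_time = topic_model.topics_over_time(all_docs, timestamps)
topic_model.visualize_topics_over_time(topics_over_time)
```

Keep in mind that these visualizations reflect the model at the moment you create them; after another `partial_fit`, the topic representations may have shifted.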