That is strange indeed! That might be a result of the clustering model but I cannot be sure. Could you share your full code?
This is the whole code:
```python
import tensorflow_hub
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer, OnlineCountVectorizer
from river import cluster, stream


class River:
    """Wrapper so a river clustering model can be passed to BERTopic as hdbscan_model."""

    def __init__(self, model):
        self.model = model

    def partial_fit(self, umap_embeddings):
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            self.model = self.model.learn_one(umap_embedding)

        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            label = self.model.predict_one(umap_embedding)
            labels.append(label)

        self.labels_ = labels
        return self


cluster_model = River(cluster.DBSTREAM())
vectorizer_model = OnlineCountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

topic_model = BERTopic(embedding_model=embedding_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model,
                       ctfidf_model=ctfidf_model,
                       verbose=False, top_n_words=30, low_memory=True)

# df is a DataFrame with a 'Description' column; fit incrementally on chunks of 100 documents
chunkSize = 100
doc_chunks = [df.loc[i:i+chunkSize-1, 'Description'] for i in range(0, len(df), chunkSize)]

topics = []
ll = 0
for docs in doc_chunks:
    ll += len(docs)
    topic_model.partial_fit(docs)
    print('Number of docs as input to the model: ', len(docs), '\n',
          'Number of returned topics: ', len(topic_model.topics_), '\n \n')
    print('**********')
    topics.extend(topic_model.topics_)
```
I am not entirely sure what is happening here. Could you try it with a different clustering algorithm instead?
Thanks for your suggestion. I replaced `DBSTREAM` with `CluStream` and it worked!
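Concretely, the only change was the model passed to the `River` wrapper. A minimal sketch, using river's default `CluStream` parameters (which may need tuning for your data):

```python
from river import cluster

# Same River wrapper as before; only the underlying streaming clusterer changes
cluster_model = River(cluster.CluStream())

topic_model = BERTopic(embedding_model=embedding_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model,
                       ctfidf_model=ctfidf_model,
                       verbose=False, top_n_words=30, low_memory=True)
```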
Just one more question:
In online topic modeling, can we still plot the hierarchical structure of topics, as well as topics over time, every time we feed the model new data? And how should we interpret those plots, given that the topics can change with each model update?
Thanks again for your very kind help.
You can do those things if you make sure that you also track the `.topics_` attribute. With respect to interpretation, that depends on your use case. The topic representations might change over time, so tracking them might be worthwhile if that matters for your application.
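Building on the snippet above, a minimal sketch of what that tracking could look like before producing both plots. It assumes the `topic_model` and `doc_chunks` from earlier, and a hypothetical `timestamps` list (one timestamp per document) that you would supply yourself for the topics-over-time plot:

```python
all_docs = []
all_topics = []

for docs in doc_chunks:
    topic_model.partial_fit(docs)

    # .topics_ only holds the topics of the most recent chunk,
    # so keep a running list of everything seen so far
    all_docs.extend(docs)
    all_topics.extend(topic_model.topics_)

# Put the full topic history back on the model before visualizing
topic_model.topics_ = all_topics

# Hierarchical structure of the topics found so far
hierarchical_topics = topic_model.hierarchical_topics(all_docs)
fig_hierarchy = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# Topic evolution over time, using your own per-document timestamps
topics_over_time = topic_model.topics_over_time(all_docs, timestamps)
fig_over_time = topic_model.visualize_topics_over_time(topics_over_time)
```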
Hi, I am going to use online topic modeling in my project on a dataset with 360 documents. I use the code below, which I copied from the BERTopic webpage:
```python
# Incrementally fit the topic model by training on 100 documents at a time
# and track the topics in each iteration
chunkSize = 100
doc_chunks = [df.loc[i:i+chunkSize-1, 'Description'] for i in range(0, len(df), chunkSize)]
```

After running this snippet, this is what I get:
As you can see, in the second chunk the number of documents is 100, but the number of returned topics is 95. Why does this happen? I should mention that sometimes it runs perfectly, but sometimes this case occurs.