Open daianacric95 opened 10 months ago
Did you make sure that the number of documents in `text_col` is the same as the number of dates in `date`? The only way to run `.topics_over_time` is by making sure that both the documents and the dates are the same size. I am assuming you have the raw data saved, so why not use that as the input for `topics_over_time`?
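A quick way to verify this before calling `.topics_over_time` is to compare the lengths explicitly. The sketch below is only an illustration: `check_same_length` is a hypothetical helper (not part of BERTopic), and the sample `text_col` and `date` values are made up.

```python
def check_same_length(docs, timestamps, topics=None):
    """Raise a descriptive error unless all inputs have one entry per document."""
    sizes = {"documents": len(docs), "timestamps": len(timestamps)}
    if topics is not None:
        sizes["topics"] = len(topics)
    if len(set(sizes.values())) != 1:
        raise ValueError(f"Length mismatch: {sizes}")
    return len(docs)


# Made-up sample data standing in for the real columns.
text_col = ["first document", "second document", "third document"]
date = ["2023-01-01", "2023-01-02", "2023-01-03"]

n_docs = check_same_length(text_col, date)

# With a fitted model, the call would then look like this (not executed here):
# topics_over_time = topic_model.topics_over_time(text_col, date)
```

Passing the stored topic assignments as the optional third argument also catches the case where the topics and the documents have drifted apart.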
Hi there, Maarten! Thank you so much for taking the time to answer my question. I realized that I made a mistake during the batch processing: I had not updated the topics as you suggested in the Online Topic Modeling section. Once I fixed that, the topics and the documents had the same length and I could plot the hierarchical topics. However, I am now running into a problem when I try to merge topics with the following code:
topics_to_merge = [1, 2]
topic_model.merge_topics(docs, topics_to_merge)
I am getting a KeyError and I am not able to merge the topics.
I am attaching the batch-processing code for further context:
``
indices = np.arange(len(text_column))
np.random.shuffle(indices)
text_column = [text_column[i] for i in indices]
chunk_size = 10000
text_chunks = [text_column[i:i + chunk_size] for i in range(0, len(text_column), chunk_size)]
topics = []
for i in tqdm(range(len(text_chunks)), desc="Processing chunks"):
    text_chunk = text_chunks[i]
    topics_chunk = model.fit_transform(text_chunk)
    topics.extend(topics_chunk)
topic_model.topics_ = topics
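One thing worth checking in a loop like this: `fit_transform` returns a tuple of `(topics, probabilities)`, so extending a list with its raw return value adds the two tuple elements rather than one topic per document. The sketch below shows the difference with a stand-in function; `fake_fit_transform` is hypothetical and only mimics the shape of the return value.

```python
def fake_fit_transform(docs):
    """Hypothetical stand-in that mimics the (topics, probabilities) return shape."""
    topics = [0 for _ in docs]           # one topic id per document
    probabilities = [1.0 for _ in docs]  # one probability per document
    return topics, probabilities


chunks = [["doc a", "doc b"], ["doc c", "doc d", "doc e"]]

# Extending with the raw return value adds two items per chunk (the tuple's elements).
wrong = []
for chunk in chunks:
    wrong.extend(fake_fit_transform(chunk))

# Unpacking first yields one topic id per document.
right = []
for chunk in chunks:
    chunk_topics, _ = fake_fit_transform(chunk)
    right.extend(chunk_topics)

n_docs = sum(len(chunk) for chunk in chunks)
print(len(wrong), len(right), n_docs)
```

Note also that each `fit_transform` call fits the model anew, so topic ids collected from different chunks do not necessarily refer to the same topics.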
Could you format your code with ```python so that it is easier to read? Also, could you share your full code? Moreover, could you share the full error that you received? Without it, it is difficult to understand where the issue lies.
Hello Maarten! Thank you for being so patient. Here is the full code I have used and the error I am getting when trying to merge the topics:
umap_model = UMAP()
hdbscan_model = HDBSCAN()
vectorizer_model = CountVectorizer(stop_words="english")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    top_n_words=10,
    language='english',
    verbose=True
)

def process_batch(data):
    topics, _ = model.fit_transform(data)
    return topics
batch_size = 1000
batches = [text_col[i:i + batch_size] for i in range(0, len(text_col), batch_size)]
# Process each batch
all_topics = []
for batch in batches:
    batch_topics = process_batch(batch)
    all_topics.extend(batch_topics)
## Merge topics
topics_to_merge = [2, 3]
topic_model.merge_topics(text_col, topics_to_merge)
Additionally, I have tried to use the package River from the Online Topic Modelling tutorial, but I keep getting the same error every time (I tried to install it locally on my Mac machine and also on Google Colab):
Update: I managed to make River work (needed to install an older version) but I am once again encountering the same issue:
from river import stream
from river import cluster
class River:
    def __init__(self, model):
        self.model = model

    def partial_fit(self, umap_embeddings):
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            self.model = self.model.learn_one(umap_embedding)

        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            label = self.model.predict_one(umap_embedding)
            labels.append(label)

        self.labels_ = labels
        return self
# Using DBSTREAM to detect new topics as they come in
cluster_model = River(cluster.DBSTREAM())
vectorizer_model = OnlineCountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)
# Prepare model
topic_model_v2 = BERTopic(
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)
batch_size = 1000
batches = [text_col[i:i + batch_size] for i in range(0, len(text_col), batch_size)]
all_my_topics=[]
# Incrementally fit the topic model by training on 1000 documents at a time
for batches in batches:
    topic_model_v2.partial_fit(batch)
Additionally, the results of partial_fit are substantially different from those of fit_transform(): with fit_transform() I get 200 topics, while with partial_fit I get only 18.
In your code:
batch_size = 1000
batches = [text_col[i:i + batch_size] for i in range(0, len(text_col), batch_size)]
all_my_topics=[]
# Incrementally fit the topic model by training on 1000 documents at a time
for batches in batches:
    topic_model_v2.partial_fit(batch)
You are not updating the internal `topic_model_v2.topics_`, which still should be done. So something like this:
batch_size = 1000
batches = [text_col[i:i + batch_size] for i in range(0, len(text_col), batch_size)]
all_my_topics=[]
# Incrementally fit the topic model by training on 1000 documents at a time
for batch in batches:
    topic_model_v2.partial_fit(batch)
    all_my_topics.extend(topic_model_v2.topics_)
topic_model_v2.topics_ = all_my_topics
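The bookkeeping can be tried out without a trained model. The sketch below uses a hypothetical `FakeOnlineModel` that only mimics the `partial_fit` / `topics_` interface, under the assumption (implied by the suggestion above) that `topics_` holds the assignments of the latest batch only; the `make_batches` helper and the sample documents are likewise made up.

```python
class FakeOnlineModel:
    """Hypothetical stand-in exposing partial_fit and topics_ like an online topic model."""

    def __init__(self):
        self.topics_ = []

    def partial_fit(self, docs):
        # Assign a dummy topic id per document; only the latest batch is kept.
        self.topics_ = [len(doc) % 3 for doc in docs]
        return self


def make_batches(docs, batch_size):
    """Split docs into consecutive batches of at most batch_size items."""
    return [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]


text_col = [f"document number {i}" for i in range(25)]
model = FakeOnlineModel()

all_my_topics = []
for batch in make_batches(text_col, batch_size=10):
    model.partial_fit(batch)
    all_my_topics.extend(model.topics_)  # collect before the next batch overwrites it

# Only after the loop: store one topic id per document on the model.
model.topics_ = all_my_topics
```

After the loop, `len(model.topics_)` equals `len(text_col)`, which is the condition that downstream calls taking both documents and topics rely on.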
> Additionally, the results of partial_fit are substantially different from those of fit_transform(): with fit_transform() I get 200 topics, while with partial_fit I get only 18.
This is to be expected since you are using two completely different clustering models. In the former, you are using HDBSCAN whereas the latter uses DBSTREAM. Moreover, the training procedures are different: in the former you are training UMAP on the entire dataset, whilst in the latter you are training UMAP only on the very first batch.
Instead, your use case might be good for the newly introduced `.merge_models`. The method allows for different topic models to be merged together. When you combine two models with this method, the first model will remain as it is and the second model will be added as long as it contains new clusters. Existing clusters will not be added since those were already found in the first model.
You can do this continuously and keep on merging models this way every time you train a new model. This means that you can use this method for incremental learning by iteratively training and merging models. You can find more about that at https://github.com/MaartenGr/BERTopic/pull/1516.
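The train-then-merge loop described above can be sketched generically. This is a minimal illustration, not the library's implementation: `train_and_merge`, `toy_train`, and `toy_merge` are hypothetical names, and a "model" here is just a set of labels so that the example runs without BERTopic installed.

```python
def train_and_merge(batches, train_fn, merge_fn):
    """Train one model per batch and fold each into a running merged model."""
    merged = None
    for batch in batches:
        new_model = train_fn(batch)
        merged = new_model if merged is None else merge_fn(merged, new_model)
    return merged


# Toy stand-ins: a "model" is just the set of first letters seen in its batch.
def toy_train(batch):
    return {doc[0] for doc in batch}


def toy_merge(first, second):
    # Keep everything in the first model and add only what is new in the second.
    return first | (second - first)


batches = [["apple", "avocado"], ["banana", "apricot"], ["cherry"]]
merged = train_and_merge(batches, toy_train, toy_merge)
```

With BERTopic, `train_fn` could be something like `lambda docs: BERTopic().fit(docs)` and `merge_fn` something like `lambda a, b: BERTopic.merge_models([a, b])`; treat both as assumptions to check against the linked pull request and the documentation for your installed version.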
Thank you for your detailed answer! Although I have tried your suggestion, I still encounter the same issue.
Could you share your full code with the suggestion you tried? That makes communication a bit easier.
Hi there Maarten!
Thank you for all of the support you are offering the community!
I am running into the problem that my docs and topics have different lengths after I ran the model. I do not have the possibility to run the model again, but because of this issue, I cannot plot the topics over time or merge them.
Would it be possible to suggest how to proceed from here?
Thank you in advance!