MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Documents and Topics are different lengths and cannot merge the topics #1626

Open daianacric95 opened 10 months ago

daianacric95 commented 10 months ago

Hi there Maarten!

Thank you for all of the support you are offering the community!

I am running into a problem where my docs and topics have different lengths after running the model. I am not able to run the model again, and because of this mismatch I cannot plot the topics over time or merge them.

Would it be possible to suggest how to proceed from here?

[Screenshot attached: 2023-11-10 at 14 35 04]

Thank you in advance!

MaartenGr commented 10 months ago

Did you make sure that the number of documents in text_col is the same as the number of dates in date? The only way to run .topics_over_time is by making sure that both the documents and the dates are the same size. I am assuming you have the raw data saved, so why not use that as the input for topics_over_time?
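
For reference, a rough sketch of that call, assuming topic_model is the fitted model and text_col / date hold the raw documents and their dates, both the same length:

```python
# One date per document; text_col and date must have the same length
topics_over_time = topic_model.topics_over_time(text_col, date)
fig = topic_model.visualize_topics_over_time(topics_over_time)
```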

daianacric95 commented 10 months ago

Hi there, Maarten! Thank you so much for taking the time to answer my question. I realized that I made a mistake during the batch processing: I had not updated the topics as you suggest in the Online Topic Modeling section. Once I fixed that, the topics and the documents have the same length and I could plot the hierarchical topics. The problem I am encountering now is when I try to merge topics with the following code:

```python
topics_to_merge = [1, 2]
topic_model.merge_topics(docs, topics_to_merge)
```

I am getting a KeyError and I am not able to merge the topics. I am attaching the batch-processing code for further context:

```python
indices = np.arange(len(text_column))
np.random.shuffle(indices)
text_column = [text_column[i] for i in indices]

chunk_size = 10000
text_chunks = [text_column[i:i + chunk_size] for i in range(0, len(text_column), chunk_size)]
topics = []

for i in tqdm(range(len(text_chunks)), desc="Processing chunks"):
    text_chunk = text_chunks[i]
    topics_chunk = model.fit_transform(text_chunk)
    topics.extend(topics_chunk)

topic_model.topics_ = topics
```
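As a side note, fit_transform returns a (topics, probabilities) tuple, so collecting per-document topic ids usually means unpacking it first. A minimal sketch using the same names as above:

```python
for text_chunk in text_chunks:
    # fit_transform returns (topics, probabilities); keep only the topic ids
    topics_chunk, _ = model.fit_transform(text_chunk)
    topics.extend(topics_chunk)
```
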
MaartenGr commented 10 months ago

Could you format your code with ```python so that it is easier to read? Also, could you share your full code? Moreover, could you share the full error that you received? Without it, it is difficult to understand where the issue lies.

daianacric95 commented 10 months ago

Hello Maarten! Thank you for being so patient. Here is the full code I have used and the error I am getting when trying to merge the topics:

```python
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

umap_model = UMAP()
hdbscan_model = HDBSCAN()
vectorizer_model = CountVectorizer(stop_words="english")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    top_n_words=10,
    language='english',
    verbose=True
)

def process_batch(data):
    topics, _ = model.fit_transform(data)
    return topics

batch_size = 1000
batches = [text_col[i:i + batch_size] for i in range(0, len(text_col), batch_size)]

# Process each batch
all_topics = []
for batch in batches:
    batch_topics = process_batch(batch)
    all_topics.extend(batch_topics)

## Merge topics

topics_to_merge = [2, 3]
topic_model.merge_topics(text_col, topics_to_merge)
```
[Screenshot attached: 2023-11-18 at 12 55 00]
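
For reference, the documented calling pattern for merge_topics looks roughly like this, where model stands for whichever fitted BERTopic instance produced the topics and text_col for the documents it was fitted on:

```python
# Merge topics 2 and 3 into a single topic; pass the same documents used to fit the model
topics_to_merge = [2, 3]
model.merge_topics(text_col, topics_to_merge)
```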

Additionally, I have tried to use the River package from the Online Topic Modeling tutorial, but I keep getting the same error every time (I tried installing it locally on my Mac machine and also on Google Colab):

[Screenshot attached: 2023-11-18 at 11 59 56]

daianacric95 commented 10 months ago

Update: I managed to make River work (needed to install an older version) but I am once again encountering the same issue:

```python
from river import stream
from river import cluster

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer, OnlineCountVectorizer

class River:
    def __init__(self, model):
        self.model = model

    def partial_fit(self, umap_embeddings):
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            self.model = self.model.learn_one(umap_embedding)

        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            label = self.model.predict_one(umap_embedding)
            labels.append(label)

        self.labels_ = labels
        return self

# Using DBSTREAM to detect new topics as they come in
cluster_model = River(cluster.DBSTREAM())
vectorizer_model = OnlineCountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True)

# Prepare model
topic_model_v2 = BERTopic(
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)

batch_size = 1000
batches = [text_col[i:i + batch_size] for i in range(0, len(text_col), batch_size)]
all_my_topics = []
# Incrementally fit the topic model by training on 1000 documents at a time
for batch in batches:
    topic_model_v2.partial_fit(batch)
```
[Screenshot attached: 2023-11-18 at 22 25 54]

Additionally, the results of partial_fit are substantially different from those of fit_transform(): with fit_transform() I get 200 topics, while with partial_fit I only get 18.

MaartenGr commented 10 months ago

In your code:

```python
batch_size = 1000
batches = [text_col[i:i + batch_size] for i in range(0, len(text_col), batch_size)]
all_my_topics = []
# Incrementally fit the topic model by training on 1000 documents at a time
for batch in batches:
    topic_model_v2.partial_fit(batch)
```

You are not updating the internal topic_model_v2.topics_, which still needs to be done. So something like this:

```python
batch_size = 1000
batches = [text_col[i:i + batch_size] for i in range(0, len(text_col), batch_size)]
all_my_topics = []
# Incrementally fit the topic model by training on 1000 documents at a time
for batch in batches:
    topic_model_v2.partial_fit(batch)
    all_my_topics.extend(topic_model_v2.topics_)

topic_model_v2.topics_ = all_my_topics
```

> Additionally, the results of partial_fit are substantially different from those of fit_transform(): with fit_transform() I get 200 topics, while with partial_fit I only get 18.

This is to be expected since you are using two completely different clustering models. In the former, you are using HDBSCAN, whereas the latter uses DBSTREAM. Moreover, the training procedures are different: in the former you are training UMAP on the entire dataset, whilst in the latter you are training UMAP only on the very first batch.

Instead, your use case might be a good fit for the newly introduced .merge_models. This method allows different topic models to be merged together. When you combine two models with it, the first model remains as it is and the second model is only added insofar as it contains new clusters. Existing clusters will not be added since those were already found in the first model.

You can do this continuously and keep on merging models this way every time you train a new model. This means that you can use this method for incremental learning by iteratively training and merging models. You can find more about that at https://github.com/MaartenGr/BERTopic/pull/1516.
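
A rough sketch of that iterative pattern (variable names here are illustrative, not from the thread):

```python
from bertopic import BERTopic

# Fit a model on the first batch, then keep folding in models trained on later batches;
# merge_models keeps the first model and only adds clusters it has not seen before
merged_model = BERTopic(vectorizer_model=vectorizer_model).fit(batches[0])
for batch in batches[1:]:
    new_model = BERTopic(vectorizer_model=vectorizer_model).fit(batch)
    merged_model = BERTopic.merge_models([merged_model, new_model])
```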

daianacric95 commented 9 months ago

Thank you for your detailed answer! I have tried your suggestion, but I still encounter the same issue.

MaartenGr commented 9 months ago

Could you share your full code with the suggestion you tried? That makes communication a bit easier.