MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Online Modeling - Topic Representation Loss and Mapping Confusion #946

Closed emarsc closed 1 year ago

emarsc commented 1 year ago

I am having a couple of issues with online topic modeling. I have read all of the relevant documentation (I think), but I am still unsure if what I am experiencing is a bug, or if what I am trying to do is simply not supported. Seeking some clarification.

Use case

My use case involves continuously training a model on new data (always running, millions of documents a day).

  1. I have one process running that ingests data, runs partial fit with new documents, and updates all document topic ids and associated tokens on an external system. I am running partial fit with batches of 200,000 documents every hour. After training and updating the documents, the model is saved to disk.

The model is initialized as:

options = {
     "calculate_probabilities": False,
     "verbose": True,
     "language": "english",
     "hdbscan_model": MiniBatchKMeans(n_clusters=50000, random_state=0),
     "umap_model": IncrementalPCA(n_components=20),
     "vectorizer_model": OnlineCountVectorizer(stop_words="english", decay=.01)
 }
model = BERTopic(**options)
model.save(FILE_PATH, save_embedding_model=False)  
...

Fitting:

def fit(docs):
    model = BERTopic.load(FILE_PATH, embedding_model=select_backend(SentenceTransformer("all-MiniLM-L6-v2")))
    model.partial_fit(docs)
    predictions = model.topics_
    # ...
    model.save(FILE_PATH, save_embedding_model=False)

  2. I have a second process that continuously refreshes the model from disk and uses it to find topics related to search terms.
...
model = BERTopic.load(FILE_PATH, embedding_model=select_backend(SentenceTransformer("all-MiniLM-L6-v2")))
...
prediction = model.find_topics(search_term, top_n=10)

Issue

I have two main issues / points of confusion.

  1. The topics returned from find_topics and classified in partial_fit are often inconsistent between training batches. In other words, the same search term or document will be classified differently after partial_fit is called again. I expect that the topics will be updated and documents will be classified differently from one batch to the next (as that is the point), but it seems that they are sometimes wildly different. For example, the document "The economy" will be classified as topic x in batch 1. It is accurately clustered with other documents that also relate to the economy. However, somewhere along the way in batch 1+n, topic x will no longer have anything to do with "the economy". I would understand a gradual change, but I observe that some topics are getting abruptly re-assigned to something that does not have any similarity to the original topic after every training run.

Here is an example:

predictions = model.find_topics("the economy", top_n=10)
predictions
([46208, 47222, 47900, 22422, 32983, 33172, 3830, 44303, 22733, 47659],
 [0.6486664569094811, 0.6419729018417841, 0.6387690526032911, 0.6194511282750119, 0.6178553942750709, 0.6138108302907386, 0.6036604607508025, 0.6020307796162376, 0.5989599348611652, 0.5979453314862377])

model.partial_fit(samples)

predictions2 = model.find_topics("the economy", top_n=10)
predictions2
([22679, 41496, 42992, 15108, 42784, 47419, 3117, 22091, 25306, 42033],
 [0.2788958219845341, 0.2788958219845341, 0.2788958219845341, 0.2788958219845341, 0.2788958219845341, 0.2788958219845341, 0.2788958219845341, 0.2788958219845341, 0.2788958219845341, 0.2788958219845341])

I have tried to read the code and documentation on TopicMapper. It seems this is how the shift in topics should be mapped? I am not having much luck figuring out how to leverage that data structure; the changed topics don't seem to map to anything.

  2. My second issue is that it seems that topic sizes are maintained from one batch to the next, but the rest of their representation is not. The tokens (words) associated with topics only seem to include the vocabulary from the most recent training run. I assume that the cluster centroids are maintained (have not confirmed this), but I would also expect that the vocab is incrementally updated. From the documentation, I understand that certain features require that the topics_ attribute be maintained and updated after each training run, but I am unsure if it relates to this specific issue or not. Is there a way to maintain the topic info (label, tokens) from one batch to the next?

...

Thank you for taking the time. Clarification on whether or not what I am experiencing is expected would be greatly appreciated.

Also, this package rocks. Great work.

MaartenGr commented 1 year ago

Thank you for your kind words and the extensive description!

The topics returned from find_topics and classified in partial_fit are often inconsistent between training batches. In other words, the same search term or document will be classified differently after partial_fit is called again. I expect that the topics will be updated and documents will be classified differently from one batch to the next (as that is the point), but it seems that they are sometimes wildly different. For example, the document "The economy" will be classified as topic x in batch 1. It is accurately clustered with other documents that also relate to the economy. However, somewhere along the way in batch 1+n, topic x will no longer have anything to do with "the economy". I would understand a gradual change, but I observe that some topics are getting abruptly re-assigned to something that does not have any similarity to the original topic after every training run.

What is happening here is that .find_topics is a very different method for assigning topics than .transform. More specifically, .find_topics is merely meant as a way to find topics when you have generated a couple of hundred of them. For example, if you want to find all topics that are semantically similar to health, you can just use .find_topics("health") to find them. However, if you actually want to assign topics to documents, it is not advised to use this method. Instead, I would advise using .transform on your documents, as it follows nearly the same process as fitting them. In other words, the pipeline would then be:

# We perform a partial fit on our documents. 
model.partial_fit(docs)
predictions = model.topics_

# After loading the model, we can assign topics as follows:
predictions = model.transform(unseen_docs)
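
As a side note, a minimal sketch of how the two calls differ (the query and document here are just illustrative): .find_topics returns topic ids ranked by similarity to a search term, while .transform assigns a topic to each document.

# Searching: which existing topics are semantically closest to a query?
topic_ids, similarities = model.find_topics("the economy", top_n=10)

# Assigning: which topic does each document belong to?
topics, probs = model.transform(["The economy is slowing down."])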

My second issue is that it seems that topic sizes are maintained from one batch to the next, but the rest of their representation is not. The tokens (words) associated with topics only seem to include the vocabulary from the most recent training run. I assume that the cluster centroids are maintained (have not confirmed this), but I would also expect that the vocab is incrementally updated. From the documentation, I understand that certain features require that the topics_ attribute be maintained and updated after each training run, but I am unsure if it relates to this specific issue or not. Is there a way to maintain the topic info (label, tokens) from one batch to the next?

You are using the decay=.01 parameter in the OnlineCountVectorizer. This means that if certain words have not been used in the most recent .partial_fit batches, then those words will be removed from the topic representation and the Bag-of-Words. In order to maintain those words, you can either not set the decay parameter or you can lower its value such that it will take a much longer time for those words to decay.
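
As an illustration, a minimal sketch (not taken from your code above) of the two options on the vectorizer side: leaving decay unset keeps the accumulated bag-of-words intact, while a smaller value slows the forgetting.

from bertopic.vectorizers import OnlineCountVectorizer

# Option 1: no decay, so the accumulated bag-of-words (and vocabulary) is kept across batches.
vectorizer_no_decay = OnlineCountVectorizer(stop_words="english")

# Option 2: a much smaller decay, so old word counts shrink by 0.1% per batch instead of 1%.
vectorizer_slow_decay = OnlineCountVectorizer(stop_words="english", decay=0.001)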

emarsc commented 1 year ago

What is happening here is that .find_topics is a very different method for assigning topics than .transform. More specifically, .find_topics is merely meant as a way to find topics when you have generated a couple of hundred of them. For example, if you want to find all topics that are semantically similar to health, you can just use .find_topics("health") to find them. However, if you actually want to assign topics to documents, it is not advised to use this method. Instead, I would advise using .transform on your documents, as it follows nearly the same process as fitting them.

For my use case, we can assume that I have infinite documents coming in. So, I thought the best way to approach this would be to classify them and fit the model at the same time. In that sense, there are no "unseen docs". With other variations, this approach has been working quite well for me. I am using "find_topics" to satisfy search queries, not to classify.

I was more specifically wondering how topics change from one batch to the next. In successive runs of partial_fit with new documents, how do the previously created topics change? I have observed that sometimes they stay the same, and sometimes they change to be very different. In other words, can I expect that documents classified in topic 1 in partial_fit run 1 will be similar to documents classified in topic 1 in partial_fit run 10?

You are using the decay=.01 parameter in the OnlineCountVectorizer. This means that if certain words have not been used in the most recent .partial_fit batches, then those words will be removed from the topic representation and the Bag-of-Words. In order to maintain those words, you can either not set the decay parameter or you can lower its value such that it will take a much longer time for those words to decay.

Ah. Thank you! I will look more closely at OnlineCountVectorizer. Do you see any reason that I shouldn't set it to 0?

MaartenGr commented 1 year ago

In other words, can I expect that documents classified in topic 1 in partial_fit run 1 will be similar to documents classified in topic 1 in partial_fit run 10?

No, that actually goes against the "online" part of online machine learning. The idea here is that when you perform batches of training the model learns more and more as it gets more and more information. As such, what it has learned in batch 1 might not be relevant anymore in batch 10. Moreover, what it learned in batch 1 might actually be incorrect since the model only has limited data and by batch 10 it knows the correct representation.

Ah. Thank you! I will look more closely at OnlineCountVectorizer. Do you see any reason that I shouldn't set it to 0?

The idea with online machine learning is often that you want to classify the most current information and that information from years ago might be less relevant. Having some sort of decay factor helps put an emphasis on current data.

vantubbe commented 1 year ago

@MaartenGr The options confuse me a bit. MiniBatchKMeans is used for clustering. For online learning shouldn't one use river? I was under the assumption that MiniBatchKMeans wouldn't work well for continuous learning. Am I missing something here?

MaartenGr commented 1 year ago

@vantubbe Although MiniBatchKMeans is a clustering algorithm, it takes in batches of data, which allows for online learning of the clusters. River also provides clustering algorithms, but ones optimized for online use cases.
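
For intuition, a minimal sketch with toy numbers (not from BERTopic itself) of how MiniBatchKMeans is updated batch by batch, which is the property the online pipeline relies on:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=5, random_state=0)

# Each call to partial_fit updates the existing cluster centers with a new batch
# instead of refitting from scratch.
for _ in range(3):
    batch = np.random.rand(1000, 20)  # stand-in for reduced embeddings
    km.partial_fit(batch)

labels = km.predict(np.random.rand(10, 20))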

vantubbe commented 1 year ago

That makes sense, thank you for the clarification! Could you please clarify one additional thing: does a higher decay mean preexisting words decay faster?

From your documentation on the OnlineCountVectorizer:

For example, a value of .1 will decrease the frequencies in the bag-of-words matrix by 10% at each iteration before adding the new bag-of-words matrix.

So if decay is set to 0.01, that means the frequency will be decreased by only 1% each iteration. However, your above comment seems to imply that it could be decaying too quickly.

You are using the decay=.01 parameter in the OnlineCountVectorizer. This means that if certain words have not been used in the most recent .partial_fit batches, then those words will be removed from the topic representation and the Bag-of-Words. In order to maintain those words, you can either not set the decay parameter or you can lower its value such that it will take a much longer time for those words to decay.

I would assume that 0.01 implies a fairly slow decay, although I also assume the frequency at which one iterates training batches needs to be considered too.

Thank you again for the help!

MaartenGr commented 1 year ago

So if decay is set to 0.01, that means the frequency will be decreased by only 1% each iteration. However, your above comment seems to imply that it could be decaying too quickly.

I would assume that 0.01 implies a fairly slow decay, although I also assume the frequency at which one iterates training batches needs to be considered too.

The impact of decay not only depends on its value but also on the number of batches you run. For example, if you have a decay of 10% and you run only 2 batches, then the impact will not be as big. However, if you have a decay of .01 and a million batches, then that quickly adds up!
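
To make that concrete, a quick back-of-the-envelope check (a sketch, assuming the documented behaviour that each .partial_fit batch multiplies the old counts by 1 - decay):

# A word that appears once and then never again: its stored count after n batches
# shrinks to (1 - decay) ** n of its original value.
decay = 0.01
for n in (10, 100, 500, 1000):
    print(n, round((1 - decay) ** n, 4))
# decay=0.01 -> 0.9044 after 10 batches, 0.366 after 100, 0.0066 after 500, ~0 after 1000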

emarsc commented 1 year ago

@MaartenGr

I started experimenting with the river approach you gave as an example in the docs. This isn't necessarily relevant to the original issue here, but I figured it may be helpful to point out a slightly improved custom River class (at least for my use case).

I found the quality to be high with DBSTREAM. However, the processing time gets a bit out of control for large data sets. river.cluster.DBSTREAM calls self._recluster on every pass of predict_one. I found success circumventing this and implementing the "partial_fit" function a bit differently than the example in the docs.

import math

from river import cluster, stream


class River:
    def __init__(self, model):
        self.model = model

    def custom_predict(self, x):
        # Code copied from https://github.com/online-ml/river/blob/9ade1fef587f704715e987bb96a06d5c9005c1ab/river/cluster/dbstream.py#L380-L391
        # (nearest-center lookup without the _recluster call that predict_one triggers)
        min_distance = math.inf
        closest_cluster_index = 0
        for i, center_i in self.model.centers.items():
            distance = self.model._distance(center_i, x)
            if distance < min_distance:
                min_distance = distance
                closest_cluster_index = i
        return closest_cluster_index

    def partial_fit(self, umap_embeddings):
        # Learn from the whole batch first...
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            self.model = self.model.learn_one(umap_embedding)

        # ...then recluster once and assign labels with the cheap nearest-center lookup.
        self.model._recluster()
        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            label = self.custom_predict(umap_embedding)
            labels.append(label)
        self.labels_ = labels
        return self


cluster_model = River(cluster.DBSTREAM())

... The speed up is huge. I think we can close this issue. However, if you can see any problems with the way I have done this, let me know!

... Looks like this is being actively discussed on the river repo https://github.com/online-ml/river/issues/1086

vantubbe commented 1 year ago

@emarsc Great feedback and thank you for including the above code.

When using River, would you mind sharing which dimensionality reduction you chose to use in your Bertopic model?

I originally used UMAP but that does not support incremental learning and I assume is not a great fit for use with River. When I switched to IncrementalPCA(n_components=5) the clustering results were very poor. Increasing n_components to 150 gave somewhat better results but obviously very slow. I'm curious on how you setup the different pieces of your Bertopic model. Thanks in advance!

emarsc commented 1 year ago

@emarsc Great feedback and thank you for including the above code.

When using River, would you mind sharing which dimensionality reduction you chose to use in your Bertopic model?

I originally used UMAP but that does not support incremental learning and I assume is not a great fit for use with River. When I switched to IncrementalPCA(n_components=5) the clustering results were very poor. Increasing n_components to 150 gave somewhat better results but obviously very slow. I'm curious on how you setup the different pieces of your Bertopic model. Thanks in advance!

@vantubbe I am working through the same issue! As expected, I think it's going to take some time to figure out, and it will depend on your use case.

I am using IncrementalPCA along with River's DBSTREAM. I have been unable to find a more suitable dimensionality reduction algorithm. Incremental UMAP might be possible and is being discussed, but it doesn't seem to be implemented yet.

For my data, with 5 components, the default DBSTREAM configuration was unable to find any meaningful clusters (if any clusters at all).

I changed the DBSTREAM clustering_threshold parameter from 1 (default) to 0.5 and I am getting meaningful clusters. I have also increased the components to anywhere from 10 to 25 and this seems to help as well. In combination, this seems to be satisfactory for me, but I have not fully validated it yet.

I am going to be looking a bit further into the DBSTREAM algorithm and parameters. I will let you know if I find more success!

I do not recall where, but I think I read something from @MaartenGr suggesting you could train UMAP on a sufficiently large data set and still use it with online clustering if you don't expect your data to change much. It all probably depends on your use case.
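
For reference, a rough sketch of the configuration I am describing, reusing the custom River class from my earlier comment (these parameter values are the ones I am experimenting with and have not fully validated):

from river import cluster
from sklearn.decomposition import IncrementalPCA
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

umap_model = IncrementalPCA(n_components=20)                       # 10-25 components seems to help
cluster_model = River(cluster.DBSTREAM(clustering_threshold=0.5))  # the default of 1 found no meaningful clusters
vectorizer_model = OnlineCountVectorizer(stop_words="english")

model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
    verbose=True,
)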

vantubbe commented 1 year ago

@emarsc Good deal, thanks for the details! I was also looking for incremental UMAP and found AlignedUMAP. It's an implementation of UMAP intended for temporal/online learning. Unfortunately, I have no idea how to use it; it requires a relations parameter. From the docs I gather it trains on overlapping "slices", and I think relations is looking for a mapping between two consecutive slices' overlaps (per iteration). It goes right over my head. Maybe you'll have better luck.

I changed the DBSTREAM clustering_threshold parameter from 1 (default) to 0.5

I did the same! I had to set clustering_threshold and intersection_factor much lower when using IPCA (vs UMAP). Otherwise I would not get meaningful clusters.

Thanks again, excited to hear about future breakthroughs!

mdcox commented 1 year ago

@emarsc I was reviewing this issue and I have a similar use case: I am looking to initially model a dataset, say ~10k to 100k docs, and then, using the River package, update the saved topic model twice per month with another ~10-20k docs.

I have been testing the River functionality for this and have tried to implement your River speed-up (https://github.com/MaartenGr/BERTopic/issues/946#issuecomment-1428896166), but regardless of the custom River class I use, the clustering never finishes. I have tested it with ~2k docs and it appears to work, but once I expand to ~30k docs, the modeling doesn't complete even after 18+ hours. Typically, this would take ~30 to 90 min with normal incremental modeling (90 min when using closer to 100k docs).

Could there be something I am missing for the speed-up, or is there something else you have found that helps? I appreciate any suggestions or new ideas! As a side note, I am using GPU acceleration to improve speed.

For reference, here is my current River class & modeling code:

import math

from river import cluster, stream
from sklearn.decomposition import IncrementalPCA
from sklearn.feature_extraction import text
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer


class River:
    def __init__(self, model):
        self.model = model

    def predict(self, x):
        # code copied from https://github.com/online-ml/river/blob/9ade1fef587f704715e987bb96a06d5c9005c1ab/river/cluster/dbstream.py#L380-L391
        min_distance = math.inf
        closest_cluster_index = 0
        for i, center_i in self.model.centers.items():
            distance = self.model._distance(center_i, x)
            if distance < min_distance:
                min_distance = distance
                closest_cluster_index = i
        return closest_cluster_index

    def partial_fit(self, umap_embeddings):
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            self.model = self.model.learn_one(umap_embedding)

        self.model._recluster()
        labels = []
        for umap_embedding, _ in stream.iter_array(umap_embeddings):
            label = self.predict(umap_embedding)
            labels.append(label)
        self.labels_ = labels
        return self


# Supporting code...
def get_vectorizer():
    # Stop words. See https://github.com/MaartenGr/BERTopic/issues/181
    return OnlineCountVectorizer(stop_words=text.ENGLISH_STOP_WORDS, decay=.01)


umap_model = IncrementalPCA(n_components=20)
cluster_model = River(cluster.DBSTREAM(
    clustering_threshold=0.1, intersection_factor=0.25))
model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    vectorizer_model=get_vectorizer(),
    top_n_words=topic_num_words,   # Default is 10
    verbose=True,
)

MaartenGr commented 1 year ago

@mdcox It might be worthwhile to check where the model slows down. You have several lines of code in the .partial_fit function of your River class that each do something different; identifying where it slows down is the first step.
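
For example (a sketch, not part of BERTopic), you could time the two loops and the _recluster call separately inside partial_fit:

import time

def partial_fit(self, umap_embeddings):
    t0 = time.perf_counter()
    for umap_embedding, _ in stream.iter_array(umap_embeddings):
        self.model = self.model.learn_one(umap_embedding)
    t1 = time.perf_counter()

    self.model._recluster()
    t2 = time.perf_counter()

    labels = [self.predict(e) for e, _ in stream.iter_array(umap_embeddings)]
    t3 = time.perf_counter()

    print(f"learn_one loop: {t1 - t0:.1f}s, _recluster: {t2 - t1:.1f}s, predict loop: {t3 - t2:.1f}s")
    self.labels_ = labels
    return self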

mdcox commented 1 year ago

@MaartenGr good note (back to the basics!) thank you for the suggestion!

I have added tqdm to the code to track the timing of each part. The estimated tqdm completion time looks to jump around a lot. I included two screenshots below as an example: one second it is ~1.5 hours, and then it jumps to ~5 hours. The estimates also appear to be slowly getting longer, i.e. by 10% completion the minimum estimate doesn't dip below ~3 hours, whereas before it was ~1 hour, and the top end of the estimate grows to ~10 hours, whereas before it was ~5 hours.

This is all in the first loop of the partial_fit function being executed in the River class (below).

for umap_embedding, _ in tqdm(iterable_umap_embeddings, total=len(iterable_umap_embeddings)):
    self.model = self.model.learn_one(umap_embedding)

@MaartenGr have you seen slowdowns in the partial_fit like this?

[screenshots: tqdm progress bars showing fluctuating completion-time estimates]

MaartenGr commented 1 year ago

@mdcox I am not entirely sure but it might be related to https://github.com/online-ml/river/issues/1086. From my side, I do not think there is much to do aside from using a different clustering algorithm. Since it seems to be algorithm-specific, it might be worthwhile to ask the maintainer of the River package for help.

mdcox commented 1 year ago

Sounds good! Thank you for the help!