MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Huge model with save() and large dataset #383

Closed yotammarton closed 2 years ago

yotammarton commented 2 years ago

Hello, when saving a BERTopic model trained on 3.5M texts, the resulting file is roughly 25GB. I have memory (RAM) concerns when running inference with the model (*).

I am trying to lower the model size by (possibly) loading it, manipulating its inner self variables (maybe setting some of them to None?) and saving it again, hoping for a smaller model.

  1. I don't actually use the c-TF-IDF anywhere in my application. Is it possible to delete self.c_tf_idf, for example?
  2. Which large variables and matrices can I delete (and how to do so correctly: set them to None or del self.x?) in order to reduce the model size? This includes HDBSCAN and UMAP, if you know. How complicated is that?
  3. Regarding Q1, a somewhat unrelated question: how is BERTopic able to classify a sentence consisting only of out-of-vocabulary words to the expected cluster (call it c)? For example, with the all-MiniLM-L6-v2 embedding model, the sentence "mothersday" is classified to the expected cluster c, even though "mothersday" is not part of the vocabulary (but it is a top-10 c-TF-IDF word in c).

(*) I am creating an inference server that spins up replicas of the model to reduce inference time, and I want as many replicas as possible. For 2 replicas I need 2×25GB of RAM.

Thank you :)

MaartenGr commented 2 years ago

In general, I would not recommend removing any of the self attributes, as they all have a specific purpose and are used throughout the model. Do this only if you are absolutely sure that it does not interfere with any of the model's functionality.

Identifying why the model is large in the first place is key and is something you should do before looking into deleting certain variables. There are a number of sub-models that can potentially be quite large (the embedding model, UMAP, HDBSCAN, and the vectorizer/c-TF-IDF matrices, among others).

Example

To give you an idea of the effect of certain models:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)
topic_model.save("full_model")
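
A quick way to compare the effect of each variant on disk is to check the size of the saved file. A small helper sketch, assuming local file paths as in the example above:

import os

def file_size_mb(path):
    # Size of a saved BERTopic file on disk, in megabytes.
    return os.path.getsize(path) / (1024 ** 2)

print(f"full_model: {file_size_mb('full_model'):.1f} MB")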

Then, you can remove certain models from BERTopic to check its effect on the model size:

umap_model = topic_model.umap_model
topic_model.umap_model = None
topic_model.save("without_umap")

In the example above, the UMAP model and the embedding model are the largest (roughly 100MB and 50MB, respectively). I would advise starting with UMAP. From what I remember, it keeps track of the raw input data because it is needed for inference, so the resulting model can be quite large. Also, you do not have to save the embedding model with each topic model, as it is exactly the same every time: topic_model.save("full_model", save_embedding_model=False).
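If the embedding model is not stored alongside the topic model, it can be passed again at load time. A minimal sketch, assuming the load call accepts the embedding_model argument and the same model name as above:

from bertopic import BERTopic

# Re-attach the (shared) sentence-transformer by name when loading a model
# that was saved with save_embedding_model=False.
topic_model = BERTopic.load("full_model", embedding_model="all-MiniLM-L6-v2")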

With respect to your last question, it depends on the embedding model you use. These models are trained on a large amount of data and can often handle out-of-vocabulary words, depending on how they were trained. The resulting embeddings should be similar to those of synonyms, which are then clustered and identified as belonging to a certain cluster.
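To illustrate (a hedged aside, not part of the original answer): sentence-transformer models such as all-MiniLM-L6-v2 use subword tokenization, so an unseen word is split into known pieces rather than dropped, and still receives a meaningful embedding.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# The out-of-vocabulary word is broken into known subword tokens
# (the exact pieces may differ depending on the tokenizer).
print(model.tokenizer.tokenize("mothersday"))
embedding = model.encode("mothersday")
print(embedding.shape)  # (384,) for this model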

yotammarton commented 2 years ago

Thank you for your answer @MaartenGr

I wish to update that I made some small progress following your guidelines, but I haven't squeezed everything out of it yet. In the meantime I save the topic_model as is with no changes, load it, and then delete some unnecessary attributes that are not being used during inference.

# Delete attributes from topic model to make it consume less RAM
del topic_model.umap_model.graph_
del topic_model.umap_model._knn_search_index._raw_data
del topic_model.c_tf_idf
del topic_model.umap_model._raw_data
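
A possible follow-up (an assumption on my side, not something described in this thread): re-save the stripped model so the deletions persist across loads.

# Persist the stripped model; skipping the embedding model keeps the file
# smaller, since it is shared across replicas anyway (see save_embedding_model
# mentioned earlier in this thread).
topic_model.save("stripped_model", save_embedding_model=False)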

This kind of usage suits me because I load replicas of the same model in an inference server and only care about the total amount of RAM consumed by all of them together (reducing the size of the model allows me to create more replicas and thus serve more inference requests).

To be able to run inference (i.e. topic_model.transform) after deleting the above, I needed to modify the umap source code as shown in my fork: https://github.com/ymartin-mw/umap

MaartenGr commented 2 years ago

Glad to hear that the steps taken above helped you run the model properly in your use case. Do you know how much RAM was reduced using the above method? Quite curious to see what the effect is of diving into class variables.

yotammarton commented 2 years ago

RAM usage (see the measurement sketch after the list):

  1. My server (not running anything significant) - 431MB
  2. Loading the normally-saved model - 22.3GB (although RAM always spikes to about 30GB first and then drops).
  3. Running the 4 del commands from my previous comment - 17.0GB
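
A minimal sketch of how such numbers can be measured (psutil is an assumption on my side; any process-level RSS measurement works):

import os
import psutil

# Resident set size of the current process, in GB.
rss_gb = psutil.Process(os.getpid()).memory_info().rss / (1024 ** 3)
print(f"RAM in use: {rss_gb:.1f} GB")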

simonfelding commented 2 years ago

Try compressing and quantizing the model with TensorFlow Lite. It should speed it up and reduce the size significantly.

ymartin-mw commented 2 years ago

@simonfelding Interesting, so you suggest this should also work for a BERTopic model? https://www.tensorflow.org/model_optimization/guide/quantization/post_training

Any recommendation or code snippet for BERTopic?

simonfelding commented 2 years ago

I thought so, but it doesn't make sense: BERTopic itself is not a TensorFlow-based model, it just uses the embeddings :) How about just using more swap space on your server? You can use zram if you want compressed RAM.

Alternatively, if you really want to strip down BERTopic, all you need is the UMAP model and the HDBSCAN model (from _bertopic.py):

    def transform(self,
                  documents: Union[str, List[str]],
                  embeddings: np.ndarray = None) -> Tuple[List[int], np.ndarray]:
        """ After having fit a model, use transform to predict new instances
        Arguments:
            documents: A single document or a list of documents to fit on
            embeddings: Pre-trained document embeddings. These can be used
                        instead of the sentence-transformer model.
        Returns:
            predictions: Topic predictions for each documents
            probabilities: The topic probability distribution which is returned by default.
                           If `calculate_probabilities` in BERTopic is set to False, then the
                           probabilities are not calculated to speed up computation and
                           decrease memory usage.
        Usage:
        ### abbreviated ###
        """

        check_is_fitted(self)
        check_embeddings_shape(embeddings, documents)

        if isinstance(documents, str):
            documents = [documents]

        if embeddings is None:
            embeddings = self._extract_embeddings(documents,
                                                  method="document",
                                                  verbose=self.verbose)

        umap_embeddings = self.umap_model.transform(embeddings)
        logger.info("Reduced dimensionality with UMAP")

        predictions, probabilities = hdbscan.approximate_predict(self.hdbscan_model, umap_embeddings)
        logger.info("Predicted clusters with HDBSCAN")

        if self.calculate_probabilities:
            probabilities = hdbscan.membership_vector(self.hdbscan_model, umap_embeddings)
            logger.info("Calculated probabilities with HDBSCAN")
        else:
            probabilities = None

        probabilities = self._map_probabilities(probabilities, original_topics=True)
        predictions = self._map_predictions(predictions)
        return predictions, probabilities
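
A minimal sketch of that idea, keeping only the two sub-models used in the excerpt above (the embedding step is assumed to happen elsewhere, and unlike BERTopic this skips the topic-id mapping step, so the output is raw HDBSCAN labels):

import hdbscan

def slim_transform(umap_model, hdbscan_model, embeddings):
    # Reduce dimensionality with the fitted UMAP model, then assign clusters
    # with HDBSCAN's approximate_predict, mirroring the excerpt above.
    reduced = umap_model.transform(embeddings)
    predictions, probabilities = hdbscan.approximate_predict(hdbscan_model, reduced)
    return predictions, probabilities

# e.g. slim_transform(topic_model.umap_model, topic_model.hdbscan_model, embeddings)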

So if you want to really simplify the whole thing, you could turn it into a scikit-learn pipeline and precompute the k-NN to speed it up and lower memory requirements. It's the k-NN calculation that causes the RAM spikes, as far as I understand.

I see that UMAP also stores its matrix as a numpy.int64 array. You could reduce the size of the array and save a lot of space. You could also try quantizing the vectors in the array, as in the TensorFlow Lite example. And you could try using blz to compress arrays in memory.
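A hedged sketch of the downcasting idea (verify the actual dtype first and validate predictions afterwards; whether this is safe depends on the model):

import numpy as np

# Check what UMAP actually stores, then downcast if it is wider than needed.
print(topic_model.umap_model._raw_data.dtype)
topic_model.umap_model._raw_data = topic_model.umap_model._raw_data.astype(np.float32)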

You can also just make your model significantly smaller by reducing UMAP dimensionality and number of topics.
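
For example, both can be set when fitting (the specific values below are illustrative placeholders, not recommendations):

from umap import UMAP
from bertopic import BERTopic

# A lower-dimensional UMAP output and a capped number of topics both shrink
# the fitted model.
umap_model = UMAP(n_neighbors=15, n_components=3, min_dist=0.0, metric="cosine")
topic_model = BERTopic(umap_model=umap_model, nr_topics=50)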

And finally, you could try replacing HDBSCAN entirely for your predictions with, for example, an SVM; it's usually pretty fast. You already have your model, generated with BERTopic: a vector matrix generated by UMAP (the data) and a clustering generated by HDBSCAN (the labels). Split this into train and test sets and train an SVM that can classify a new vector without HDBSCAN.

It could work well and save you some memory by relying only on UMAP and an SVM for prediction; a rough sketch of this idea follows below.
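
A rough sketch (the attribute names are the standard fitted-model attributes, umap's embedding_ and hdbscan's labels_; the classifier choice and the split are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Reduced embeddings produced during fitting and the HDBSCAN cluster labels
# (note: -1 marks outliers, and these are raw labels, not mapped topic ids).
X = topic_model.umap_model.embedding_
y = topic_model.hdbscan_model.labels_

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LinearSVC().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# At inference time: embed new docs, reduce with UMAP, classify with the SVM.
new_embeddings = topic_model._extract_embeddings(["a new document"], method="document")
new_reduced = topic_model.umap_model.transform(new_embeddings)
print(clf.predict(new_reduced))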

Please report back, I haven't tried these things myself.

edit: actually I just might try some of these things too.

yotammarton commented 2 years ago

@simonfelding Good suggestions, please let us know what you are able to achieve; I will not be able to explore them myself. Do note my previous comment where I suggested deleting (del) some internal parts, which worked nicely.

I would rather keep UMAP and HDBSCAN as-is and not play around with such big components of the BERTopic algorithm.

simonfelding commented 2 years ago

Yep, saw that! I think there is a lot to gain from reducing the size of the integers in the UMAP model (probably halving the memory usage). I'll let you know if I get it done.


simonfelding commented 2 years ago

So I did a little investigation. There's an object that's HUMONGOUS: topic_model.vectorizer_model, in particular topic_model.vectorizer_model.__dict__.

For my 48,000-document model:

from pympler import asizeof

def bytesto(n_bytes, to, bsize=1024):
    # Convert a raw byte count to kB/MB/GB/... (to = 'k', 'm', 'g', ...).
    units = {'k': 1, 'm': 2, 'g': 3, 't': 4, 'p': 5, 'e': 6}
    return n_bytes / (bsize ** units[to])

print(f"vectorizer: {bytesto(asizeof.asizeof(topic_model.vectorizer_model), 'm')} MB")
print(f"hdbscan: {bytesto(asizeof.asizeof(topic_model.hdbscan_model), 'm')} MB")
print(f"umap: {bytesto(asizeof.asizeof(topic_model.umap_model._raw_data), 'm')} MB")
print(f"c-tf-idf: {bytesto(asizeof.asizeof(topic_model.c_tf_idf), 'm')} MB")
vectorizer: 891.5550994873047 MB
hdbscan: 7.0614471435546875 MB
umap: 93.9317626953125 MB
c-tf-idf: 80.65547943115234 MB

The vectorizer model seems to expand massively per document because vectorizer_model.__dict__ stores the full fitted vocabulary built from all words in all documents... It's by far the largest object in the topic model, about 10x larger than the two other large objects. @yotammarton deleting this dict will save you a huge amount of memory. @MaartenGr is it even intentional that it is saved? I've read the BERTopic code quite a few times now and I don't recall this dict even being referenced after fitting, but I may be wrong.

MaartenGr commented 2 years ago

Yes, the CountVectorizer needs to be saved as that specific fitted instance (including its vocabulary) is used later on in topics_over_time and topics_per_class. It cannot be re-fitted as that will change the bag-of-words representation which makes it difficult to then compare c-TF-IDF matrices.
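
A hedged aside, not from the discussion above: if the vectorizer's size is a concern, an alternative to deleting it is to fit with a smaller vocabulary in the first place (the parameter values below are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# A higher min_df and an explicit cap on the vocabulary keep the fitted
# CountVectorizer (and the c-TF-IDF matrix) much smaller.
vectorizer_model = CountVectorizer(min_df=10, max_features=50_000, stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model)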

MaartenGr commented 2 years ago

Due to inactivity, this issue will be closed. If you want to discuss this further, let me know and I'll see how I can help out.