In general, I would not recommend removing some of the self variables, as they all have a specific purpose and are used throughout the model. Do this only if you are absolutely sure that it does not interfere with any of the functionality of the model.
Identifying why the model is large in the first place is key, and is something you should do before looking into deleting certain variables. There are a number of sub-models that can potentially be quite large (for example the embedding model, the UMAP model, the HDBSCAN model, and the vectorizer).
To give you an idea of the effect of certain models:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)
topic_model.save("full_model")
Then, you can remove certain models from BERTopic to check the effect on the model size:
umap_model = topic_model.umap_model
topic_model.umap_model = None
topic_model.save("without_umap")
In the example above, the UMAP model and the embedding model are the largest (roughly 100MB and 50MB respectively). I would advise starting with UMAP. From what I remember, it keeps track of the raw input data as it is needed for inference, so the resulting model can be quite large. Also, you do not have to save the embedding model for each topic model, as they are exactly the same: topic_model.save("full_model", save_embedding_model=False).
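As a sketch of the load side of that (assuming the BERTopic version used here, whose load() accepts an embedding_model argument for models saved without one):

from bertopic import BERTopic

# Save without the embedding model, then hand the shared embedding model back in at load time
topic_model.save("full_model", save_embedding_model=False)
loaded_model = BERTopic.load("full_model", embedding_model="all-MiniLM-L6-v2")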
With respect to your last question, it depends on the embedding model you use. These models are trained on a large amount of data and can often handle out of vocabulary words depending on how they are trained. The resulting embeddings should be similar if synonyms are used which are then clustered and identified as belonging to a certain cluster.
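To make the out-of-vocabulary point concrete, here is a small sketch (assuming sentence-transformers is installed and using all-MiniLM-L6-v2, the model mentioned in the question); the subword tokenizer splits an unseen word into known pieces, so its embedding still lands near related sentences:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# "mothersday" is not a whole-word vocabulary entry, but it is tokenized into subwords
emb = model.encode(["happy mothersday to all moms",
                    "happy mother's day to all moms",
                    "the gpu driver crashed again"])

print(util.cos_sim(emb[0], emb[1]))  # expected: high similarity
print(util.cos_sim(emb[0], emb[2]))  # expected: low similarity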
Thank you for your answer @MaartenGr
I wish to update that I made some small progress following your guidelines, but haven't squeezed everything out of it yet.
In the meanwhile, I save the topic_model as-is with no changes, load it, and then delete some unnecessary attributes that are not being used during inference.
# Delete attributes from topic model to make it consume less RAM
del topic_model.umap_model.graph_
del topic_model.umap_model._knn_search_index._raw_data
del topic_model.c_tf_idf
del topic_model.umap_model._raw_data
This kind of usage is suitable for me because I load replicas of the same model in an inference server and I only care about the total amount of RAM consumed by all of them together (reducing the size of the model allows me to create more replicas of the model and thus serve more inference requests).
To be able to run inference (i.e. topic_model.transform) after deleting the above, I needed to modify the umap source code as shown in my fork:
https://github.com/ymartin-mw/umap
Glad to hear that the steps taken above helped you run the model properly in your use case. Do you know how much RAM was reduced using the above method? Quite curious to see what the effect is of diving into class variables.
RAM usage with the del commands from my previous comment: 17.0GB.
When I tried to apply the del commands, save the model to disk, and load it again, I encountered some errors:
File "/home/ubuntu/anaconda3/envs/user_clsf/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1169, in _init_search_function
data = self._raw_data
AttributeError: 'NNDescent' object has no attribute '_raw_data'
Try compressing and quantizing the model with TensorFlow Lite. It should speed it up and reduce the size significantly.
@simonfelding Interesting, so you suggest that this should also work for a BERTopic model? https://www.tensorflow.org/model_optimization/guide/quantization/post_training
Any recommendation or code snippet for BERTopic?
I thought so, but it doesn't make sense: BERTopic itself is not a TensorFlow-based model, it just uses the embeddings :) How about just using more swap space on your server? You can use zram if you want compressed RAM.
Alternatively, if you really want to strip down BERTopic, all you need for prediction is the UMAP model and the HDBSCAN model. Here is the transform method (from _bertopic.py):
def transform(self,
              documents: Union[str, List[str]],
              embeddings: np.ndarray = None) -> Tuple[List[int], np.ndarray]:
    """ After having fit a model, use transform to predict new instances

    Arguments:
        documents: A single document or a list of documents to fit on
        embeddings: Pre-trained document embeddings. These can be used
                    instead of the sentence-transformer model.

    Returns:
        predictions: Topic predictions for each documents
        probabilities: The topic probability distribution which is returned by default.
                       If `calculate_probabilities` in BERTopic is set to False, then the
                       probabilities are not calculated to speed up computation and
                       decrease memory usage.

    Usage:
    ###
    ### abbreviated
    ###
    """
    check_is_fitted(self)
    check_embeddings_shape(embeddings, documents)

    if isinstance(documents, str):
        documents = [documents]

    if embeddings is None:
        embeddings = self._extract_embeddings(documents,
                                               method="document",
                                               verbose=self.verbose)

    umap_embeddings = self.umap_model.transform(embeddings)
    logger.info("Reduced dimensionality with UMAP")

    predictions, probabilities = hdbscan.approximate_predict(self.hdbscan_model, umap_embeddings)
    logger.info("Predicted clusters with HDBSCAN")

    if self.calculate_probabilities:
        probabilities = hdbscan.membership_vector(self.hdbscan_model, umap_embeddings)
        logger.info("Calculated probabilities with HDBSCAN")
    else:
        probabilities = None

    probabilities = self._map_probabilities(probabilities, original_topics=True)
    predictions = self._map_predictions(predictions)

    return predictions, probabilities
So if you want to really simplify the whole thing, you could just turn it into a scikit-learn pipeline and precompute the k-NN to speed it up and lower memory requirements. It's the k-NN calculation that causes the RAM spikes, as far as I can tell.
I see that UMAP also stores its matrix as a numpy.int64 array. You could reduce the size of the array and save a lot of space. You could also try quantizing the vectors in the array, like in the TensorFlow Lite example. And you could try using blz to compress arrays in memory.
You can also just make your model significantly smaller by reducing UMAP dimensionality and number of topics.
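For example, a rough sketch of turning those two knobs (the specific parameter values here are illustrative only, not a recommendation):

from bertopic import BERTopic
from umap import UMAP

# Fewer UMAP output dimensions and a capped number of topics shrink what the model has to store
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, metric="cosine")
topic_model = BERTopic(umap_model=umap_model, nr_topics=50, verbose=True)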
And finally, you could try replacing HDBSCAN entirely for your predictions, with an SVM for example; it's usually pretty fast. You already have your model, generated with BERTopic: a vector matrix generated by UMAP (the data) and a clustering generated by HDBSCAN (the labels). Split this into train and test sets and fit an SVM that can classify a new vector without HDBSCAN.
It could work well and save you some memory by relying only on UMAP and an SVM for prediction!
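A minimal sketch of that idea, assuming the topic_model, docs, and topics variables from the fit example earlier in this thread, and reusing the private _extract_embeddings helper that BERTopic's own transform() calls:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Reduced vectors from UMAP are the "data", HDBSCAN's cluster assignments the "labels"
embeddings = topic_model._extract_embeddings(docs, method="document")
umap_embeddings = topic_model.umap_model.transform(embeddings)
labels = np.array(topics)

X_train, X_test, y_train, y_test = train_test_split(
    umap_embeddings, labels, test_size=0.2, random_state=42)

svm = LinearSVC().fit(X_train, y_train)
print("agreement with HDBSCAN labels on held-out data:", svm.score(X_test, y_test))

# Inference path: embed -> UMAP -> SVM, no HDBSCAN object needed in memory
new_docs = ["my graphics card drivers keep crashing"]
new_embeddings = topic_model._extract_embeddings(new_docs, method="document")
predicted_topics = svm.predict(topic_model.umap_model.transform(new_embeddings))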
Please report back, I haven't tried these things myself.
edit: actually I just might try some of these things too.
@simonfelding Good suggestions, please let us know what you were able to achieve; I will not be able to explore them myself.
Do note my previous comment where I suggested to del some internal parts, which worked nicely.
I would rather keep UMAP and HDBSCAN as-is and not play around with such big ingredients of the BERTopic algorithm.
Yep, saw that! I think it's possible to gain a lot from reducing the size of the integers in the umap model (probably halving the memory usage). I'll let you know if I get it done.
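A heavily caveated sketch of what that could look like (this pokes at private UMAP state, just like the del approach above, so transform() output should be re-checked afterwards):

import numpy as np

# Downcast any 64-bit numpy arrays on the fitted UMAP object to 32-bit equivalents;
# purely illustrative, and only helps for attributes actually stored at 64-bit precision
for name, value in vars(topic_model.umap_model).items():
    if isinstance(value, np.ndarray):
        if value.dtype == np.int64:
            setattr(topic_model.umap_model, name, value.astype(np.int32))
        elif value.dtype == np.float64:
            setattr(topic_model.umap_model, name, value.astype(np.float32))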
So I did a little investigation.
There's an object that's HUMONGOUS: topic_model.vectorizer_model, in particular topic_model.vectorizer_model.__dict__.
For my 48,000-document model:

import pympler.asizeof

def bytesto(nbytes, to, bsize=1024):
    # Convert a raw byte count into KB / MB / GB / ...
    units = {'k': 1, 'm': 2, 'g': 3, 't': 4, 'p': 5, 'e': 6}
    return float(nbytes) / (bsize ** units[to])

print(f"vectorizer: {bytesto(pympler.asizeof.asizeof(topic_model.vectorizer_model), 'm')}M")
print(f"hdbscan: {bytesto(pympler.asizeof.asizeof(topic_model.hdbscan_model), 'm')}M")
print(f"umap: {bytesto(pympler.asizeof.asizeof(topic_model.umap_model._raw_data), 'm')}M")
print(f"c-tf-idf: {bytesto(pympler.asizeof.asizeof(topic_model.c_tf_idf), 'm')}M")
vectorizer: 891.5550994873047M
hdbscan: 7.0614471435546875M
umap: 93.9317626953125M
c-tf-idf: 80.65547943115234M
The vectorizer model seems to expand massively per document because vectorizer_model.__dict__ stores the fitted vocabulary built from all words in all documents... It's by far the largest object in the topic model, about 10x larger than the two other large objects.
@yotammarton deleting this dict will save you a huge amount of memory.
@MaartenGr is it even intentional that it is saved? I've read the BERTopic code quite a few times now and I don't recall this dict even being referenced after fitting, but I may be wrong.
Yes, the CountVectorizer needs to be saved, as that specific fitted instance (including its vocabulary) is used later on in topics_over_time and topics_per_class. It cannot be re-fitted, as that would change the bag-of-words representation, which makes it difficult to then compare c-TF-IDF matrices.
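That said, one part of the fitted CountVectorizer can usually be trimmed without touching the vocabulary: scikit-learn's documentation notes that the stop_words_ attribute is provided only for introspection and can be safely removed before pickling. A hedged sketch:

# Drop only the introspection-only stop_words_ set; vocabulary_ (which
# topics_over_time / topics_per_class rely on) stays intact
if hasattr(topic_model.vectorizer_model, "stop_words_"):
    del topic_model.vectorizer_model.stop_words_

topic_model.save("trimmed_model", save_embedding_model=False)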
Due to inactivity, this issue will be closed. If you want to discuss this further, let me know and I'll see how I can help out.
Hello,
Saving a BERTopic model trained with 3.5M texts, the resulting file is some ±25GB big. I have memory (RAM) concerns when inferencing with the model (*).
I am trying to lower the model size by (possibly) loading it, manipulating its inner self variables (maybe setting some of them to None? self.c_tf_idf for example?) and saving it again, hoping for a smaller model.
1. Which self variables can I safely manipulate (set to None or del self.x) in order to reduce the model size? Referring to HDBSCAN and UMAP as well, if you know. How complicated is that?
2. Will the model be able to classify (to a cluster c) a sentence that consists only of out-of-vocabulary words for my embedding_model? e.g. with the all-MiniLM-L6-v2 model, will the sentence "mothersday" be classified to the expected cluster c, even though mothersday is not part of the vocabulary (but is a top-10 cTFIDF word in the cluster c)?
(*) I am creating an inference server that creates replicas of the model to reduce the inference time, and I want as many replicas as possible. For 2 replicas I need two 25GB RAM machines.
Thank you :)