MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Saving BERTopic model when using Parametric UMAP #1547

Open mohammadm1985 opened 1 year ago

mohammadm1985 commented 1 year ago

Hi,

Thank you so much for all your help. I created a model that suits all my needs, and the results are currently as expected. I need to save the model, load it, and transform new data each month. I use Parametric UMAP instead of the original UMAP for dimensionality reduction because the parametric version produces deterministic results and is not batch-dependent. I am very satisfied with the outcome. However, I cannot save the model: every approach I have tried fails. I was wondering whether I could save the dimensionality reduction model (the umap_model component of BERTopic) independently and swap it back in after loading the trained model, without disturbing the rest of the model. Do you have any advice? This is the last stage of my project, and if I cannot save the model all my efforts will be in vain. I would really appreciate any options that might resolve this issue.

P.S.: When I try the safetensors or pytorch approach, I get an error on loading:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_331/1509421400.py in <module>
      1 from bertopic import BERTopic
----> 2 loaded_model = BERTopic.load('/home/mmotall/complaints_subcat/model_training/models_developped/bert/model-save-test/topics_model')

~/venv/lib/python3.7/site-packages/bertopic/_bertopic.py in load(cls, path, embedding_model)
   3006         else:
   3007             raise ValueError("Make sure to either pass a valid directory or HF model.")
-> 3008         topic_model = _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
   3009 
   3010         # Replace embedding model if one is specifically chosen

~/venv/lib/python3.7/site-packages/bertopic/_bertopic.py in _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
   4022 
   4023         # CountVectorizer
-> 4024         topic_model.vectorizer_model = CountVectorizer(**ctfidf_config["vectorizer_model"]["params"])
   4025         topic_model.vectorizer_model.vocabulary_ = ctfidf_config["vectorizer_model"]["vectorizer_model"]["vocab"]
   4026 

~/venv/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

TypeError: __init__() got an unexpected keyword argument 'norm'
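A possible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: `norm` is a `TfidfVectorizer` parameter that `CountVectorizer` does not accept, and the training code posted later in this thread passes a `TfidfVectorizer` as `vectorizer_model`. If the saved parameters came from that vectorizer, rebuilding a `CountVectorizer` from them would fail in this way. The sketch below uses only the standard library and a hypothetical stand-in class (`CountLikeVectorizer`) to show how saved parameters can be compared against a constructor's signature before they are unpacked.

```python
import inspect


class CountLikeVectorizer:
    """Hypothetical stand-in with a CountVectorizer-style constructor (no `norm`)."""

    def __init__(self, ngram_range=(1, 1), min_df=1, stop_words=None):
        self.ngram_range = ngram_range
        self.min_df = min_df
        self.stop_words = stop_words


def split_params(cls, params):
    """Split saved params into those the constructor accepts and the rest."""
    accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}
    kept = {k: v for k, v in params.items() if k in accepted}
    rejected = sorted(k for k in params if k not in accepted)
    return kept, rejected


# Illustrative params, as they might look when saved from a TF-IDF style vectorizer.
saved_params = {"ngram_range": (1, 2), "min_df": 5, "norm": "l2", "use_idf": True}

kept, rejected = split_params(CountLikeVectorizer, saved_params)
vectorizer = CountLikeVectorizer(**kept)  # succeeds once the extra keys are dropped
print("rejected:", rejected)
```

Unpacking `saved_params` directly into the stand-in constructor raises the same kind of `TypeError` as in the traceback, which is why the rejected keys are worth inspecting first.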

When I save the model as a pickle, every aspect of the model is saved correctly except the dimensionality reduction model (Parametric UMAP). I was therefore wondering whether I can save the Parametric UMAP independently and later attach it to the loaded BERTopic model. Is that possible?

MaartenGr commented 1 year ago

Could you share your full code for training, saving, and loading your BERTopic model? That will make it easier to debug what is happening here. I would generally advise using safetensors instead, but you mention that it is also not working. Likewise, could you share the full code including the error logs from both examples?

mohammadm1985 commented 1 year ago

Thank you for your prompt response. I cannot use the safetensors or PyTorch formats because they do not preserve the model components, including UMAP. My work requires loading the entire model and transforming new embeddings: I train on historical data, then each month I load the model and run transform on new embeddings. That means I need to preserve the dimensionality reduction model I am using, which is why I need the pickle approach. Pickle works well with the original UMAP, but I am using Parametric UMAP because it is more stable when transforming new data and is not batch-dependent. Parametric UMAP includes TensorFlow components (a neural network encoder) that are not picklable. I therefore decided to save the entire BERTopic model as a pickle and, after loading, swap topic_model.umap_model with the trained Parametric UMAP that I saved using the approach described in parametric_umap save.
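To see which component blocks pickling, one option is to try pickling each attribute of the model on its own. The following is a minimal sketch under stated assumptions: `find_unpicklable_attributes` is a hypothetical helper, not part of BERTopic or UMAP, and the toy model uses a thread lock as a stand-in for a non-picklable component such as a Keras network.

```python
import pickle
import threading
from types import SimpleNamespace


def find_unpicklable_attributes(obj):
    """Return {attribute name: error message} for attributes pickle rejects."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickle can raise several exception types
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Hypothetical stand-in for a topic model with one non-picklable component.
toy_model = SimpleNamespace(
    top_n_words=20,
    topic_labels={0: "billing", 1: "delivery"},
    umap_model=threading.Lock(),  # stands in for a component pickle cannot handle
)

failures = find_unpicklable_attributes(toy_model)
print(failures)
```

On a real model, the same loop over `vars(topic_model)` would list every attribute that fails, which helps confirm whether the reducer is the only obstacle.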

Let me share my entire model first, and show you how I want to load the model again in another notebook.

This is the code for BERTopic training:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from umap import UMAP
from umap.parametric_umap import ParametricUMAP
import tensorflow as tf

import random
import numpy as np  # needed for np.random.seed below
random_state = 42
np.random.seed(random_state)
tf.random.set_seed(random_state)
random.seed(random_state)

# Define an encoder for parametric UMAP. Parametric umap is almost the same as UMAP but uses a neural network to optimize and preserve the structure of the constructed graph.

def encoder_func_nd(dim = 50):
    init = tf.initializers.HeNormal()
    alpha_val = 0.001
    dims = (768,)
    encoder = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape = dims),
        tf.keras.layers.Dense(units=650, kernel_initializer=init),
        tf.keras.layers.LeakyReLU(alpha=alpha_val),
        # ... other layers ...
        tf.keras.layers.Dense(units=dim),
    ])
    return encoder

n_neighbors_in = 15
n_components_in = 50

# I use this 3D reducer for plotting purposes:

Reducer_3D = ParametricUMAP(encoder = encoder_func_nd(dim = 3),
                            n_components=3,
                            n_neighbors = n_neighbors_in,
                            min_dist = 0,
                            metric='cosine',
                            spread = 0.5,
                            unique = False,
                            transform_queue_size = 50,
                            negative_sample_rate = 50,
                            n_jobs = 50,
                            angular_rp_forest=True,
                            random_state = random_state,
                            transform_seed = random_state,
                              )

embeddings_reduced3D = Reducer_3D.fit_transform(embeddings)

Reducer_nD = ParametricUMAP(encoder = encoder_func_nd(dim = 50),
                            n_components=50,
                            n_neighbors = n_neighbors_in,
                            min_dist = 0,
                            metric='cosine',
                            spread = 0.5,
                            unique = False,
                            transform_queue_size = 50,
                            negative_sample_rate = 50,
                            n_jobs = 50,
                            angular_rp_forest=True,
                            random_state = random_state,
                            transform_seed = random_state,
                              )

clusterer_model = HDBSCAN(min_cluster_size = 14,
                          min_samples = 1,
                          cluster_selection_epsilon = 0,
                          cluster_selection_method = "eom",
                          prediction_data=True,
                          approx_min_span_tree = False)

topic_model = BERTopic(embedding_model = sentence_model,
                       verbose = True,
                       top_n_words = 20,
                       n_gram_range = (1, 2),
                       ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True),
                       vectorizer_model= TfidfVectorizer(stop_words=SWV, ngram_range=(1, 2), vocabulary = vocabulary, min_df=5),
                       umap_model = Reducer_nD,
                       hdbscan_model = clusterer_model,
                       calculate_probabilities = True,
                       # representation_model = [MaximalMarginalRelevance(diversity=0.1), KeyBERTInspired()]
                      )

topics, probs = topic_model.fit_transform(docs_t, embeddings)

n_outlier = topic_model.get_topic_info()[topic_model.get_topic_info()["Topic"] == -1]["Count"][0]

print(f"Number of Outliers: {n_outlier}")

fig = topic_model.visualize_documents(docs_t,
                                topics = topic_model.topics_,
                                embeddings = embeddings,
                                reduced_embeddings =  embeddings_reduced3D,
                                sample = 1,
                                hide_annotations = True,
                                hide_document_hover = False,
                                custom_labels = False,
                                title= "<b>Documents and Topics</b>",
                                width= 1500,
                                height= 750)

After training the model, I want to save it as follows:

import os
import pickle
os.chdir('/directory/to/save/')
pkl_name = 'model.pkl'
with open(pkl_name, 'wb') as file:
    pickle.dump(topic_model, file)

Now I have to save the Parametric UMAP separately and, after loading the pickled model in another notebook, use it to replace topic_model.umap_model:

This is from the parametric umap webpage: (https://umap-learn.readthedocs.io/en/latest/parametric_umap.html)


_Saving and loading your model: Unlike non-parametric UMAP, Parametric UMAP cannot be saved simply by pickling the UMAP object because of the Keras networks it contains. To save Parametric UMAP, there is a built-in function:

embedder.save('/your/path/here')

You can then load parametric UMAP elsewhere:

from umap.parametric_umap import load_ParametricUMAP
embedder = load_ParametricUMAP('/your/path/here')

This loads both the UMAP object and the parametric networks it contains._


This is why I used this:

Reducer_3D.save('/directory/to/save/')

to save the Parametric UMAP I use for plotting, and:

topic_model.umap_model.save('/directory/to/save/')

to save the umap_model embedded and already trained in the topic_model.

My plan is to load the models as follows and join the saved parametric UMAP to the pickled model:

from umap.parametric_umap import load_ParametricUMAP
Reducer_3D = load_ParametricUMAP('/saved/parametric_umap3d')
Reducer_nD = load_ParametricUMAP('/saved/parametric_umapnd')

And rejoin it to the loaded pickle:

import os
import pickle
os.chdir('/directory/with/saved/bertopic/model')
pkl_name = 'model.pkl'
# Open in read-binary mode; 'wb' would truncate the saved file
with open(pkl_name, 'rb') as file:
    topic_model = pickle.load(file)

topic_model.umap_model = Reducer_nD
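The detach-and-reattach plan described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not a tested BERTopic recipe: `save_without_reducer` and `load_with_reducer` are hypothetical helpers, the models are `SimpleNamespace` stand-ins, and a thread lock plays the role of the non-picklable Keras network. In the real workflow the reducer would be saved with its own `save` method and restored with `load_ParametricUMAP`.

```python
import os
import pickle
import tempfile
import threading
from types import SimpleNamespace


def save_without_reducer(model, path):
    """Pickle the model with umap_model detached, then restore it in memory."""
    reducer = model.umap_model
    model.umap_model = None
    try:
        with open(path, "wb") as file:
            pickle.dump(model, file)
    finally:
        model.umap_model = reducer  # the in-memory model stays usable


def load_with_reducer(path, reducer):
    """Load the pickled model and reattach a separately restored reducer."""
    with open(path, "rb") as file:
        model = pickle.load(file)
    model.umap_model = reducer
    return model


# Hypothetical stand-ins: the lock makes the reducer non-picklable.
toy_reducer = SimpleNamespace(n_components=50, network=threading.Lock())
toy_model = SimpleNamespace(umap_model=toy_reducer, topic_labels={0: "billing"})

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_without_reducer(toy_model, path)

# Here the same object is reused; in practice it would be loaded from disk.
restored = load_with_reducer(path, toy_reducer)
print(restored.topic_labels, restored.umap_model.n_components)
```

Detaching inside a `try`/`finally` keeps the in-memory model intact even if pickling fails partway through.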

Now, the issue is that something weird happens when I save an instance of the Parametric UMAP. After saving one instance, I cannot save another, not even the same instance I saved before. With the first instance (for example, Reducer_3D.save('/directory/to/save/')), the model saves with these warnings and notes:

WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
INFO:tensorflow:Assets written to: /home/reducernd/encoder/assets
Keras encoder model saved to /home/ reducernd /encoder
WARNING:absl:Found untraced functions such as _update_step_xla while saving (showing 1 of 1). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: /home/ parametric_model/assets
INFO:tensorflow:Assets written to: /home/parametric_model/assets
Keras full model saved to /home/reducernd/parametric_model
Keras weights file (<HDF5 file "variables.h5" (mode r+)>) saving:
Keras model archive saving:
File Name                                             Modified             Size
metadata.json                                  2023-09-26 20:47:55           64
config.json                                    2023-09-26 20:47:55         6339
variables.h5                                   2023-09-26 20:47:55      6184520
Keras model archive loading:
File Name                                             Modified             Size
metadata.json                                  2023-09-26 20:47:54           64
config.json                                    2023-09-26 20:47:54         6339
variables.h5                                   2023-09-26 20:47:54      6184520
Keras weights file (<HDF5 file "variables.h5" (mode r)>) loading:
...layers..
Pickle of ParametricUMAP model saved to /home/reducernd/model.pkl

Now if I want to save any instance of the parametric umap, I get this error:

TypeError: Cannot serialize object <tensorflow.python.eager.polymorphic_function.polymorphic_function.Function object at 0x7f141071bc90> of type <class 'tensorflow.python.eager.polymorphic_function.polymorphic_function.Function'>. To be serializable, a class must implement the `get_config()` method.

This is a really weird behavior. Saving one instance of parametric UMAP object changes a function so that it is not serializable the second time. I got stuck here. I don't know how to save a complete model so that it can be loaded with all the model components, and I can run the transform function on new data! I really appreciate it if you could provide some thoughts that I can work on. I tried everything I could and failed.

MaartenGr commented 1 year ago

"I am training on historical data, then I need to load the model and run transform on new embeddings each month."

Documents can also be assigned to topics through the embeddings alone. Couldn't those be used for assignment instead of the UMAP/HDBSCAN combination? If you used safetensors, that would circumvent the issue. I think it might be worth trying out.
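The idea of assigning documents through embeddings can be illustrated with a small sketch: each new document goes to the topic whose embedding has the highest cosine similarity. This is a plain-Python illustration of the idea, not BERTopic's actual implementation; `assign_topics` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_topics(doc_embeddings, topic_embeddings):
    """Assign each document to the topic whose embedding is most similar."""
    assignments = []
    for doc in doc_embeddings:
        best_topic = max(
            topic_embeddings,
            key=lambda topic: cosine_similarity(doc, topic_embeddings[topic]),
        )
        assignments.append(best_topic)
    return assignments


# Toy 3-dimensional vectors; real sentence embeddings are much larger.
topic_embeddings = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0]}
new_docs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
print(assign_topics(new_docs, topic_embeddings))  # prints [0, 1]
```

Because this route needs only the topic embeddings and the new document embeddings, it does not depend on the dimensionality reduction model being restored.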

"This is a really weird behavior. Saving one instance of parametric UMAP object changes a function so that it is not serializable the second time. I got stuck here. I don't know how to save a complete model so that it can be loaded with all the model components, and I can run the transform function on new data! I really appreciate it if you could provide some thoughts that I can work on. I tried everything I could and failed."

Not sure what is happening here. It seems to be related specifically to UMAP, so it might be best to also post this issue on the UMAP repo. I think they will be able to help you out much better!