mjavedgohar commented 3 years ago

Hi @MaartenGr ,

As I understand BERTopic, fit_transform() trains the model while transform() is for prediction. Am I right? What is the best way to train the model on data from different sources, e.g. Twitter, Reddit, Facebook comments, etc.? I want to train the model once and use it for various datasets. Should I divide the data into sentences, because some sources have very large comments (paragraphs), e.g. Reddit or news articles?

Thanks

MaartenGr commented 3 years ago

You are correct. You can use fit_transform() to train the model. The transform() function is indeed used for prediction. Do note that fit_transform() not only trains the model but also predicts the data on which it was trained.

In practice, I would try to combine as many sources as possible before training the model. If you have various datasets, then you can simply combine them and train over all of them. However, if there is a specific reason for training on only a single dataset and predicting for all others, then that is also possible. I can imagine it could be computationally expensive to train on all datasets or that you only want the topics from a single source represented. In those cases, it should be fine to train on a single dataset, although training on all of them is preferred.

Whether to split depends on the content of the large paragraphs. If you suspect that those paragraphs contain multiple topics, then I would advise splitting them up into sentences. You can use spaCy to do that. However, if you think that there is only a single topic in each large paragraph, then there is no need to split them up.

If possible, I would first try training on the data without splitting it up into sentences and see if the resulting topics make sense. If not, then a sentence splitter would be your next step.
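If splitting does turn out to be necessary, the idea can be sketched as follows. This is only an illustration: it uses a naive regular expression as a stand-in for a proper sentence splitter such as spaCy's, and the function name `split_into_sentences` and the sample documents are made up for the example. It also records which document each sentence came from, so that topics can be mapped back to the original comments.

```python
import re

# Naive rule: split after ., ! or ? when followed by whitespace.
# A stand-in for a real sentence splitter such as spaCy's.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(docs):
    """Return (sentences, doc_ids); doc_ids[i] is the index of the
    document that sentences[i] was taken from."""
    sentences, doc_ids = [], []
    for doc_id, doc in enumerate(docs):
        for sentence in _SENTENCE_END.split(doc.strip()):
            if sentence:
                sentences.append(sentence)
                doc_ids.append(doc_id)
    return sentences, doc_ids


docs = [
    "The vaccine rollout was slow. Meanwhile, fuel prices went up again!",
    "Short tweet about football",
]
sentences, doc_ids = split_into_sentences(docs)
```

The `sentences` list can then be passed to fit_transform() in place of the original documents, with `doc_ids` kept alongside for mapping results back.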

mjavedgohar commented 3 years ago

@MaartenGr Thanks for your reply. Does that mean that once I have trained the model, I can save it and use transform() on other datasets, just like other ML models? Is there any method to evaluate the trained BERTopic model?

MaartenGr commented 3 years ago

Yes! You can train the model and save it for other datasets just like other ML models. Do note that it is important that the versions of packages stay the same when switching between environments. Most issues related to model loading can be solved by looking at the environment.

This is actually quite a complex subject. Although there are methods that you can employ for evaluation, such as c_v, they suffer from a number of issues. Topic modeling creates a highly subjective output in a way, and evaluating that output is quite difficult. Do you focus on the topic coherence, its clustering capabilities, predictive power, or anything else? Those questions, in part, are what make it difficult.

So while I am definitely not against evaluation metrics, I do think it is important to realize that they by no means represent a ground truth and can be misleading in some cases.

You can look towards Gensim or Octis for evaluation metrics/functions/libraries.
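As a small illustration of what such a metric looks like, here is a from-scratch sketch of topic diversity: the fraction of unique words among the top words of all topics. It is written by hand for this example and is not the implementation from Gensim or OCTIS; the function name `topic_diversity` and the toy topics are assumptions.

```python
def topic_diversity(topic_words, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    1.0 means no word is shared between topics; values near 0 mean
    the topics largely repeat the same words.
    """
    top_words = [word for words in topic_words for word in words[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy topics; with BERTopic the words could be taken from
# topic_model.get_topic(topic_id), which returns (word, score) pairs.
topics = [
    ["vaccine", "dose", "booster"],
    ["election", "vote", "ballot"],
    ["vaccine", "mandate", "vote"],
]
score = topic_diversity(topics, top_k=3)  # 7 unique words out of 9
```

Like the other metrics mentioned above, this number says nothing about whether the topics are meaningful to a human reader.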

mjavedgohar commented 3 years ago

Hi @MaartenGr,

Thanks for your help. One more request: I am using get_representative_docs() to get the representative docs, but it returns only three. Is there any way to get a required number n of docs for a specific topic?

Thanks again

MaartenGr commented 3 years ago

There are several reasons for using a fixed value. First, the value needs to be equal to or lower than min_topic_size, since a topic is only guaranteed to contain that many documents; a larger value may result in issues. Second, allowing for the top n can quickly lead to simply saving all documents in the model, which makes the model explode in size. Third, three documents should give you enough of an idea to understand what the topic is about. Any more than that is typically redundant. Fourth, whenever topics are reduced, the representative documents are simply put together. In other words, if you merge 4 topics, then the new topic will contain 3 times 4 = 12 representative documents. Increasing n will again lead to too many representative documents for a single topic.
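If more examples per topic are needed, one workaround is to group the documents by their assigned topic outside the model. The sketch below assumes `docs` and `topics` are the aligned lists passed to and returned by fit_transform(); the helper names are made up. Note that these are simply the first n documents assigned to a topic, not documents ranked by how representative they are.

```python
from collections import defaultdict


def docs_by_topic(docs, topics):
    """Group documents by the topic assigned to them."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)


def first_n_docs(docs, topics, topic_id, n=5):
    """Return up to n documents assigned to topic_id, in input order."""
    return docs_by_topic(docs, topics).get(topic_id, [])[:n]


docs = ["a", "b", "c", "d", "e"]  # toy stand-ins for real documents
topics = [0, 1, 0, -1, 0]         # e.g. the output of fit_transform(docs)
examples = first_n_docs(docs, topics, topic_id=0, n=2)
```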

mjavedgohar commented 3 years ago

Hi @MaartenGr, it was working fine, but since this morning I am getting the following error when I try to load the BERTopic model in a Google Colab notebook. The error is at the line "from bertopic import BERTopic". Can you please help me with it?

ERROR:

TypeError                                 Traceback (most recent call last)
in ()
     19 import contractions
     20
---> 21 from bertopic import BERTopic
     22 from sklearn.feature_extraction.text import CountVectorizer

13 frames
/usr/local/lib/python3.7/dist-packages/distributed/config.py in ()
     18
     19 with open(fn) as f:
---> 20     defaults = yaml.load(f)
     21
     22 dask.config.update_defaults(defaults)

TypeError: load() missing 1 required positional argument: 'Loader'

MaartenGr commented 3 years ago

This is an issue that quite randomly popped up. Fortunately, some fixes can be found here. Most likely, just running either pip uninstall distributed or pip install distributed==2021.9.0 will fix your issue. Hopefully, I can get to the bottom of this and fix it in the next release.

MaartenGr commented 3 years ago

@mjavedgohar A new version of BERTopic (v0.9.3) was released that should fix this issue and some others that should be helpful. You can install that version through pip install --upgrade bertopic. If you have any questions regarding this issue, release, or some other issue, please let me know!

mjavedgohar commented 3 years ago

@MaartenGr Thanks for your help. I am following the steps below for training and predicting. Is this OK for topic modelling with BERTopic? In prediction, it is also including the training docs, but I want to predict on new docs only.

Training:

1. Load docs/sentences
2. Instantiate the BERTopic model by defining parameters
3. fit_transform() for training, as listed below
4. Save the model

topic_model = BERTopic(
    low_memory=True,
    calculate_probabilities=False,
    nr_topics="auto",
    verbose=False,
    embedding_model=model,  # using a pre-trained BERT model
    n_gram_range=(1, 3),
    vectorizer_model=CountVectorizer(
        ngram_range=(1, 3),
        stop_words=final_stop_words,
        min_df=0.05,
        max_df=0.90,
    ),
)

Prediction:

1. Load new docs/sentences
2. Load the saved model
3. transform() for prediction

MaartenGr commented 3 years ago

Yes, it should be okay to train on your training docs and to predict on only new docs.

mjavedgohar commented 3 years ago

@MaartenGr I am getting the same topics in prediction as in training using the above parameters. Can you please help me to resolve this? In prediction, I want to display the topics from the new docs only. Or do I have to run fit_transform() for every dataset?

MaartenGr commented 3 years ago

If you are getting the same topics, then you are most likely predicting on the same documents as the ones you trained on. Typically, the workflow is something like this:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# We create a split between the documents that we train on and those that we predict
train_docs = fetch_20newsgroups(subset='train',  remove=('headers', 'footers', 'quotes'))['data']
test_docs = fetch_20newsgroups(subset='test',  remove=('headers', 'footers', 'quotes'))['data']

# Train the model on only the train_docs
topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L3-v2", verbose=True)
topics, probs = topic_model.fit_transform(train_docs)

# Predict topics for test_docs
predicted_topics, predicted_probs = topic_model.transform(test_docs)

mjavedgohar commented 3 years ago

@MaartenGr Thanks for your help. I used the same code you shared, but I am still getting the same topics when using 'topic_model.get_topic_info()'

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# We create a split between the documents that we train on and those that we predict
train_docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))['data']
test_docs = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))['data']

# Train the model only the train_docs
topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L3-v2", verbose=True)
topics, probs = topic_model.fit_transform(train_docs)
topic_model.get_topic_info()

# Predict topics for test_docs
predicted_topics, predicted_probs = topic_model.transform(test_docs)
topic_model.get_topic_info()

MaartenGr commented 3 years ago

Ah, when you use transform the model will not be trained. It will only predict which topics can be found in test_docs based on the topics trained on train_docs. This is the same with most models in general that have a transform or predict function. No changes will be made to the original model.

If you want to have new topics, then you need to re-train the model with all documents.
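To see which of the trained topics actually appear in the new documents, one option is to tally the output of transform() instead of calling get_topic_info(), which describes the documents the model was fitted on. A minimal sketch, assuming `predicted_topics` is the list returned by transform(); the helper name and the topic labels below are hypothetical.

```python
from collections import Counter


def summarize_predictions(predicted_topics, topic_labels=None):
    """Count how many new documents were assigned to each topic.

    Returns (topic_id, label, count) tuples, most frequent first.
    """
    topic_labels = topic_labels or {}
    counts = Counter(predicted_topics)
    return [
        (topic, topic_labels.get(topic, str(topic)), count)
        for topic, count in counts.most_common()
    ]


predicted_topics = [0, 2, 2, -1, 0, 2]  # e.g. from topic_model.transform(test_docs)
labels = {-1: "outliers", 0: "0_space_nasa", 2: "2_hockey_team"}  # hypothetical labels
summary = summarize_predictions(predicted_topics, labels)
```

The topics themselves are still the ones learned from the training documents; only the counts reflect the new data.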

mjavedgohar commented 3 years ago

Hi @MaartenGr ,

I trained a BERTopic model on an HPC server and saved it. Now I am trying to load it in a Google Colab notebook for visualization, but I am getting the following error on topic_model.load("model name")

ValueError: EOF: reading array data, expected 262144 bytes got 815

Can you please help resolve this issue? What is the best way to train a model on an HPC server and visualize it?

Thanks

MaartenGr commented 3 years ago

The most important thing when loading in a model is making sure that the environment is the same. So, make sure that the packages and versions used in the saving environment are the same as in the loading environment. For example, if you are using sentence-transformers v0.4.1 when saving the model, it is highly advised to use the same version when loading the model.
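One way to make such a comparison concrete is to record the installed package versions next to the saved model and compare them before loading. The sketch below uses only the standard library (Python 3.8+); the list of tracked packages and the helper names are assumptions for illustration.

```python
import json
from importlib import metadata

# Hypothetical list of packages to track; adjust to what your model uses.
TRACKED = ["bertopic", "sentence-transformers", "umap-learn", "hdbscan", "numpy"]


def snapshot_versions(packages=TRACKED):
    """Map each package name to its installed version (None if missing)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(saved, current):
    """Return {package: (saved_version, current_version)} for mismatches."""
    return {
        name: (saved[name], current.get(name))
        for name in saved
        if saved[name] != current.get(name)
    }


# In the saving environment: store this JSON next to the model file.
saved_json = json.dumps(snapshot_versions())

# In the loading environment: compare before loading the model.
mismatches = diff_versions(json.loads(saved_json), snapshot_versions())
```

An empty `mismatches` dictionary means the tracked packages match; any entries point at versions worth aligning before looking further.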

mjavedgohar commented 3 years ago

Hi @MaartenGr,

Thanks for your help. I am getting the following error when trying to visualize the topics over time. Can you please help me with this?

Code:

timestamps = review_data.timestamp.to_list()
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)

Error:

ValueError: arrays must all be same length

MaartenGr commented 3 years ago

I cannot be sure without having your entire code but it seems that topics, docs, and timestamps are not the same size.

mjavedgohar commented 3 years ago

hi @MaartenGr ,

Thanks for your help. I am using comments extracted from Reddit. The following is the code used to generate the topics.

Code:

docs = review_data.body.to_list()
docs = list(set(docs))

print("Embedding models")

from flair.embeddings import TransformerDocumentEmbeddings

Cbert_model = TransformerDocumentEmbeddings('digitalepidemiologylab/covid-twitter-bert-v2-mnli')  # 'digitalepidemiologylab/covid-twitter-bert-v2')

embeddings = Cbert_model.embed(docs)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')  # 'digitalepidemiologylab/covid-twitter-bert-v2-mnli') #'all-mpnet-base-v2'

import umap

umap_model = umap.UMAP(
    n_neighbors=100,  # size of neighbour
    n_components=10,  # dimensionality
    min_dist=0.1,  # The default value for min_dist (as used above) is 0.1. We will look at a range of values from 0.0 through to 0.99.
    metric='cosine',
    low_memory=False,
)

import hdbscan

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=50,
    min_samples=1,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True,
)

topic_model = BERTopic(
    top_n_words=10,
    n_gram_range=(1, 3),
    calculate_probabilities=True,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",
    verbose=True,
    embedding_model=model,
    vectorizer_model=CountVectorizer(ngram_range=(1, 3), stop_words=final_stop_words),
)

topics, probabilities = topic_model.fit_transform(docs)

timestamps = review_data.timestamp.to_list()
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)

Error:

File "Bert_topic_customized2.py", line 305, in <module>
  topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/_bertopic.py", line 447, in topics_over_time
  documents = pd.DataFrame({"Document": docs, "Topic": topics, "Timestamps": timestamps})
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
  mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
  arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 119, in arrays_to_mgr
  index = _extract_index(arrays)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 635, in _extract_index
  raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

MaartenGr commented 3 years ago

Yes, as I mentioned before it seems that timestamps are a different size from docs and topics.

You are taking the set here:

docs = review_data.body.to_list()
docs = list(set(docs))

Which most likely reduces the number of docs and may shuffle the documents. Then, you take

timestamps = review_data.timestamp.to_list()

Which is larger than your docs. Thus, make sure that your docs and timestamps have the same size and that each index corresponds to one another. If there are 10_000 documents in docs, there should be 10_000 entries in timestamps. Moreover, index 0 of docs should correspond to index 0 of timestamps.
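One way to remove duplicates without losing that correspondence is to deduplicate the documents and timestamps together instead of calling set() on the documents alone. A minimal sketch with a made-up helper name; with a pandas DataFrame, calling drop_duplicates(subset="body") before extracting both columns should achieve the same thing.

```python
def dedupe_aligned(docs, timestamps):
    """Drop repeated documents, keeping the first occurrence and its
    timestamp, so that both lists stay the same length and order."""
    if len(docs) != len(timestamps):
        raise ValueError("docs and timestamps must have the same length")
    seen = set()
    unique_docs, unique_timestamps = [], []
    for doc, timestamp in zip(docs, timestamps):
        if doc not in seen:
            seen.add(doc)
            unique_docs.append(doc)
            unique_timestamps.append(timestamp)
    return unique_docs, unique_timestamps


docs = ["first post", "second post", "first post"]         # toy data
timestamps = ["2021-01-01", "2021-01-02", "2021-01-03"]    # toy data
docs, timestamps = dedupe_aligned(docs, timestamps)
```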

mjavedgohar commented 3 years ago

Thanks @MaartenGr, it worked. Just another thing to discuss: I am getting the following error when the number of topics is very low. Can visualize the topics >=2

  fig = topic_model.visualize_topics()
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/_bertopic.py", line 909, in visualize_topics
  height=height)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/plotting/_topics.py", line 63, in visualize_topics
  embeddings = UMAP(n_neighbors=2, n_components=2, metric='hellinger').fit_transform(embeddings)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap_.py", line 2634, in fit_transform
  self.fit(X, y)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap_.py", line 2554, in fit
  self._raw_data[index], n_epochs, init, random_state,  # JH why raw data?
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap_.py", line 2601, in _fit_embed_data
  self.verbose,
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap_.py", line 1060, in simplicial_set_embedding
  metric_kwds=metric_kwds,
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/spectral.py", line 334, in spectral_layout
  maxiter=graph.shape[0] * 5,
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1598, in eigsh
  raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

MaartenGr commented 3 years ago

That is more an issue with the number of topics than necessarily the method. Typically, BERTopic would result in tens or hundreds of topics. Any less and you likely have too little data to work with, or you have set min_topic_size too high. I would advise trying to increase the number of topics, as that would most likely be the best representation of the data.
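Until more topics are available, a script can at least avoid the crash by checking the number of topics before plotting. This is a sketch under assumptions: the helper names are made up, and the default threshold of 4 is a guess based on the traceback above (the 2D layout fails when there are too few points); the exact minimum may differ between UMAP versions.

```python
def count_topics(topics, outlier_label=-1):
    """Number of distinct topics, ignoring the outlier topic -1."""
    return len(set(topics) - {outlier_label})


def can_visualize(topics, min_topics=4):
    """True if there are at least min_topics topics (assumed threshold)."""
    return count_topics(topics) >= min_topics


topics = [0, 1, -1, 1, 0]  # toy output of fit_transform()
if can_visualize(topics):
    pass  # here one would call topic_model.visualize_topics()
else:
    message = f"Only {count_topics(topics)} topics found; skipping the plot."
```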

mjavedgohar commented 3 years ago

Thanks @MaartenGr,

If I run the following code on my PC it works fine, but on the HPC server I am getting an error with the same data. Can you please help me with this?

timestamps = review_data.timestamp.to_list()
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
topic_over_time = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=100)
topic_over_time.write_html(filename.split('.')[0] + "_customize3_Model_topicovertime.html")

Error

topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)

File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/_bertopic.py", line 457, in topics_over_time
  format=datetime_format)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 887, in to_datetime
  values = convert_listlike(arg._values, format)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 408, in _convert_listlike_datetimes
  allow_object=True,
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2193, in objects_to_datetime64ns
  raise err
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2182, in objects_to_datetime64ns
  allow_mixed=allow_mixed,
File "pandas/_libs/tslib.pyx", line 379, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 611, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 749, in pandas._libs.tslib._array_to_datetime_object
File "pandas/_libs/tslib.pyx", line 740, in pandas._libs.tslib._array_to_datetime_object
File "pandas/_libs/tslibs/parsing.pyx", line 257, in pandas._libs.tslibs.parsing.parse_datetime_string
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1368, in parse
  return DEFAULTPARSER.parse(timestr, **kwargs)
File "/home/muhammad.javed/.local/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 643, in parse
  raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: timestamp

MaartenGr commented 3 years ago

If you run into issues when switching environments, it is most likely a package version mismatch. Did you make sure to use the same versions of packages between environments? Also, is the code exactly the same between environments?

mjavedgohar commented 1 year ago

Hi @MaartenGr ,

I am trying to use Guided Topic Modeling with the following code. It works fine in Colab notebooks, but I get an error on my local machine. I am using BERTopic 0.12.0. Can you please help me with this? Thanks

Code:

topic_model = BERTopic(language="english", verbose=True, seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)

Error:

  topics, probs = topic_model.fit_transform(docs)
File "...\Local\Programs\Python\Python38\lib\site-packages\bertopic\_bertopic.py", line 344, in fit_transform
  y, embeddings = self._guided_topic_modeling(embeddings)
File "...\Local\Programs\Python\Python38\lib\site-packages\bertopic\_bertopic.py", line 2376, in _guided_topic_modeling
  embeddings[indices] = np.average([embeddings[indices], seed_topic_embeddings[seed_topic]], weights=[3, 1])
File "<__array_function__ internals>", line 5, in average
File "..\Local\Programs\Python\Python38\lib\site-packages\numpy\lib\function_base.py", line 407, in average
  scl = wgt.sum(axis=axis, dtype=result_dtype)
File "..\Local\Programs\Python\Python38\lib\site-packages\numpy\core\_methods.py", line 47, in _sum
  return umr_sum(a, axis, dtype, out, keepdims, initial, where)
TypeError: No loop matching the specified signature and casting was found for ufunc add

MaartenGr commented 1 year ago

When you are working across different environments, then there might be an issue with the packages that you have installed. I would advise starting from a completely fresh environment and re-installing everything there. From your code, it seems that Numpy might be the culprit here, so I would think that a fresh environment might solve the issue.

elenacandellone commented 1 year ago

Hi @MaartenGr,

I have two datasets (train and test), and I would like to predict the topics for both, while fitting only the first one. This is my code:

eps = 1e-6
min_sample = 1

embedding_model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

umap_model = UMAP(n_components=150, n_neighbors=50, random_state=42, metric="cosine")

hdbscan_model_arccos = HDBSCAN(
                            min_samples = min_sample,
                            min_cluster_size = 50, 
                            cluster_selection_epsilon = eps,
                            metric='cosine', algorithm = 'generic', cluster_selection_method = 'eom', 
                            prediction_data = True, core_dist_n_jobs=1)

vectorizer_model = CountVectorizer(vocabulary=vocab, 
                                max_features=10000,
                                stop_words = stopwords)

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

representation_model = MaximalMarginalRelevance(diversity=0.2)

topic_model= BERTopic(
        low_memory =True,
        language = 'spanish',
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model_arccos,
        vectorizer_model=vectorizer_model,
        ctfidf_model=ctfidf_model, 
        representation_model=representation_model,
        top_n_words = 20,
        n_gram_range = (1,3)
)

topics_train, probabilities_train = topic_model.fit_transform(docs_train, embeddings_train)

predicted_topics, predicted_probs = topic_model.transform(docs_test)

The problem is that, while doing the last step, I encountered the following error: attribute error: no prediction data was generated

I guess the problem is related to the fact that the function approximate_predict of hdbscan is unaware of the parameter cluster_selection_epsilon. Would you happen to have any idea on how to solve this issue?

Thanks in advance!

MaartenGr commented 1 year ago

@elenacandellone Hmmm, it indeed seems to be related to HDBSCAN. If it is a bug with HDBSCAN that cannot be solved within that package, you can instead save the topic model as safetensors and then load it in. Saving with safetensors removes the underlying HDBSCAN and UMAP and does inference through the embeddings only. This should prevent the issue you are having.
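The save-and-reload route can be sketched roughly as below. It assumes a BERTopic version that supports safetensors serialization; the wrapper names and the directory name are made up, and the embedding model is passed as a name string so that it can be fetched again when loading.

```python
def save_for_inference(topic_model, path, embedding_model_name):
    """Save a fitted BERTopic model without its UMAP/HDBSCAN components."""
    topic_model.save(
        path,
        serialization="safetensors",
        save_ctfidf=True,
        save_embedding_model=embedding_model_name,
    )


def load_for_inference(path):
    """Load the saved model; transform() then works from embeddings only."""
    from bertopic import BERTopic  # imported lazily, needs bertopic installed

    return BERTopic.load(path)


# Intended usage (not executed here; "my_topic_model" is a hypothetical directory):
#   save_for_inference(topic_model, "my_topic_model",
#                      "sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
#   loaded_model = load_for_inference("my_topic_model")
#   predicted_topics, predicted_probs = loaded_model.transform(docs_test)
```

Because the clustering model is no longer involved, predictions from the reloaded model may differ slightly from what HDBSCAN itself would have assigned.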

ChristinaBarz commented 5 months ago

Hi :) I just trained a model and am trying to proceed with dynamic topic modeling. I tried everything I could think of but keep receiving this: ValueError: All arrays must be of the same length. However, my docs and timestamps seem to have the same lengths. I am thankful for any help, best!!

My first try: final_df['UTC Date'] = pd.to_datetime(final_df['UTC Date'], format="%Y-%m-%d %H:%M:%S")

docs = final_df['Tweet'].tolist()
timestamps = final_df['UTC Date'].tolist()

topics_over_time = topic_model.topics_over_time(docs, timestamps, datetime_format="%Y-%m-%d %H:%M:%S", nr_bins=20)

My last try in which I transformed the UTC Date to timestamp format looked like this: final_df['UTC Date'] = pd.to_datetime(final_df['UTC Date'], errors='coerce', format="%Y-%m-%d %H:%M:%S")

final_df = final_df.dropna(subset=['UTC Date'])

timestamps = final_df['UTC Date'].astype(int) // 10**9

docs = final_df['Tweet'].tolist()

print("Initial length of docs:", len(docs)) print("Initial length of timestamps:", len(timestamps))

assert len(docs) == len(timestamps), "The lengths of docs and timestamps do not match!"

topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=20)

Output: Initial length of docs: 53291 Initial length of timestamps: 53291 ... ValueError: All arrays must be of the same length.

Also, here is the code from the trained model: docs = final_df['Tweet'].tolist()

embedding_model = SentenceTransformer("all-MiniLM-L6-v2") embeddings = embedding_model.encode(docs, show_progress_bar=True)

os.environ["OMP_NUM_THREADS"] = "1" os.environ["OPENBLAS_NUM_THREADS"] = "1" os.environ["MKL_NUM_THREADS"] = "1" os.environ["VECLIB_MAXIMUM_THREADS"] = "1" os.environ["NUMEXPR_NUM_THREADS"] = "1"

umap_model = UMAP(random_state=777, n_neighbors=50)

hdbscan_model = HDBSCAN(min_cluster_size=100, metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_samples=3)

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

keybert_model = KeyBERTInspired() pos_model = PartOfSpeech("en_core_web_sm") mmr_model = MaximalMarginalRelevance(diversity=0.3)

representation_model = { "KeyBERT": keybert_model, "MMR": mmr_model, "POS": pos_model }

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, ctfidf_model=ctfidf_model, embedding_model=embedding_model, vectorizer_model=vectorizer_model, representation_model=representation_model)

with parallel_backend('loky'): topics, probs = topic_model.fit_transform(docs)

MaartenGr commented 5 months ago

@ChristinaBarz Which version of BERTopic are you using? Also, can you format your code in markdown using those ``` tags? It is quite difficult to read.

ChristinaBarz commented 5 months ago

Hi @MaartenGr! Thanks for trying to help me out here :) I'm using version 0.16.2 and here is my code again (sorry about that):

My first try:

final_df['UTC Date'] = pd.to_datetime(final_df['UTC Date'], format="%Y-%m-%d %H:%M:%S")

docs = final_df['Tweet'].tolist()
timestamps = final_df['UTC Date'].tolist()

topics_over_time = topic_model.topics_over_time(docs, timestamps, datetime_format="%Y-%m-%d %H:%M:%S", nr_bins=20)

My last try, in which I transformed the UTC Date to timestamp format, looked like this:

final_df['UTC Date'] = pd.to_datetime(final_df['UTC Date'], errors='coerce', format="%Y-%m-%d %H:%M:%S")

final_df = final_df.dropna(subset=['UTC Date'])

timestamps = final_df['UTC Date'].astype(int) // 10**9

docs = final_df['Tweet'].tolist()

print("Initial length of docs:", len(docs))
print("Initial length of timestamps:", len(timestamps))

assert len(docs) == len(timestamps), "The lengths of docs and timestamps do not match!"

topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=20)

Output: Initial length of docs: 53291 Initial length of timestamps: 53291 ... ValueError: All arrays must be of the same length.

Also, here is the code from the trained model:

docs = final_df['Tweet'].tolist()

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

umap_model = UMAP(random_state=777, n_neighbors=50)

hdbscan_model = HDBSCAN(min_cluster_size=100, metric='euclidean',
cluster_selection_method='eom', prediction_data=True, min_samples=3)

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

keybert_model = KeyBERTInspired()
pos_model = PartOfSpeech("en_core_web_sm")
mmr_model = MaximalMarginalRelevance(diversity=0.3)

representation_model = {
"KeyBERT": keybert_model,
"MMR": mmr_model,
"POS": pos_model
}

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, ctfidf_model=ctfidf_model, embedding_model=embedding_model, vectorizer_model=vectorizer_model, representation_model=representation_model)

with parallel_backend('loky'):
    topics, probs = topic_model.fit_transform(docs)

MaartenGr commented 5 months ago

@ChristinaBarz It is not clear from your code, but are the docs you used to train the model (.fit_transform(docs)) the same documents as the ones you are using for dynamic topic modeling (.topics_over_time(docs))? They need to be the same documents or, at the very least, the same size.
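A quick way to check this is to compare three lengths rather than two: the docs, the timestamps, and the topics stored on the fitted model. The sketch below is a generic pre-check with a made-up helper name; it assumes that, when no topics are passed to topics_over_time(), the fitted topic_model.topics_ are the third array involved.

```python
def check_alignment(docs, timestamps, fitted_topics):
    """Raise a descriptive error unless all three inputs have equal length."""
    sizes = {
        "docs": len(docs),
        "timestamps": len(timestamps),
        "fitted topics": len(fitted_topics),
    }
    if len(set(sizes.values())) != 1:
        raise ValueError(f"Length mismatch: {sizes}")
    return sizes["docs"]


# Intended usage before calling topics_over_time():
#   check_alignment(docs, timestamps, topic_model.topics_)
n_docs = check_alignment(["a", "b"], ["2021-01-01", "2021-01-02"], [0, 1])
```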

ChristinaBarz commented 5 months ago

@MaartenGr yes they are exactly the same!

MaartenGr commented 4 months ago

@ChristinaBarz Hmmm, then I'm not sure. Is the code you shared the complete code? You didn't load and save the model in between?

ChristinaBarz commented 4 months ago

@MaartenGr thank you for your help :) I found the problem! Unfortunately, a few rows had a divergent date format..
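For anyone running into the same thing, rows with a divergent format can be located before any conversion by parsing each raw value against the expected format. A minimal standard-library sketch; the helper name and the sample values are made up, and the default format string is the one used in the code above.

```python
from datetime import datetime


def find_bad_timestamps(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (index, value) pairs for entries that do not match fmt."""
    bad = []
    for index, value in enumerate(values):
        try:
            datetime.strptime(str(value), fmt)
        except ValueError:
            bad.append((index, value))
    return bad


values = [
    "2021-03-01 10:15:00",
    "01/03/2021 10:15",      # different date format
    "2021-03-02 08:00:00",
    "timestamp",             # e.g. a stray header row
]
bad = find_bad_timestamps(values)
```

Running this on the raw date column (before pd.to_datetime) shows which rows need fixing or dropping, together with their documents, so that docs and timestamps stay aligned.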

MaartenGr / BERTopic

Train and Predict BERTopic #278
