Closed mjavedgohar closed 3 years ago
You are correct. You can use fit_transform() to train the model. The transform() function is indeed used for prediction. Do note that fit_transform() not only trains the model but also predicts the data on which it was trained.
In practice, I would try to combine as many sources as possible before training the model. If you have various datasets, then you can simply combine them and train on all of them. However, if there is a specific reason for training on only a single dataset and predicting for all others, then that is also possible. I can imagine it could be computationally expensive to train on all datasets or that you only want the topics from a single source represented. In those cases, it should be fine to train on a single dataset, although training on all of them is preferred.
This depends on the content of the large paragraphs. If you feel or assume that those paragraphs may contain multiple topics, then I would advise splitting them up into sentences. You can use spaCy for this. However, if you think that there is only a single topic in the large paragraph, then there is no need to split it up.
If possible, I would first try training on the data without splitting it up into sentences and see if the resulting topics make sense. If not, then a sentence splitter would be your next step.
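As a rough illustration of the splitting step, the sketch below breaks long comments into sentences while remembering which comment each sentence came from. It uses a naive regular expression as a stand-in for spaCy's sentence segmentation (an assumption made so that the example has no dependencies); the function name `split_into_sentences` and the sample comments are hypothetical.

```python
import re

# Naive boundary: whitespace that follows '.', '!' or '?'.
# This is only a stand-in for a real sentence splitter such as spaCy's.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(docs):
    """Return (sentences, source_ids); source_ids[i] is the index of the
    document that sentences[i] was taken from."""
    sentences, source_ids = [], []
    for doc_id, doc in enumerate(docs):
        for sentence in _SENTENCE_BOUNDARY.split(doc.strip()):
            if sentence:  # skip empty strings produced by blank documents
                sentences.append(sentence)
                source_ids.append(doc_id)
    return sentences, source_ids


comments = [
    "The battery is great. Shipping took three weeks!",
    "Support never answered my email.",
]
sentences, source_ids = split_into_sentences(comments)
```

The `sentences` list can then be passed to fit_transform() in place of the original paragraphs, and `source_ids` makes it possible to map the resulting topics back to the original comments.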
@MaartenGr Thanks for your reply. Does that mean that once I have trained the model, I can save it and use transform() on other datasets, just like other ML models? Is there any method to evaluate the trained BERTopic model?
Yes! You can train the model and save it for other datasets just like other ML models. Do note that it is important that the versions of packages stay the same when switching between environments. Most issues related to model loading can be solved by looking at the environment.
This is actually quite a complex subject. Although there are methods that you can employ for evaluation, such as c_v, they suffer from a number of issues. Topic modeling creates a highly subjective output, and evaluating that output is quite difficult. Do you focus on the topic coherence, its clustering capabilities, predictive power, or anything else? Those questions, in part, are what make it difficult.
So while I am definitely not against evaluation metrics, I do think it is important to realize that they by no means represent a ground truth and can be misleading in some cases.
You can look towards Gensim or OCTIS for evaluation metrics/functions/libraries.
Hi @MaartenGr,
Thanks for your help. One more request: I am using get_representative_docs() to get the representative docs, but it returns only three. Is there any way to get a required number n of docs for a specific topic?
Thanks again
There are several reasons for using a fixed value. First, the value needs to be equal to or lower than min_topic_size, otherwise issues may result. Second, allowing for the top n can quickly lead to simply saving all documents in the model, which makes the model explode in size. Third, three documents should give you enough of an idea to understand what the topic is about. Any more than that is typically redundant. Fourth, whenever topics are reduced, the representative documents are simply put together. In other words, if you merge 4 topics, then the new topic will contain 3 times 4 = 12 representative documents. Increasing n will again lead to too many representative documents for a single topic.
Hi @MaartenGr, it was working fine, but since this morning I am getting the following error when I try to load the BERTopic model in a Google Colab notebook. The error occurs at the line "from bertopic import BERTopic". Can you please help me with this error?
ERROR:
TypeError Traceback (most recent call last)
This is an issue that quite randomly popped up. Fortunately, some fixes can be found here. Most likely, just running either pip uninstall distributed or pip install distributed==2021.9.0 will fix your issue. Hopefully, I can get to the bottom of this and fix it in the next release.
@mjavedgohar A new version of BERTopic (v0.9.3) was released that should fix this issue and some others that should be helpful. You can install that version through pip install --upgrade bertopic. If you have any questions regarding this issue, release, or some other issue, please let me know!
@MaartenGr Thanks for your help. I am following the steps below for training and predicting. Is this OK for topic modelling using BERTopic? In prediction, however, the training docs are also included. I want to predict on only new docs.
Training:
topic_model = BERTopic(low_memory=True,
calculate_probabilities=False,
nr_topics="auto",
verbose=False,
embedding_model=model, # using a pre-trained BERT model
n_gram_range=(1, 3),
vectorizer_model=CountVectorizer(ngram_range=(1, 3),
stop_words=final_stop_words,
min_df=0.05,
max_df=0.90,
))
Prediction:
Yes, it should be okay to train on your training docs and then predict on only new docs.
@MaartenGr I am getting the same topics in prediction as in training using the above parameters. Can you please help me resolve this? In prediction, I want to display the topics from the new docs only. Or do I have to call fit_transform() for every dataset?
If you are getting the same topics, then you are most likely predicting on the same documents as the ones you trained on. Typically, the workflow is something like this:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# We create a split between the documents that we train on and those that we predict
train_docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))['data']
test_docs = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))['data']
# Train the model only on the train_docs
topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L3-v2", verbose=True)
topics, probs = topic_model.fit_transform(train_docs)
# Predict topics for test_docs
predicted_topics, predicted_probs = topic_model.transform(test_docs)
@MaartenGr Thanks for your help. I used the same code you shared, but I am still getting the same topics when using topic_model.get_topic_info():
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
train_docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))['data']
test_docs = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic(embedding_model="paraphrase-MiniLM-L3-v2", verbose=True)
topics, probs = topic_model.fit_transform(train_docs)
topic_model.get_topic_info()
predicted_topics, predicted_probs = topic_model.transform(test_docs)
topic_model.get_topic_info()
Ah, when you use transform, the model will not be trained. It will only predict which topics can be found in test_docs based on the topics trained on train_docs. This is the same with most models in general that have a transform or predict function. No changes will be made to the original model.
If you want to have new topics, then you need to re-train the model with all documents.
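To see which topics show up in the new documents only, one option is to count the topic ids that transform() returned, since get_topic_info() keeps describing the data the model was trained on. A minimal sketch, assuming predicted_topics is the list of topic ids returned by transform(); the values below are made up for illustration and the helper name `topic_counts` is hypothetical.

```python
from collections import Counter


def topic_counts(predicted_topics):
    """Count how often each topic id was assigned, most frequent first."""
    return Counter(predicted_topics).most_common()


# Made-up stand-in for the first return value of topic_model.transform(test_docs);
# -1 is the id BERTopic uses for outliers.
predicted_topics = [3, 0, 3, -1, 7, 3, 0]

for topic_id, count in topic_counts(predicted_topics):
    print(f"topic {topic_id}: {count} new documents")
```

Each topic id can then be looked up in the trained model, for example with topic_model.get_topic(topic_id), to see the words that describe it.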
Hi @MaartenGr ,
I trained a BERTopic model on an HPC (server) and saved it. Now I am trying to load it in a Google Colab notebook for visualization, but I am getting the following error on topic_model.load("model name"):
ValueError: EOF: reading array data, expected 262144 bytes got 815
Can you please help resolve this issue? What is the best way to train a model on an HPC server and visualize it?
Thanks
The most important thing when loading a model is making sure that the environment is the same. So, make sure that the packages and versions used in the saving environment are the same as in the loading environment. For example, if you are using sentence-transformers v0.4.1 when saving the model, it is highly advised to use the same version when loading the model.
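One way to make version mismatches easier to spot is to record the package versions in the saving environment and compare them with the loading environment. The sketch below uses only the standard library; the function name `snapshot_versions` and the package list are assumptions for illustration, not part of BERTopic.

```python
import json
from importlib.metadata import PackageNotFoundError, version


def snapshot_versions(packages):
    """Map each package name to its installed version (None if not installed)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


# Hypothetical list of packages that are likely to matter when loading a model.
packages = ["bertopic", "sentence-transformers", "umap-learn", "hdbscan", "numpy"]
snapshot = snapshot_versions(packages)

# Run this in both environments and compare the two outputs.
print(json.dumps(snapshot, indent=2))
```

Any package whose version differs between the two outputs is a candidate for pinning to the version used when the model was saved.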
Hi @MaartenGr,
Thanks for your help. I am getting the following error when trying to visualize the topics over time. Can you please help me with this?
Code:
timestamps = review_data.timestamp.to_list()
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
Error: ValueError: arrays must all be same length
I cannot be sure without having your entire code, but it seems that topics, docs, and timestamps are not the same size.
hi @MaartenGr ,
Thanks for your help. I am using comments extracted from Reddit. The following is the code to generate topics.
Code:
docs = review_data.body.to_list()
docs = list(set(docs))
print("Embedding models")
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')  # alternative: 'digitalepidemiologylab/covid-twitter-bert-v2-mnli'
import umap
umap_model = umap.UMAP(n_neighbors=100,  # size of the neighbourhood
                       n_components=10,  # dimensionality
                       min_dist=0.1,  # the default value for min_dist is 0.1
                       metric='cosine',
                       low_memory=False)
import hdbscan
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=50,
                                min_samples=1,
                                metric='euclidean',
                                cluster_selection_method='eom',
                                prediction_data=True)
topic_model = BERTopic(top_n_words=10,
n_gram_range=(1,3),
calculate_probabilities=True,
umap_model= umap_model,
hdbscan_model=hdbscan_model,
nr_topics="auto",
verbose=True,
embedding_model=model,
vectorizer_model=CountVectorizer(ngram_range=(1, 3),
stop_words=final_stop_words
))
topics, probabilities = topic_model.fit_transform(docs)
timestamps = review_data.timestamp.to_list()
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
Error:
File "Bert_topic_customized2.py", line 305, in
Yes, as I mentioned before, it seems that timestamps is a different size from docs and topics.
You are taking the set here:
docs = review_data.body.to_list()
docs = list(set(docs))
Which most likely reduces the number of docs and may shuffle the documents. Then, you take
timestamps = review_data.timestamp.to_list()
Which is larger than your docs. Thus, make sure that your docs and timestamps have the same size and that each index corresponds to one another. If there are 10_000 documents in docs, there should be 10_000 entries in timestamps. Moreover, index 0 of docs should correspond to index 0 of timestamps.
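A minimal sketch of removing duplicates while keeping the two lists aligned: instead of calling set() on the documents alone, duplicates are dropped from (document, timestamp) pairs, so the order is preserved and every document keeps its own timestamp. The function name `dedupe_aligned` and the sample data are hypothetical.

```python
def dedupe_aligned(docs, timestamps):
    """Drop repeated documents, keeping the first occurrence and its timestamp."""
    if len(docs) != len(timestamps):
        raise ValueError("docs and timestamps must have the same length")
    seen = set()
    unique_docs, unique_timestamps = [], []
    for doc, ts in zip(docs, timestamps):
        if doc not in seen:
            seen.add(doc)
            unique_docs.append(doc)
            unique_timestamps.append(ts)
    return unique_docs, unique_timestamps


docs = ["great game", "bad refs", "great game"]
timestamps = ["2021-10-01", "2021-10-02", "2021-10-05"]
docs, timestamps = dedupe_aligned(docs, timestamps)
```

With a pandas DataFrame, calling drop_duplicates(subset="body") before extracting both columns should have a similar effect, assuming the documents live in a column named body as in the code above.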
Thanks @MaartenGr, it worked. Just another thing to discuss: I am getting the following error when the number of topics is very low. Can visualize the topics >=2
    fig = topic_model.visualize_topics()
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/_bertopic.py", line 909, in visualize_topics
    height=height)
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/plotting/_topics.py", line 63, in visualize_topics
    embeddings = UMAP(n_neighbors=2, n_components=2, metric='hellinger').fit_transform(embeddings)
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap.py", line 2634, in fit_transform
    self.fit(X, y)
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap.py", line 2554, in fit
    self._raw_data[index], n_epochs, init, random_state,  # JH why raw data?
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap.py", line 2601, in _fit_embed_data
    self.verbose,
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/umap.py", line 1060, in simplicial_set_embedding
    metric_kwds=metric_kwds,
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/umap/spectral.py", line 334, in spectral_layout
    maxiter=graph.shape[0] * 5,
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1598, in eigsh
    raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.
That is more an issue with the number of topics than necessarily the method. Typically, BERTopic would result in tens or hundreds of topics. Any less and you likely have too little data to work with, or you have set min_topic_size too high. I would advise trying to increase the number of topics, as that would most likely be the best representation of the data.
Thanks @MaartenGr,
If I run the following code on my PC, it works fine, but on the HPC (server) I am getting an error with the same data. Can you please help me with this?
timestamps = review_data.timestamp.to_list()
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
topic_over_time = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=100)
topic_over_time.write_html(filename.split('.')[0] + "_customize3_Model_topicovertime.html")
Error
topics_over_time = topic_model.topics_over_time(docs, topics, timestamps, nr_bins=10)
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/bertopic/_bertopic.py", line 457, in topics_over_time
    format=datetime_format)
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 887, in to_datetime
    values = convert_listlike(arg._values, format)
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 408, in _convert_listlike_datetimes
    allow_object=True,
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2193, in objects_to_datetime64ns
    raise err
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2182, in objects_to_datetime64ns
    allow_mixed=allow_mixed,
  File "pandas/_libs/tslib.pyx", line 379, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 611, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 749, in pandas._libs.tslib._array_to_datetime_object
  File "pandas/_libs/tslib.pyx", line 740, in pandas._libs.tslib._array_to_datetime_object
  File "pandas/_libs/tslibs/parsing.pyx", line 257, in pandas._libs.tslibs.parsing.parse_datetime_string
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1368, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/muhammad.javed/.local/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 643, in parse
    raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: timestamp
If you run into issues when switching environments, it is most likely a package version issue. Did you make sure to use the same versions of packages between environments? Also, is the code exactly the same between environments?
Hi @MaartenGr ,
I am trying to use Guided Topic Modeling with the following code. It works fine in Colab notebooks, but I am getting an error on my local machine. I am using BERTopic 0.12.0. Can you please help me with this? Thanks
Code:
topic_model = BERTopic(language="english", verbose=True, seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)
Error:
    topics, probs = topic_model.fit_transform(docs)
  File "...\Local\Programs\Python\Python38\lib\site-packages\bertopic\_bertopic.py", line 344, in fit_transform
    y, embeddings = self._guided_topic_modeling(embeddings)
  File "...\Local\Programs\Python\Python38\lib\site-packages\bertopic\_bertopic.py", line 2376, in _guided_topic_modeling
    embeddings[indices] = np.average([embeddings[indices], seed_topic_embeddings[seed_topic]], weights=[3, 1])
  File "<__array_function__ internals>", line 5, in average
  File "..\Local\Programs\Python\Python38\lib\site-packages\numpy\lib\function_base.py", line 407, in average
    scl = wgt.sum(axis=axis, dtype=result_dtype)
  File "..\Local\Programs\Python\Python38\lib\site-packages\numpy\core\_methods.py", line 47, in _sum
    return umr_sum(a, axis, dtype, out, keepdims, initial, where)
TypeError: No loop matching the specified signature and casting was found for ufunc add
When you are working across different environments, then there might be an issue with the packages that you have installed. I would advise starting from a completely fresh environment and re-installing everything there. From your code, it seems that Numpy might be the culprit here, so I would think that a fresh environment might solve the issue.
Hi @MaartenGr,
I have two datasets (train and test), and I would like to predict the topics for both, while fitting only the first one. This is my code:
eps = 1e-6
min_sample = 1
embedding_model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
umap_model = UMAP(n_components=150, n_neighbors=50, random_state=42, metric="cosine")
hdbscan_model_arccos = HDBSCAN(
min_samples = min_sample,
min_cluster_size = 50,
cluster_selection_epsilon = eps,
metric='cosine', algorithm = 'generic', cluster_selection_method = 'eom',
prediction_data = True, core_dist_n_jobs=1)
vectorizer_model = CountVectorizer(vocabulary=vocab,
max_features=10000,
stop_words = stopwords)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = MaximalMarginalRelevance(diversity=0.2)
topic_model= BERTopic(
low_memory =True,
language = 'spanish',
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=hdbscan_model_arccos,
vectorizer_model=vectorizer_model,
ctfidf_model=ctfidf_model,
representation_model=representation_model,
top_n_words = 20,
n_gram_range = (1,3)
)
topics_train, probabilities_train = topic_model.fit_transform(docs_train, embeddings_train)
predicted_topics, predicted_probs = topic_model.transform(docs_test)
The problem is that, while doing the last step, I encountered the following error: attribute error: no prediction data was generated
I guess the problem is related to the fact that the function approximate_predict of hdbscan is unaware of the parameter cluster_selection_epsilon. Would you happen to have any idea on how to solve this issue?
Thanks in advance!
@elenacandellone Hmmm, it indeed seems to be related to HDBSCAN. If it is a bug with HDBSCAN that cannot be solved within that package, you can instead save the topic model as safetensors and then load it in. Saving with safetensors removes the underlying HDBSCAN and UMAP and does inference through the embeddings only. This should prevent the issue you are having.
Hi :) I just trained a model and am trying to proceed with dynamic topic modeling. I tried everything I could think of but keep receiving this: ValueError: All arrays must be of the same length. However, my docs and timestamps seem to have the same lengths. I am thankful for any help, best!!
My first try: final_df['UTC Date'] = pd.to_datetime(final_df['UTC Date'], format="%Y-%m-%d %H:%M:%S")
docs = final_df['Tweet'].tolist()
timestamps = final_df['UTC Date'].tolist()
topics_over_time = topic_model.topics_over_time(docs, timestamps, datetime_format="%Y-%m-%d %H:%M:%S", nr_bins=20)
My last try in which I transformed the UTC Date to timestamp format looked like this: final_df['UTC Date'] = pd.to_datetime(final_df['UTC Date'], errors='coerce', format="%Y-%m-%d %H:%M:%S")
final_df = final_df.dropna(subset=['UTC Date'])
timestamps = final_df['UTC Date'].astype(int) // 10**9
docs = final_df['Tweet'].tolist()
print("Initial length of docs:", len(docs)) print("Initial length of timestamps:", len(timestamps))
assert len(docs) == len(timestamps), "The lengths of docs and timestamps do not match!"
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=20)
Output: Initial length of docs: 53291 Initial length of timestamps: 53291 ... ValueError: All arrays must be of the same length.
Also, here is the code from the trained model: docs = final_df['Tweet'].tolist()
embedding_model = SentenceTransformer("all-MiniLM-L6-v2") embeddings = embedding_model.encode(docs, show_progress_bar=True)
os.environ["OMP_NUM_THREADS"] = "1" os.environ["OPENBLAS_NUM_THREADS"] = "1" os.environ["MKL_NUM_THREADS"] = "1" os.environ["VECLIB_MAXIMUM_THREADS"] = "1" os.environ["NUMEXPR_NUM_THREADS"] = "1"
umap_model = UMAP(random_state=777, n_neighbors=50)
hdbscan_model = HDBSCAN(min_cluster_size=100, metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_samples=3)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
keybert_model = KeyBERTInspired() pos_model = PartOfSpeech("en_core_web_sm") mmr_model = MaximalMarginalRelevance(diversity=0.3)
representation_model = { "KeyBERT": keybert_model, "MMR": mmr_model, "POS": pos_model }
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, ctfidf_model=ctfidf_model, embedding_model=embedding_model, vectorizer_model=vectorizer_model, representation_model=representation_model)
with parallel_backend('loky'): topics, probs = topic_model.fit_transform(docs)
@ChristinaBarz Which version of BERTopic are you using? Also, can you format your code in markdown using those ``` tags? It is quite difficult to read.
Hi @MaartenGr! Thanks for trying to help me out here :) I'm using version 0.16.2 and here is my code again (sorry about that):
My first try:
final_df['UTC Date'] = pd.to_datetime(final_df['UTC Date'], format="%Y-%m-%d %H:%M:%S")
docs = final_df['Tweet'].tolist()
timestamps = final_df['UTC Date'].tolist()
topics_over_time = topic_model.topics_over_time(docs, timestamps, datetime_format="%Y-%m-%d %H:%M:%S", nr_bins=20)
My last try, in which I transformed the UTC Date to timestamp format, looked like this:
final_df['UTC Date'] = pd.to_datetime(final_df['UTC Date'], errors='coerce', format="%Y-%m-%d %H:%M:%S")
final_df = final_df.dropna(subset=['UTC Date'])
timestamps = final_df['UTC Date'].astype(int) // 10**9
docs = final_df['Tweet'].tolist()
print("Initial length of docs:", len(docs))
print("Initial length of timestamps:", len(timestamps))
assert len(docs) == len(timestamps), "The lengths of docs and timestamps do not match!"
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=20)
Output: Initial length of docs: 53291 Initial length of timestamps: 53291 ... ValueError: All arrays must be of the same length.
Also, here is the code from the trained model:
docs = final_df['Tweet'].tolist()
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
umap_model = UMAP(random_state=777, n_neighbors=50)
hdbscan_model = HDBSCAN(min_cluster_size=100, metric='euclidean',
cluster_selection_method='eom', prediction_data=True, min_samples=3)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
keybert_model = KeyBERTInspired()
pos_model = PartOfSpeech("en_core_web_sm")
mmr_model = MaximalMarginalRelevance(diversity=0.3)
representation_model = {
"KeyBERT": keybert_model,
"MMR": mmr_model,
"POS": pos_model
}
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, ctfidf_model=ctfidf_model, embedding_model=embedding_model, vectorizer_model=vectorizer_model, representation_model=representation_model)
with parallel_backend('loky'):
topics, probs = topic_model.fit_transform(docs)
@ChristinaBarz It is not clear from your code, but are the docs you used to train the model (.fit_transform(docs)) the same documents as the ones you are using for dynamic topic modeling (.topics_over_time(docs))? They need to be the same documents or, at the very least, the same size.
@MaartenGr yes they are exactly the same!
@ChristinaBarz Hmmm, then I'm not sure. Is the code you shared the complete code? You didn't load and save the model in between?
@MaartenGr thank you for your help :) I found the problem! Unfortunately, a few rows had a divergent date format..
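Rows with a divergent date format can be located before calling topics_over_time() by trying to parse every value with the expected format. A minimal sketch using only the standard library; the function name `find_bad_timestamps` and the sample values are hypothetical.

```python
from datetime import datetime


def find_bad_timestamps(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (index, value) pairs for entries that do not match fmt."""
    bad = []
    for i, value in enumerate(values):
        try:
            datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            # ValueError: wrong format; TypeError: not a string (e.g. None)
            bad.append((i, value))
    return bad


raw_dates = ["2023-01-05 10:15:00", "05/01/2023 10:15", "2023-01-06 08:00:00"]
print(find_bad_timestamps(raw_dates))
```

The returned indices point at the rows to fix or drop. Dropping them from the documents and the timestamps together keeps both lists the same length.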
Hi @MaartenGr ,
As I understand BERTopic, fit_transform() is used to train the model while transform() is used for prediction. Am I right? What is the best method to train the model on data from different sources, e.g. Twitter, Reddit, Facebook comments, etc.? I want to train the model once and use it for various datasets. Should I divide the data into sentences, because some sources have very large comments (paragraphs), e.g. Reddit or news articles?
Thanks