zhimin-z opened this issue 1 year ago
I am using bertopic==0.15.0, so I think this is related to the breaking update described here: https://stackoverflow.com/questions/76631305/attributeerror-cant-get-attribute-euclideandistance-on-module-sklearn-metr
For those who still use scikit-learn==1.2.2, that resolves the issue. @MaartenGr
However, if I upgrade to scikit-learn==1.3.0, it gives another error instead:
(.venv) 21zz42@docjk-gpu-02:~/Asset-Management-Topic-Modeling$ python /home/21zz42/Asset-Management-Topic-Modeling/Code/RQ1/best_model.py
Traceback (most recent call last):
File "/home/21zz42/Asset-Management-Topic-Modeling/Code/RQ1/best_model.py", line 5, in <module>
from bertopic import BERTopic
File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/bertopic/__init__.py", line 1, in <module>
from bertopic._bertopic import BERTopic
File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 37, in <module>
import hdbscan
File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/hdbscan/__init__.py", line 1, in <module>
from .hdbscan_ import HDBSCAN, hdbscan
File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/hdbscan/hdbscan_.py", line 40, in <module>
FAST_METRICS = KDTree.valid_metrics + BallTree.valid_metrics + ["cosine", "arccos"]
TypeError: unsupported operand type(s) for +: 'builtin_function_or_method' and 'builtin_function_or_method'
I found that the only solution is to upgrade scikit-learn and hdbscan simultaneously. Would you mind making a new release accordingly?
It seems that this is a known issue for HDBSCAN which should already be fixed in their main branch. There is a new version of HDBSCAN, but there are some commits after that release. I believe this mostly relates to version controlling your environment when you pickle BERTopic. When using BERTopic v0.15, it is highly advised to use either pytorch or safetensors to save the model. These formats are more robust to changing environments and corresponding dependencies.
With respect to a new release, I think it is best to wait until HDBSCAN is a bit more stable, seeing as some individuals are still experiencing issues.
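For reference, a minimal sketch of that save/load flow; the directory name, example data, and embedding model choice here are just placeholders:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# placeholder example data; any fitted BERTopic model works here
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"][:500]
topic_model = BERTopic(embedding_model="sentence-transformers/all-mpnet-base-v2").fit(docs)

# save with safetensors (serialization="pytorch" works the same way); the embedding
# model is stored as a pointer, which makes loading less sensitive to the environment
topic_model.save(
    "my_model_dir",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model="sentence-transformers/all-mpnet-base-v2",
)

# load it back, optionally passing the embedding model explicitly
loaded_model = BERTopic.load(
    "my_model_dir",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
)
```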
How about adding safetensors as a dependency of BERTopic? Whenever I use it, it reports a missing package.
After installing the safetensors dependency, saving the model as safetensors, and loading it again, I get the following error:
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/umap_.py:660: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
Traceback (most recent call last):
File "/home/21zz42/Asset-Management-Topic-Modeling/Code/RQ1/best_model.py", line 24, in <module>
topic_model = BERTopic.load(os.path.join(path_model, model_name))
File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 3008, in load
topic_model = _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 4024, in _create_model_from_files
topic_model.vectorizer_model = CountVectorizer(**ctfidf_config["vectorizer_model"]["params"])
TypeError: CountVectorizer.__init__() got an unexpected keyword argument 'norm'
@zhimin-z Could you share your full code including how you saved and loaded the model? Also, are the environments in any way different between saving and loading the model?
Sure. This is the code for saving the model during the hyperparameter sweep:
import gensim.corpora as corpora
import pandas as pd
import wandb
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.coherencemodel import CoherenceModel
# from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP
path_dataset = 'Dataset'
path_model = os.path.join('Result', 'RQ1', 'Model')
if not os.path.exists(path_model):
os.makedirs(path_model)
wandb_project = 'asset-management-topic-modeling'
os.environ["WANDB_API_KEY"] = 'xxxxxxxx'
os.environ["TOKENIZERS_PARALLELISM"] = "true"
os.environ["WANDB__SERVICE_WAIT"] = "100"
# set default sweep configuration
config_defaults = {
# Refer to https://www.sbert.net/docs/pretrained_models.html
'model_name': 'sentence-transformers/all-mpnet-base-v2',
'metric_distance': 'cosine',
'calculate_probabilities': True,
# 'reduce_frequent_words': True,
'prediction_data': True,
'low_memory': False,
'min_cluster_size': 50,
'random_state': 42,
'ngram_range': 2
}
config_sweep = {
'method': 'grid',
'metric': {
'name': 'Coherence CV',
'goal': 'maximize',
},
'parameters': {
'n_components': {
'values': list(range(3,6)),
},
}
}
class TopicModeling:
def __init__(self, column_name):
# Initialize an empty list to store top models
self.top_models = []
self.path_model = path_model
df = pd.read_json(os.path.join(path_dataset, 'preprocessed.json'))
self.docs = df[df[column_name].map(len) > 0][column_name].tolist()
config_sweep['name'] = column_name
config_sweep['parameters']['min_samples'] = {
'values': list(range(1, config_defaults['min_cluster_size'] + 1)),
}
def __train(self):
# Initialize a new wandb run
with wandb.init() as run:
# update any values not set by sweep
run.config.setdefaults(config_defaults)
# Step 1 - Extract embeddings
embedding_model = SentenceTransformer(run.config.model_name)
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_components=wandb.config.n_components, metric=run.config.metric_distance, random_state=run.config.random_state, low_memory=run.config.low_memory)
# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=run.config.min_cluster_size, min_samples=wandb.config.min_samples, prediction_data=run.config.prediction_data)
# Step 4 - Tokenize topics
vectorizer_model = TfidfVectorizer(ngram_range=(1, run.config.ngram_range))
# Step 5 - Create topic representation
# ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=run.config.reduce_frequent_words)
# Step 6 - Fine-tune topic representation
representation_model = KeyBERTInspired()
# All steps together
topic_model = BERTopic(
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=hdbscan_model,
vectorizer_model=vectorizer_model,
# ctfidf_model=ctfidf_model,
representation_model=representation_model,
calculate_probabilities=run.config.calculate_probabilities
)
topics, _ = topic_model.fit_transform(self.docs)
# Preprocess Documents
documents = pd.DataFrame({
"Document": self.docs,
"ID": range(len(self.docs)),
"Topic": topics
})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)
# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()
# Extract features for Topic Coherence evaluation
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model.get_topic(topic)] for topic in range(len(set(topics))-1)]
coherence_cv = CoherenceModel(
topics=topic_words,
texts=tokens,
corpus=corpus,
dictionary=dictionary,
coherence='c_v'
)
coherence_umass = CoherenceModel(
topics=topic_words,
texts=tokens,
corpus=corpus,
dictionary=dictionary,
coherence='u_mass'
)
coherence_cuci = CoherenceModel(
topics=topic_words,
texts=tokens,
corpus=corpus,
dictionary=dictionary,
coherence='c_uci'
)
coherence_cnpmi = CoherenceModel(
topics=topic_words,
texts=tokens,
corpus=corpus,
dictionary=dictionary,
coherence='c_npmi'
)
coherence_cv = coherence_cv.get_coherence()
wandb.log({'Coherence CV': coherence_cv})
wandb.log({'Coherence UMASS': coherence_umass.get_coherence()})
wandb.log({'Coherence UCI': coherence_cuci.get_coherence()})
wandb.log({'Coherence NPMI': coherence_cnpmi.get_coherence()})
wandb.log({'Topic Number': topic_model.get_topic_info().shape[0] - 1})
wandb.log({'Uncategorized Post Number': topic_model.get_topic_info().at[0, 'Count']})
model_name = f'{config_sweep["name"]}_{run.id}'
topic_model.save(os.path.join(self.path_model, model_name), serialization="safetensors", save_ctfidf=True, save_embedding_model=config_defaults['model_name'])
def sweep(self):
wandb.login()
sweep_id = wandb.sweep(config_sweep, project=wandb_project)
wandb.agent(sweep_id, function=self.__train)
And this is the code for loading the model with the highest Coherence CV score:
import os
import pickle
import pandas as pd
from bertopic import BERTopic
path_rq1 = os.path.join('Result', 'RQ1')
path_model = os.path.join(path_rq1, 'Model')
embedding_model = 'sentence-transformers/all-mpnet-base-v2'
model_name = 'Challenge_preprocessed_gpt_summary_7of8v67c'
column = '_'.join(model_name.split('_')[:-1])
df = pd.read_json(os.path.join('Dataset', 'preprocessed.json'))
df['Challenge_topic'] = -1
indice = []
docs = []
for index, row in df.iterrows():
if pd.notna(row[column]) and len(row[column]):
indice.append(index)
docs.append(row[column])
topic_model = BERTopic.load(os.path.join(path_model, model_name), embedding_model=embedding_model)
topic_number = topic_model.get_topic_info().shape[0] - 1
topics, probs = topic_model.transform(docs)
# persist the topic terms
with open(os.path.join(path_rq1, 'Topic terms.pickle'), 'wb') as handle:
topic_terms = []
for i in range(topic_number):
topic_terms.append(topic_model.get_topic(i))
pickle.dump(topic_terms, handle, protocol=pickle.HIGHEST_PROTOCOL)
fig = topic_model.visualize_topics()
fig.write_html(os.path.join(path_rq1, 'Topic visualization.html'))
fig = topic_model.visualize_barchart(top_n_topics=topic_number, n_words=10)
fig.write_html(os.path.join(path_rq1, 'Term visualization.html'))
fig = topic_model.visualize_heatmap()
fig.write_html(os.path.join(path_rq1, 'Topic similarity visualization.html'))
# This uses the soft-clustering as performed by HDBSCAN to find the best matching topic for each outlier document.
topics_new = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities")
# persist the document topics
for index, topic in zip(indice, topics_new):
df.at[index, 'Challenge_topic'] = topic
df = df[df.columns.drop(list(df.filter(regex=r'preprocessed|gpt_summary')))]
df.to_json(os.path.join(path_rq1, 'topics.json'), indent=4, orient='records')
And these are my requirements:
bertopic==0.15.0
gensim==4.3.1
safetensors==0.3.1
wandb==0.15.8
hdbscan==0.8.33
scikit-learn==1.3.0
I am sure I did not touch any dependencies between saving and loading the model, so I suspect there is a breaking update somewhere in the requirements. @MaartenGr
I also tried the pytorch serialization mode to save/load the model. Saving works fine, but loading never does: TypeError: CountVectorizer.__init__() got an unexpected keyword argument 'norm'.
The only option that works with my requirements right now is BERTopic's default save.
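For reference, a sketch of the fallback I am using now, assuming a fitted `topic_model` from the sweep code above and a placeholder path:

```python
import os
from bertopic import BERTopic

# placeholder output path; `topic_model` is assumed to be a fitted BERTopic instance
path = os.path.join('Result', 'RQ1', 'Model', 'example_model')

# pytorch serialization saves fine, but loading it back raises
# TypeError: CountVectorizer.__init__() got an unexpected keyword argument 'norm':
#   topic_model.save(path, serialization="pytorch", save_ctfidf=True,
#                    save_embedding_model='sentence-transformers/all-mpnet-base-v2')
#   topic_model = BERTopic.load(path)

# the default pickle-based serialization is the only variant that round-trips for me
topic_model.save(path)
topic_model = BERTopic.load(path)
```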
Strange, I can't seem to reproduce the issue. I am also surprised that there is a 'norm' argument in the config. Could you show what is inside ctfidf_config.json? You can skip over the vocab; only the beginning is of interest here (ctfidf_model and vectorizer_model).
Aside from the above, it seems that it should work if you forego save_ctfidf=True, but that is of course not the most ideal solution.
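A minimal sketch of that workaround, assuming an already fitted `topic_model` and a placeholder directory name:

```python
from bertopic import BERTopic

# assumption: `topic_model` is an already fitted BERTopic instance
topic_model.save(
    "my_model_dir",                      # placeholder output directory
    serialization="safetensors",
    save_ctfidf=False,                   # skip the c-TF-IDF / vectorizer config entirely
    save_embedding_model="sentence-transformers/all-mpnet-base-v2",
)

# loading then avoids rebuilding the vectorizer from the saved config,
# at the cost of a more limited model
loaded_model = BERTopic.load("my_model_dir")
```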
ctfidf_config.json is too large, so here is the beginning of the file:
{
"ctfidf_model": {
"bm25_weighting": false,
"reduce_frequent_words": false
},
"vectorizer_model": {
"params": {
"analyzer": "word",
"binary": false,
"decode_error": "strict",
"encoding": "utf-8",
"input": "content",
"lowercase": true,
"max_df": 1.0,
"max_features": null,
"min_df": 1,
"ngram_range": [
1,
2
],
"norm": "l2",
"smooth_idf": true,
"stop_words": null,
"strip_accents": null,
"sublinear_tf": false,
"token_pattern": "(?u)\\b\\w\\w+\\b",
"use_idf": true,
"vocabulary": null
},
"vocab": {
"model": 80843,
"logs": 75753,
"reports": 112906,
"insights": 62909,
....
"norm": "l2",
Strange, that parameter should not be in the CountVectorizer class at all. Ah, now I see: you are using TfidfVectorizer instead of CountVectorizer. You should use CountVectorizer instead; that should solve your issue.
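A minimal sketch of that change, mirroring the ngram_range from the sweep config above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# plain bag-of-words vectorizer; the TF-IDF-style weighting is handled by
# BERTopic's c-TF-IDF step, not by the vectorizer itself
vectorizer_model = CountVectorizer(ngram_range=(1, 2))

topic_model = BERTopic(
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True,
)
```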
"norm": "l2",
Strange, that parameter should not be in the CountVectorizer class at all. Ah, now I see, you are using
TfidfVectorizer
instead ofCountVectorizer
. You should useCountVectorizer
instead, that should solve your issue.
Thanks for your fast reply. @MaartenGr I have some questions now:
1. Is TfidfVectorizer better than CountVectorizer in terms of preprocessing?
2. If I already trained the model with TfidfVectorizer, is there any way to load the trained model afterward with safetensors?

Is TfidfVectorizer better than CountVectorizer in terms of preprocessing?
You should actually not use the TfidfVectorizer since c-TF-IDF is applied on top of the vectorizer, which is expected to be a plain bag-of-words.
If I already trained the model with TfidfVectorizer, is there any way to load the trained model afterward with safetensors?
I think you can load the model if you remove all files belonging to the TfidfVectorizer. This would, however, create a more limited version of BERTopic. It would essentially be the same as using save_ctfidf=False.
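A rough sketch of that cleanup, assuming a safetensors save where the vectorizer/c-TF-IDF state lives in ctfidf_config.json and ctfidf.safetensors (the file names and the path are assumptions, so check your model directory first):

```python
import os
from bertopic import BERTopic

model_dir = os.path.join('Result', 'RQ1', 'Model', 'example_model')  # placeholder path

# assumed file names for the c-TF-IDF / vectorizer artifacts of a safetensors save;
# verify against the actual contents of the model directory before deleting anything
for filename in ("ctfidf_config.json", "ctfidf.safetensors"):
    filepath = os.path.join(model_dir, filename)
    if os.path.exists(filepath):
        os.remove(filepath)

# with those files gone, loading no longer rebuilds the TfidfVectorizer,
# which is effectively the same as having saved with save_ctfidf=False
topic_model = BERTopic.load(model_dir, embedding_model='sentence-transformers/all-mpnet-base-v2')
```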
It works, thanks!
When I load the generated BERTopic model, it gives the following error traces:
When I am running the following code: