MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

AttributeError: Can't get attribute 'EuclideanDistance64' on <module 'sklearn.metrics._dist_metrics' #1450

Open zhimin-z opened 1 year ago

zhimin-z commented 1 year ago

When I load the generated BERTopic model, it gives the following error trace:

/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/umap_.py:660: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
  arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
Traceback (most recent call last):
  File "/home/21zz42/Asset-Management-Topic-Modeling/Code/RQ1/best_model.py", line 24, in <module>
    topic_model = BERTopic.load(os.path.join(path_model, model_name))
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 2998, in load
    topic_model = joblib.load(file)
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/joblib/numpy_pickle.py", line 648, in load
    obj = _unpickle(fobj)
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/joblib/numpy_pickle.py", line 577, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.10/pickle.py", line 1213, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.10/pickle.py", line 1538, in load_stack_global
    self.append(self.find_class(module, name))
  File "/usr/lib/python3.10/pickle.py", line 1582, in find_class
    return _getattribute(sys.modules[module], name)[0]
  File "/usr/lib/python3.10/pickle.py", line 331, in _getattribute
    raise AttributeError("Can't get attribute {!r} on {!r}"
AttributeError: Can't get attribute 'EuclideanDistance64' on <module 'sklearn.metrics._dist_metrics' from '/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/sklearn/metrics/_dist_metrics.cpython-310-x86_64-linux-gnu.so'>

This happens when I run the following code:

import os
import pickle
import pandas as pd

from bertopic import BERTopic

path_rq1 = os.path.join('Result', 'RQ1')
path_model = os.path.join(path_rq1, 'Model')

model_name = 'Challenge_preprocessed_gpt_summary_fzqzh0m6'
column = '_'.join(model_name.split('_')[:-1])

df = pd.read_json(os.path.join('Dataset', 'preprocessed.json'))
df['Challenge_topic'] = -1

indice = []
docs = []

for index, row in df.iterrows():
    if pd.notna(row[column]) and len(row[column]):
        indice.append(index)
        docs.append(row[column])

topic_model = BERTopic.load(os.path.join(path_model, model_name))
topic_number = topic_model.get_topic_info().shape[0] - 1
topics, probs = topic_model.transform(docs)

# persist the topic terms
with open(os.path.join(path_rq1, 'Topic terms.pickle'), 'wb') as handle:
    topic_terms = []
    for i in range(topic_number):
        topic_terms.append(topic_model.get_topic(i))
    pickle.dump(topic_terms, handle, protocol=pickle.HIGHEST_PROTOCOL)

fig = topic_model.visualize_topics()
fig.write_html(os.path.join(path_rq1, 'Topic visualization.html'))

fig = topic_model.visualize_barchart(top_n_topics=topic_number, n_words=10)
fig.write_html(os.path.join(path_rq1, 'Term visualization.html'))

fig = topic_model.visualize_heatmap()
fig.write_html(os.path.join(path_rq1, 'Topic similarity visualization.html'))

# This uses the soft-clustering as performed by HDBSCAN to find the best matching topic for each outlier document.
topics_new = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities")

# persist the document topics
for index, topic in zip(indice, topics_new):
    df.at[index, 'Challenge_topic'] = topic

df = df[df.columns.drop(list(df.filter(regex=r'preprocessed|gpt_summary')))]
df.to_json(os.path.join(path_rq1, 'topics.json'), indent=4, orient='records')
zhimin-z commented 1 year ago

I am using bertopic==0.15.0, so I think it is related to this breaking change: https://stackoverflow.com/questions/76631305/attributeerror-cant-get-attribute-euclideandistance-on-module-sklearn-metr For those who still use scikit-learn==1.2.2, loading works fine. @MaartenGr
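
If the model was pickled under scikit-learn 1.2.x, one stopgap (a sketch based on the observation above, not an official fix) is to pin that version in the loading environment:

scikit-learn==1.2.2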

zhimin-z commented 1 year ago

However, if I upgrade to scikit-learn==1.3.0, it gives another error instead:

(.venv) 21zz42@docjk-gpu-02:~/Asset-Management-Topic-Modeling$ python /home/21zz42/Asset-Management-Topic-Modeling/Code/RQ1/best_model.py
Traceback (most recent call last):
  File "/home/21zz42/Asset-Management-Topic-Modeling/Code/RQ1/best_model.py", line 5, in <module>
    from bertopic import BERTopic
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/bertopic/__init__.py", line 1, in <module>
    from bertopic._bertopic import BERTopic
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 37, in <module>
    import hdbscan
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/hdbscan/__init__.py", line 1, in <module>
    from .hdbscan_ import HDBSCAN, hdbscan
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/hdbscan/hdbscan_.py", line 40, in <module>
    FAST_METRICS = KDTree.valid_metrics + BallTree.valid_metrics + ["cosine", "arccos"]
TypeError: unsupported operand type(s) for +: 'builtin_function_or_method' and 'builtin_function_or_method'
zhimin-z commented 1 year ago

I found that the only solution is to upgrade scikit-learn and hdbscan simultaneously. Would you mind making a new release accordingly?
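
For reference, the version pair that ended up working together here is the one pinned in the requirements shared later in this thread:

scikit-learn==1.3.0
hdbscan==0.8.33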

MaartenGr commented 1 year ago

It seems that this is a known issue for HDBSCAN which should already be fixed in their main branch. There is a new version of HDBSCAN, but there are some commits after that. I believe this mostly relates to version controlling your environment when you pickle BERTopic. When using BERTopic v0.15, it is highly advised to use either pytorch or safetensors to save the model. This is more robust to changing environments and corresponding dependencies.
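
As a minimal sketch, assuming topic_model is an already fitted BERTopic instance (the path is a placeholder; the same calls appear later in this thread):

from bertopic import BERTopic

embedding_model = "sentence-transformers/all-mpnet-base-v2"

# Save with safetensors instead of pickle; passing the embedding model by name
# keeps it out of the serialized files
topic_model.save("path/to/my_model", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Load it back, supplying the embedding model again
loaded_model = BERTopic.load("path/to/my_model", embedding_model=embedding_model)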

With respect to a new release, I think it is best to wait until HDBSCAN is a bit more stable seeing as there are still some individuals experiencing some issues.

zhimin-z commented 1 year ago

How about adding a safetensors dependency to BERTopic? Whenever I use it, it reports a missing package.

zhimin-z commented 1 year ago

After installing the safetensors dependency, saving the model as safetensors, and loading it again, it shows the following error:

/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/umap/umap_.py:660: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:557: RuntimeWarning: invalid value encountered in cast
  arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
Traceback (most recent call last):
  File "/home/21zz42/Asset-Management-Topic-Modeling/Code/RQ1/best_model.py", line 24, in <module>
    topic_model = BERTopic.load(os.path.join(path_model, model_name))
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 3008, in load
    topic_model = _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
  File "/home/21zz42/Asset-Management-Topic-Modeling/.venv/lib/python3.10/site-packages/bertopic/_bertopic.py", line 4024, in _create_model_from_files
    topic_model.vectorizer_model = CountVectorizer(**ctfidf_config["vectorizer_model"]["params"])
TypeError: CountVectorizer.__init__() got an unexpected keyword argument 'norm'
MaartenGr commented 1 year ago

@zhimin-z Could you share your full code including how you saved and loaded the model? Also, are the environments in any way different between saving and loading the model?

zhimin-z commented 1 year ago

> @zhimin-z Could you share your full code including how you saved and loaded the model? Also, are the environments in any way different between saving and loading the model?

Sure. This is the code for saving the model in the hyperparameter sweep:

import gensim.corpora as corpora
import pandas as pd
import wandb
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.coherencemodel import CoherenceModel
# from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

path_dataset = 'Dataset'
path_model = os.path.join('Result', 'RQ1', 'Model')
if not os.path.exists(path_model):
    os.makedirs(path_model)

wandb_project = 'asset-management-topic-modeling'

os.environ["WANDB_API_KEY"] = 'xxxxxxxx'
os.environ["TOKENIZERS_PARALLELISM"] = "true"
os.environ["WANDB__SERVICE_WAIT"] = "100"

# set default sweep configuration
config_defaults = {
    # Refer to https://www.sbert.net/docs/pretrained_models.html
    'model_name': 'sentence-transformers/all-mpnet-base-v2',
    'metric_distance': 'cosine',
    'calculate_probabilities': True,
    # 'reduce_frequent_words': True,
    'prediction_data': True,
    'low_memory': False,
    'min_cluster_size': 50,
    'random_state': 42,
    'ngram_range': 2
}

config_sweep = {
    'method': 'grid',
    'metric': {
        'name': 'Coherence CV',
        'goal': 'maximize',
    },
    'parameters': {
        'n_components': {
            'values': list(range(3,6)),
        },
    }
}

class TopicModeling:
    def __init__(self, column_name):
        # Initialize an empty list to store top models
        self.top_models = []
        self.path_model = path_model

        df = pd.read_json(os.path.join(path_dataset, 'preprocessed.json'))
        self.docs = df[df[column_name].map(len) > 0][column_name].tolist()

        config_sweep['name'] = column_name
        config_sweep['parameters']['min_samples'] = {
            'values': list(range(1, config_defaults['min_cluster_size'] + 1)),
        }

    def __train(self):
        # Initialize a new wandb run
        with wandb.init() as run:
            # update any values not set by sweep
            run.config.setdefaults(config_defaults)

            # Step 1 - Extract embeddings
            embedding_model = SentenceTransformer(run.config.model_name)

            # Step 2 - Reduce dimensionality
            umap_model = UMAP(n_components=run.config.n_components, metric=run.config.metric_distance, random_state=run.config.random_state, low_memory=run.config.low_memory)

            # Step 3 - Cluster reduced embeddings
            hdbscan_model = HDBSCAN(min_cluster_size=run.config.min_cluster_size, min_samples=run.config.min_samples, prediction_data=run.config.prediction_data)

            # Step 4 - Tokenize topics
            vectorizer_model = TfidfVectorizer(ngram_range=(1, run.config.ngram_range))

            # Step 5 - Create topic representation
            # ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=run.config.reduce_frequent_words)

            # Step 6 - Fine-tune topic representation
            representation_model = KeyBERTInspired()

            # All steps together
            topic_model = BERTopic(
                embedding_model=embedding_model,
                umap_model=umap_model,
                hdbscan_model=hdbscan_model,
                vectorizer_model=vectorizer_model,
                # ctfidf_model=ctfidf_model,
                representation_model=representation_model,
                calculate_probabilities=run.config.calculate_probabilities
            )

            topics, _ = topic_model.fit_transform(self.docs)

            # Preprocess Documents
            documents = pd.DataFrame({
                "Document": self.docs,
                "ID": range(len(self.docs)),
                "Topic": topics
            })
            documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
            cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

            # Extract vectorizer and analyzer from BERTopic
            vectorizer = topic_model.vectorizer_model
            analyzer = vectorizer.build_analyzer()

            # Extract features for Topic Coherence evaluation
            tokens = [analyzer(doc) for doc in cleaned_docs]
            dictionary = corpora.Dictionary(tokens)
            corpus = [dictionary.doc2bow(token) for token in tokens]
            topic_words = [[words for words, _ in topic_model.get_topic(topic)] for topic in range(len(set(topics))-1)]

            # Compute topic coherence with four measures and log each to wandb
            coherence_measures = {
                'Coherence CV': 'c_v',
                'Coherence UMASS': 'u_mass',
                'Coherence UCI': 'c_uci',
                'Coherence NPMI': 'c_npmi',
            }
            for metric_name, coherence in coherence_measures.items():
                score = CoherenceModel(
                    topics=topic_words,
                    texts=tokens,
                    corpus=corpus,
                    dictionary=dictionary,
                    coherence=coherence
                ).get_coherence()
                wandb.log({metric_name: score})
            wandb.log({'Topic Number': topic_model.get_topic_info().shape[0] - 1})
            wandb.log({'Uncategorized Post Number': topic_model.get_topic_info().at[0, 'Count']})

            model_name = f'{config_sweep["name"]}_{run.id}'
            topic_model.save(os.path.join(self.path_model, model_name), serialization="safetensors", save_ctfidf=True, save_embedding_model=config_defaults['model_name'])

    def sweep(self):
        wandb.login()
        sweep_id = wandb.sweep(config_sweep, project=wandb_project)
        wandb.agent(sweep_id, function=self.__train)

and this is for loading the model with the highest coherence CV score:

import os
import pickle
import pandas as pd

from bertopic import BERTopic

path_rq1 = os.path.join('Result', 'RQ1')
path_model = os.path.join(path_rq1, 'Model')

embedding_model = 'sentence-transformers/all-mpnet-base-v2'
model_name = 'Challenge_preprocessed_gpt_summary_7of8v67c'
column = '_'.join(model_name.split('_')[:-1])

df = pd.read_json(os.path.join('Dataset', 'preprocessed.json'))
df['Challenge_topic'] = -1

indice = []
docs = []

for index, row in df.iterrows():
    if pd.notna(row[column]) and len(row[column]):
        indice.append(index)
        docs.append(row[column])

topic_model = BERTopic.load(os.path.join(path_model, model_name), embedding_model=embedding_model)
topic_number = topic_model.get_topic_info().shape[0] - 1
topics, probs = topic_model.transform(docs)

# persist the topic terms
with open(os.path.join(path_rq1, 'Topic terms.pickle'), 'wb') as handle:
    topic_terms = []
    for i in range(topic_number):
        topic_terms.append(topic_model.get_topic(i))
    pickle.dump(topic_terms, handle, protocol=pickle.HIGHEST_PROTOCOL)

fig = topic_model.visualize_topics()
fig.write_html(os.path.join(path_rq1, 'Topic visualization.html'))

fig = topic_model.visualize_barchart(top_n_topics=topic_number, n_words=10)
fig.write_html(os.path.join(path_rq1, 'Term visualization.html'))

fig = topic_model.visualize_heatmap()
fig.write_html(os.path.join(path_rq1, 'Topic similarity visualization.html'))

# This uses the soft-clustering as performed by HDBSCAN to find the best matching topic for each outlier document.
topics_new = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities")

# persist the document topics
for index, topic in zip(indice, topics_new):
    df.at[index, 'Challenge_topic'] = topic

df = df[df.columns.drop(list(df.filter(regex=r'preprocessed|gpt_summary')))]
df.to_json(os.path.join(path_rq1, 'topics.json'), indent=4, orient='records')

and my requirements:

bertopic==0.15.0
gensim==4.3.1
safetensors==0.3.1
wandb==0.15.8
hdbscan==0.8.33
scikit-learn==1.3.0

I am sure I have not touched any dependency between saving and loading the model, so I suspect there is some breaking change in the requirements. @MaartenGr

zhimin-z commented 1 year ago

I tried the pytorch serialization mode to save/load the model; saving works fine, but loading never works: TypeError: CountVectorizer.__init__() got an unexpected keyword argument 'norm'

The only option that works now with my requirements is the default save of BERTopic.

MaartenGr commented 1 year ago

Strange, I can't seem to reproduce the issue. I am also surprised that there seems to be a 'norm' argument in the config. Could you show what is inside ctfidf_config.json? You can skip over the vocab, only the beginning is of interest here (ctfidf_model and vectorizer_model).

Aside from the above, it seems that it should work if you forego save_ctfidf=True, but that is of course not the most ideal solution.
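
A sketch of that workaround, reusing the save call from the sweep script above with only save_ctfidf flipped:

topic_model.save(os.path.join(self.path_model, model_name), serialization="safetensors", save_ctfidf=False, save_embedding_model=config_defaults['model_name'])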

zhimin-z commented 1 year ago

ctfidf_config.json is too large to share in full; here is the beginning of the file:

{
  "ctfidf_model": {
    "bm25_weighting": false,
    "reduce_frequent_words": false
  },
  "vectorizer_model": {
    "params": {
      "analyzer": "word",
      "binary": false,
      "decode_error": "strict",
      "encoding": "utf-8",
      "input": "content",
      "lowercase": true,
      "max_df": 1.0,
      "max_features": null,
      "min_df": 1,
      "ngram_range": [
        1,
        2
      ],
      "norm": "l2",
      "smooth_idf": true,
      "stop_words": null,
      "strip_accents": null,
      "sublinear_tf": false,
      "token_pattern": "(?u)\\b\\w\\w+\\b",
      "use_idf": true,
      "vocabulary": null
    },
    "vocab": {
      "model": 80843,
      "logs": 75753,
      "reports": 112906,
      "insights": 62909,
....
MaartenGr commented 1 year ago
  "norm": "l2",

Strange, that parameter should not be in the CountVectorizer class at all. Ah, now I see: you are using TfidfVectorizer instead of CountVectorizer. You should use CountVectorizer instead; that should solve your issue.
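
A minimal sketch of that fix in the sweep script, where only the vectorizer step changes (ngram_range still comes from the sweep config):

from sklearn.feature_extraction.text import CountVectorizer

# Step 4 - Tokenize topics with a plain bag-of-words vectorizer;
# c-TF-IDF is applied on top of it, so no TF-IDF weighting is needed here
vectorizer_model = CountVectorizer(ngram_range=(1, run.config.ngram_range))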

zhimin-z commented 1 year ago
  "norm": "l2",

Strange, that parameter should not be in the CountVectorizer class at all. Ah, now I see, you are using TfidfVectorizer instead of CountVectorizer. You should use CountVectorizer instead, that should solve your issue.

Thanks for your fast reply, @MaartenGr. I have some questions now:

  1. Is TfidfVectorizer better than CountVectorizer in terms of preprocessing?
  2. If I already trained the model with TfidfVectorizer, is there any way to load the trained model afterward with safetensors?
MaartenGr commented 1 year ago

> Is TfidfVectorizer better than CountVectorizer in terms of preprocessing?

You should actually not use the TfidfVectorizer since c-TF-IDF is applied on top of the vectorizer, which is expected to be a plain bag-of-words.

> If I already trained the model with TfidfVectorizer, is there any way to load the trained model afterward with safetensors?

I think you can load the model if you remove all files belonging to the TfidfVectorizer. This would, however, create a more limited version of BERTopic. It would essentially be the same as using save_ctfidf=False.
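
A sketch of that cleanup; ctfidf_config.json is the file named above, while the ctfidf tensor file name is an assumption that may differ per serialization backend:

import os

model_dir = 'path/to/saved/model'  # placeholder for the saved model directory
for fname in ('ctfidf_config.json', 'ctfidf.safetensors'):
    fpath = os.path.join(model_dir, fname)
    if os.path.exists(fpath):
        # removing the c-TF-IDF files makes loading behave like save_ctfidf=False
        os.remove(fpath)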

zhimin-z commented 1 year ago

It works, thanks!