I made another quick test to verify that BERTopic itself works, while topic modeling with one of the pre-trained models produces the same error as detailed in the OP.
Step 1. Install only the required dependencies:
pip install bertopic
pip install sentence-transformers
pip install safetensors
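(To double-check what actually landed in the clean environment, which is relevant to the package list at the end of this issue, standard pip commands can report the installed versions:)
pip show bertopic sentence-transformers safetensors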
Step 2. Verify that BERTopic works, which it does:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Fetch a small dataset
docs = fetch_20newsgroups(subset='all')['data'][:100]  # Only take 100 documents for a quick test

# Create and fit the BERTopic model
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)

# Display the generated topics
for topic in topic_model.get_topic_info().to_dict('records'):
    print(topic)
Step 3. Update the code above to use a pre-trained model for topic modeling. As expected, it produces the same error I posted in the OP: `TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType`
from bertopic import BERTopic
import logging

# Initialize logging
logging.basicConfig(level=logging.INFO)

# Load a pretrained BERTopic model
try:
    bertopic_model_wiki = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
    logging.info("BERTopic model loaded successfully.")
except Exception as e:
    logging.error(f"Failed to load BERTopic model: {e}")
    exit(1)

# Sample text chunks for testing
chunks = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial Intelligence has transformed many industries.",
    "The economic impact of global warming is significant."
]

# Generate topics
try:
    topics, probs = bertopic_model_wiki.fit_transform(chunks)
    for i, topic in enumerate(topics):
        logging.info(f"Chunk {i+1}: '{chunks[i]}' --> Topic: {topic}")
except Exception as e:
    logging.error(f"Error generating topics: {e}")
That is definitely to be expected! When you save a BERTopic model using either `safetensors` or `pytorch` serialization, it removes the underlying UMAP and HDBSCAN models. This compresses the saved model significantly and gives a major speed-up in inference. When you load the model, there are no sub-models to use for `.fit_transform`, but there is also really no use case to do so. Running `.fit_transform` overwrites the entire BERTopic model, which means that when you run `.fit_transform` twice, the second run completely overrides the first.
To illustrate, the following will load a pre-trained model:
bertopic_model_wiki = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
This model is pre-trained on a specific dataset. When you run the following:
topics, probs = bertopic_model_wiki.fit_transform(chunks)
you are starting completely from scratch (which has always been the functionality of any `.fit` function) and essentially throwing away the loaded model. You are not fine-tuning the model using `.fit_transform` here.
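As a rough sketch of that save-time tradeoff (the directory name below is a placeholder, and the embedding-model pointer follows the pattern from the serialization docs):
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all')['data'][:100]
topic_model = BERTopic().fit(docs)

# Lightweight save: drops the UMAP/HDBSCAN sub-models ("my_model_dir" is a placeholder path);
# the string pointer lets the reloaded model re-fetch its embedding model for .transform
topic_model.save("my_model_dir", serialization="safetensors", save_ctfidf=True,
                 save_embedding_model="sentence-transformers/all-MiniLM-L6-v2")

# The reloaded model supports .transform for inference, but .fit_transform would retrain from scratch
loaded = BERTopic.load("my_model_dir")
topics, probs = loaded.transform(["Artificial Intelligence has transformed many industries."])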
@MaartenGr Thank you for the fast reply and the thorough explanation! That makes a lot of sense. I used the transform method to assign topics to the chunks, which now returns the topic IDs for each chunk. One can also use the get_topic method to retrieve a topic's description from its ID as needed. All is working now.
from bertopic import BERTopic
import logging

# Initialize logging
logging.basicConfig(level=logging.INFO)

# Load a pretrained BERTopic model
try:
    bertopic_model_wiki = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
    logging.info("BERTopic model loaded successfully.")
except Exception as e:
    logging.error(f"Failed to load BERTopic model: {e}")
    exit(1)

# Sample text chunks for testing
chunks = [
    'The quick brown fox jumps over the lazy dog.',
    'Artificial Intelligence has transformed many industries.',
    'The economic impact of global warming is significant.'
]

# Inference: Assign topics to new documents
try:
    topics, probs = bertopic_model_wiki.transform(chunks)
    for i, topic in enumerate(topics):
        logging.info(f"Chunk {i+1}: '{chunks[i]}' --> Assigned Topic ID: {topic}")
        # To get the topic description, use the get_topic method
        topic_description = bertopic_model_wiki.get_topic(topic)
        logging.info(f"Topic Description for ID {topic}: {topic_description}")
except Exception as e:
    logging.error(f"Error during topic inference: {e}")
Great! Glad to hear that the issue is resolved.
`.fit_transform()` will no longer execute, even on datasets used in BERTopic's example scripts. I haven't been able to get BERTopic's pre-trained models like BERTopic_Wikipedia or BERTopic_ArXiv to `.fit_transform()` chunks. I've been getting `TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType` whenever a chunk (i.e. strings) is passed into `bertopic_model_wiki.fit_transform([chunk])`.
In this app, I'm using `BERTopic.load("MaartenGr/BERTopic_Wikipedia")` for topic modeling on a local document (usually a txt or epub) that is split into chunks of 450 with an overlap of 25. I only want to use BERTopic to generate topics that I will then add as metadata onto the embeddings before they are upserted into Pinecone; the chunks themselves are being embedded using the sentence-transformers model paraphrase-MiniLM-L6-v2.
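For context, this is roughly the intended pipeline once `.transform` is used for inference (as resolved above); the Pinecone index name, API key, and ID scheme below are illustrative placeholders, not my actual setup:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone  # assumes the v3+ pinecone client

bertopic_model_wiki = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
embedder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

chunks = ["..."]  # the 450-size chunks with an overlap of 25 from the local document

# Inference only: .transform assigns topics without retraining the loaded model
topics, _ = bertopic_model_wiki.transform(chunks)
embeddings = embedder.encode(chunks)

# Attach each chunk's topic ID as metadata and upsert (index name and key are placeholders)
index = Pinecone(api_key="YOUR_API_KEY").Index("docs-index")
index.upsert(vectors=[
    {"id": f"chunk-{i}", "values": emb.tolist(), "metadata": {"topic": int(t)}}
    for i, (emb, t) in enumerate(zip(embeddings, topics))
])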
After pretty extensive testing, it seems like the issue is related to `.fit_transform(chunks)`. I even passed in `['sample text', 'sample text', ...]` and it kept returning the same error. However, when I passed non-strings into the function, it returned "Make sure that the documents variable is an iterable containing strings only." All the logs indicate that the failure point occurs at `bertopic_model_wiki.fit_transform([chunk])`, where apparently a NoneType continues to be returned even though I have verified that `chunk` is in fact an array of strings before passing it into `bertopic_model_wiki.fit_transform()`:
As a second test in a new environment (to make sure it wasn't my app or env), I also attempted using a fresh instance of BERTopic without any pre-trained model and still got the error:
For this test, the console logging shows:
Packages in both of my clean virtual environments are: