MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.12k stars 763 forks

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType #1723

Closed: themeaningofmeaning closed this issue 9 months ago

themeaningofmeaning commented 10 months ago

.fit_transform() no longer executes, even on datasets used in BERTopic's example scripts. I haven't been able to get BERTopic's pre-trained models such as BERTopic_Wikipedia or BERTopic_ArXiv to .fit_transform() chunks. I get TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType whenever a chunk (i.e. a string) is passed into bertopic_model_wiki.fit_transform([chunk]). In this app, I'm using BERTopic.load("MaartenGr/BERTopic_Wikipedia") for topic modeling on a local document (usually a txt or epub) that is split into chunks of 450 with an overlap of 25. I only want to use BERTopic to generate topics that I will then add as metadata to the embeddings before they are upserted into Pinecone. The chunks themselves are embedded using the sentence-transformers model paraphrase-MiniLM-L6-v2.

After fairly extensive testing, the issue seems to be related to .fit_transform(chunks). I even passed in ['sample text', 'sample text', ...] and it kept returning the same error. However, when I passed non-strings into the function, it returned "Make sure that the documents variable is an iterable containing strings only."

All the logs indicate that the failure point occurs at bertopic_model_wiki.fit_transform([chunk]) where apparently a NoneType continues to be returned even though I have verified that chunk is in fact an array of strings before passing it into bertopic_model_wiki.fit_transform():

import os
from flask import Flask, request, render_template, jsonify
from werkzeug.utils import secure_filename
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import pinecone
from dotenv import load_dotenv
import utils
import logging

# Load environment variables
load_dotenv()

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

# API Keys and Configurations
HUGGINGFACE_API_KEY = os.environ.get('HUGGINGFACE_API_KEY')
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')
PINECONE_ENV = os.environ.get('PINECONE_ENV')
PINECONE_INDEX = os.environ.get('PINECONE_INDEX')
CHUNK_SIZE = int(os.environ.get('CHUNK_SIZE', 450))
OVERLAP = int(os.environ.get('OVERLAP', 25))
app.config['MAX_CONTENT_LENGTH'] = 50 * 1024 * 1024  # 50MB limit

# Model loading
sentence_model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')

try:
    bertopic_model_wiki = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
    logging.info("BERTopic model loaded successfully.")
except Exception as e:
    logging.error(f"Failed to load BERTopic model: {e}")

# Pinecone initialization
pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)
pinecone_index = pinecone.Index(PINECONE_INDEX)

@app.route('/status')
def get_status():
    with open('status.txt', 'r') as file:
        status = file.read()
    return jsonify({'status': status})

@app.route('/', methods=['GET', 'POST'])
def upload_file():
    try:
        if request.method == 'POST':
            uploaded_file = request.files['file']
            if uploaded_file:
                filename = secure_filename(uploaded_file.filename)
                file_path = os.path.join('uploads', filename)
                uploaded_file.save(file_path)

                logging.info(f"File {filename} uploaded successfully")

                # Default to an empty string so a missing form field does not raise AttributeError
                zero_shot_topics = request.form.get('zero_shot_topics', '').split(',')
                logging.info("Received zero shot topic list")

                text = utils.extract_text_from_file(file_path)
                if not text:
                    raise ValueError(f"No text extracted from {filename}")

                chunks = utils.chunk_text(text, CHUNK_SIZE, OVERLAP)
                if not chunks:
                    raise ValueError("No chunks created from the text")

                for i, chunk in enumerate(chunks):
                    logging.info(f"Processing chunk {i+1}")
                    embedding = sentence_model.encode([chunk])
                    topics, _ = bertopic_model_wiki.fit_transform([chunk])

                    if topics is None or len(topics) == 0:
                        logging.warning(f"No topics generated for chunk {i+1}, skipping.")
                        continue

                    data_to_upsert = {
                        "text": chunk,
                        "embedding": embedding.tolist(),
                        "topics": topics
                    }
                    pinecone_index.upsert({filename: data_to_upsert})
                    logging.info(f"Chunk {i+1} of {filename} upserted to Pinecone")

                os.remove(file_path)
                return render_template('index.html', status="Processing and Upsert Complete")
            else:
                raise ValueError("No file uploaded")
    except Exception as e:
        error_message = "An error occurred: " + str(e)
        # `i` is undefined if the failure happens before the chunk loop, so do not reference it here
        logging.error(f"Error while processing upload: {e}")
        return render_template('index.html', status=error_message)

    return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True)
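
The app above calls utils.chunk_text, which is not shown in this thread. For readers who want to reproduce the setup, here is a minimal sketch of what such a helper might look like. It assumes the chunk size and overlap are counted in whitespace-separated words; the original helper may well count characters or tokens instead, so treat the function body as a hypothetical stand-in.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 450, overlap: int = 25) -> List[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical stand-in for utils.chunk_text (the real helper is not shown).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for piece in chunk_text(sample, chunk_size=4, overlap=1):
        print(piece)
```

The helper returns a plain list of strings, which is the input type that BERTopic's document validation expects.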

As a second test in a new environment (to make sure it wasn't my app or env), I also attempted using a fresh instance of BERTopic without any pre-trained model and still get the error:

# Create a new instance of BERTopic
fresh_bertopic_model = BERTopic(verbose=True)

# Test string
test_string = "This is a simple test to check the BERTopic model functionality."
logging.info(f"test string is a {type(test_string)} and contains {test_string}")

# Run BERTopic on the test string
try:
    test_topics, _ = fresh_bertopic_model.fit_transform([test_string])
    logging.info(f"Test topics: {test_topics}")
except Exception as e:
    logging.error(f"Error during BERTopic test with fresh model: {e}")

For this test, the console logging shows:

INFO:root:test string is a <class 'str'> and contains This is a simple test to check the BERTopic model functionality.
2024-01-04 19:12:31,597 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|███████████████████████████████████| 1/1 [00:00<00:00, 129.91it/s]
2024-01-04 19:12:31,605 - BERTopic - Embedding - Completed ✓
2024-01-04 19:12:31,606 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-01-04 19:12:31,606 - BERTopic - Dimensionality - Completed ✓
2024-01-04 19:12:31,606 - BERTopic - Cluster - Start clustering the reduced embeddings
ERROR:root:Error during BERTopic test: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Packages in both of my clean virtual environments are:

Package              Version
--------------------- ------------
aiohttp               3.9.1
aiosignal             1.3.1
annotated-types       0.6.0
anyio                 4.2.0
async-timeout         4.0.3
attrs                 23.2.0
bertopic              0.16.0
blinker               1.7.0
certifi               2023.11.17
charset-normalizer    3.3.2
click                 8.1.7
Cython                0.29.37
dataclasses-json      0.6.3
dnspython             2.4.2
EbookLib              0.18
exceptiongroup        1.2.0
filelock              3.13.1
Flask                 3.0.0
frozenlist            1.4.1
fsspec                2023.12.2
gunicorn              21.2.0
hdbscan               0.8.33
huggingface-hub       0.20.1
idna                  3.6
importlib-metadata    7.0.1
itsdangerous          2.1.2
Jinja2                3.1.2
joblib                1.3.2
jsonpatch             1.33
jsonpointer           2.4
langchain             0.0.354
langchain-community   0.0.8
langchain-core        0.1.6
langsmith             0.0.77
llvmlite              0.41.1
loguru                0.7.2
lxml                  5.0.0
MarkupSafe            2.1.3
marshmallow           3.20.1
mpmath                1.3.0
multidict             6.0.4
mypy-extensions       1.0.0
networkx              3.2.1
nltk                  3.8.1
numba                 0.58.1
numpy                 1.26.3
packaging             23.2
pandas                2.1.4
pillow                10.2.0
pinecone-client       2.2.4
pip                   23.3.2
plotly                5.18.0
pydantic              2.5.3
pydantic_core         2.14.6
pynndescent           0.5.11
PyPDF2                3.0.1
python-dateutil       2.8.2
python-docx           1.1.0
python-dotenv         1.0.0
pytz                  2023.3.post1
PyYAML                6.0.1
regex                 2023.12.25
requests              2.31.0
safetensors           0.4.1
scikit-learn          1.3.2
scipy                 1.11.4
sentence-transformers 2.2.2
sentencepiece         0.1.99
setuptools            58.0.4
six                   1.16.0
sniffio               1.3.0
SQLAlchemy            2.0.25
sympy                 1.12
tenacity              8.2.3
threadpoolctl         3.2.0
tiktoken              0.5.2
tokenizers            0.15.0
torch                 2.1.2
torchvision           0.16.2
tqdm                  4.66.1
transformers          4.36.2
typing_extensions     4.9.0
typing-inspect        0.9.0
tzdata                2023.4
umap-learn            0.5.5
urllib3               2.1.0
Werkzeug              3.0.1
yarl                  1.9.4
zipp                  3.17.0

themeaningofmeaning commented 10 months ago

I ran another quick test to verify that BERTopic itself works, while topic modeling with one of the pre-trained models produces the same error as detailed in the OP.

Step 1. Install only the required dependencies:

pip install bertopic
pip install sentence-transformers
pip install safetensors

Step 2. Verify that BERTopic works, which it does:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Fetch a small dataset
docs = fetch_20newsgroups(subset='all')['data'][:100]  # Only take 100 documents for a quick test

# Create and fit the BERTopic model
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)

# Display the generated topics
for topic in topic_model.get_topic_info().to_dict('records'):
    print(topic)

Step 3. Update the code above to use a pre-trained model for topic modeling. As expected, it produces the same error I posted in the OP: TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

import os
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import logging

# Initialize logging
logging.basicConfig(level=logging.INFO)

# Load a pretrained BERTopic model
try:
    bertopic_model_wiki = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
    logging.info("BERTopic model loaded successfully.")
except Exception as e:
    logging.error(f"Failed to load BERTopic model: {e}")
    exit(1)

# Sample text chunks for testing
chunks = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial Intelligence has transformed many industries.",
    "The economic impact of global warming is significant."
]

# Generate topics
try:
    topics, probs = bertopic_model_wiki.fit_transform(chunks)
    for i, topic in enumerate(topics):
        logging.info(f"Chunk {i+1}: '{chunks[i]}' --> Topic: {topic}")
except Exception as e:
    logging.error(f"Error generating topics: {e}")

MaartenGr commented 10 months ago

That is definitely to be expected! When you save a BERTopic model using either safetensors or pytorch, the underlying UMAP and HDBSCAN models are removed. This makes the saved model much smaller and speeds up inference considerably.

When you load the model, there are no UMAP or HDBSCAN models available for .fit_transform to use, but there is also really no use case for calling it. Running .fit_transform overwrites the entire BERTopic model. This means that when you run .fit_transform twice, the second run completely overwrites the result of the first.

To illustrate, the following will load a pre-trained model:

bertopic_model_wiki = BERTopic.load("MaartenGr/BERTopic_Wikipedia")

This model is pre-trained on a specific dataset. When you run the following:

topics, probs = bertopic_model_wiki.fit_transform(chunks)

You are starting completely from scratch (which has always been the functionality of any .fit function) and essentially throwing away the loaded model. You are not fine-tuning the model using .fit_transform here.
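
To make the fit versus transform distinction concrete, here is a small toy sketch. ToyTopicModel is a hypothetical stand-in written for this illustration and is not how BERTopic is implemented; it only mimics the convention that fit replaces the learned state while transform merely reads it.

```python
class ToyTopicModel:
    """Toy estimator illustrating fit/transform semantics (not BERTopic itself)."""

    def __init__(self):
        self.vocabulary_ = None  # learned state, set by fit()

    def fit(self, docs):
        # fit() discards whatever was learned before and starts from scratch.
        self.vocabulary_ = {word.lower() for doc in docs for word in doc.split()}
        return self

    def transform(self, docs):
        # transform() only reads the learned state; it never modifies it.
        if self.vocabulary_ is None:
            raise RuntimeError("No learned state; call fit() or load a trained model.")
        return [sum(w.lower() in self.vocabulary_ for w in doc.split()) for doc in docs]

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


if __name__ == "__main__":
    pretrained = ToyTopicModel().fit(
        ["wikipedia article about physics", "wikipedia article about history"]
    )
    before = set(pretrained.vocabulary_)

    pretrained.transform(["a short note about physics"])  # inference only
    assert pretrained.vocabulary_ == before               # state is untouched

    pretrained.fit_transform(["one unrelated chunk"])     # retrains from scratch
    assert "physics" not in pretrained.vocabulary_        # old state is gone
```

In the same spirit, calling transform on a loaded pre-trained model keeps its topics, whereas fit_transform would try to rebuild everything from the new documents.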

themeaningofmeaning commented 9 months ago

@MaartenGr Thank you for the fast reply and the thorough explanation! That makes a lot of sense. I now use the transform method to assign topics to the chunks, which returns a topic ID for each chunk. The get_topic method can then be used to retrieve the description of a topic from its ID as needed. All is working now.

import os
from bertopic import BERTopic
import logging

# Initialize logging
logging.basicConfig(level=logging.INFO)

# Load a pretrained BERTopic model
try:
    bertopic_model_wiki = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
    logging.info("BERTopic model loaded successfully.")
except Exception as e:
    logging.error(f"Failed to load BERTopic model: {e}")
    exit(1)

# Sample text chunks for testing
chunks = [
    'The quick brown fox jumps over the lazy dog.',
    'Artificial Intelligence has transformed many industries.',
    'The economic impact of global warming is significant.'
]

# Inference: Assign topics to new documents
try:
    topics, probs = bertopic_model_wiki.transform(chunks)
    for i, topic in enumerate(topics):
        logging.info(f"Chunk {i+1}: '{chunks[i]}' --> Assigned Topic ID: {topic}")

        # To get the topic description, use the get_topic method
        topic_description = bertopic_model_wiki.get_topic(topic)
        logging.info(f"Topic Description for ID {topic}: {topic_description}")

except Exception as e:
    logging.error(f"Error during topic inference: {e}")
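
Since the original goal was to attach topics as metadata to each chunk's embedding, here is a rough sketch of how the output of get_topic could be flattened into a metadata dictionary. It assumes get_topic returns a list of (word, score) pairs, or a falsy value when the ID is unknown; the helper name and the metadata field names are hypothetical and not part of BERTopic or Pinecone.

```python
def topic_to_metadata(topic_id, topic_words, top_n=5):
    """Flatten a topic ID and its (word, score) pairs into a small metadata dict.

    Hypothetical helper: the field names below are illustrative only.
    """
    if not topic_words:  # e.g. the ID is not known to the model
        return {"topic_id": int(topic_id), "topic_keywords": []}
    # Keep only the top_n words and skip empty strings used as padding
    keywords = [word for word, _score in topic_words[:top_n] if word]
    return {"topic_id": int(topic_id), "topic_keywords": keywords}


if __name__ == "__main__":
    # Made-up (word, score) pairs standing in for a get_topic(...) result
    example_words = [("energy", 0.05), ("solar", 0.04), ("power", 0.03)]
    print(topic_to_metadata(12, example_words, top_n=2))
```

Note that BERTopic assigns the ID -1 to outlier documents, so it may be worth deciding up front whether chunks with that ID should carry topic metadata at all.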

MaartenGr commented 9 months ago

Great! Glad to hear that the issue is resolved.