MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/

Set random seed in `hierarchical_topics`? #1899

Open serenalotreck opened 8 months ago

serenalotreck commented 8 months ago

I've set the random seed when I fit my topic model, and I'm getting reproducible results. I'm using the following:

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

def fit_reduce_model(rep_model, docs):
    """
    Defines all component models internally besides the representation model, which is the only one that changes.
    Pre-calculates embeddings, fits model, and performs outlier reduction.

    parameters:
        rep_model, class instance from bertopic.representation: representation model
        docs, list of str: documents to model

    returns:
        topic_model, BERTopic model: fitted model with outliers reduced
    """
    # Define all component models
    print('Defining component models...')
    sentence_model = SentenceTransformer('allenai/scibert_scivocab_cased')
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
    ## Using default HDBSCAN model, no definition needed
    vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=10)
    representation_model = rep_model

    # Pre-calculate embeddings
    print('Calculating embeddings...')
    embeddings = sentence_model.encode(docs, show_progress_bar=True)
    # Reduce the embeddings up front so we can iterate quickly later on
    # (with n_components=5 this gives 5D, not 2D; reduced_embeddings is not used again below)
    reduced_embeddings = umap_model.fit_transform(embeddings)

    # Fit the model
    print('Fitting model...')
    topic_model = BERTopic(
        embedding_model=sentence_model,
        umap_model=umap_model,
        representation_model=representation_model,
        vectorizer_model=vectorizer_model,
    )
    topics, probs = topic_model.fit_transform(docs, embeddings)

    # Reduce outliers
    print('Reducing outliers...')
    new_topics = topic_model.reduce_outliers(docs, topics, strategy='embeddings', threshold=0.1) # This method ends up reducing all outliers even with this threshold
    topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model, representation_model=representation_model)

    return topic_model

However, when I run the following, I get varied results:

# Fit the model
mmr_rep_model = MaximalMarginalRelevance(diversity=0.3)
mmr_model = fit_reduce_model(mmr_rep_model, docs)

# Generate hierarchical topics
hierarchical_topics = mmr_model.hierarchical_topics(docs)
fig = mmr_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.show()

I don't see a way in the docs to set a random seed for hierarchical_topics; let me know if I've overlooked something!
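For what it's worth, .hierarchical_topics doesn't take a random_state, but (at least in v0.16) it does accept a linkage_function argument, so the clustering step itself can be pinned down explicitly. A minimal sketch, following the scipy-based hook shown in the BERTopic docs; the 'ward' choice below is an illustration, not necessarily the library default:

from scipy.cluster import hierarchy as sch

# Fix the linkage step explicitly; scipy's agglomerative linkage is
# deterministic for a given distance matrix, so no seed is involved here.
linkage_function = lambda x: sch.linkage(x, 'ward', optimal_ordering=True)
hierarchical_topics = mmr_model.hierarchical_topics(docs, linkage_function=linkage_function)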

MaartenGr commented 8 months ago

Just to be sure, do you get varied results out of .hierarchical_topics or out of .visualize_hierarchy? They are different code bases, and your code suggests you get varied results from .visualize_hierarchy rather than from .hierarchical_topics.

serenalotreck commented 8 months ago

Why would .visualize_hierarchy be different if hierarchical_topics is the same, since hierarchical_topics is passed to .visualize_hierarchy? I can go check, but if that is the case, I'd like a way to set a random seed for .visualize_hierarchy!

serenalotreck commented 8 months ago

Ok, I checked, and it is .hierarchical_topics that's giving different results. I saved out the results and read them back in, and when I re-ran .visualize_hierarchy, I got the same visualization.
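A minimal sketch of that save-and-reload workaround, using pickle rather than CSV since hierarchical_topics returns a pandas DataFrame whose Topics column holds Python lists (which a CSV round trip would stringify); the filename is just a placeholder:

import pandas as pd

# Persist the hierarchy DataFrame once...
hierarchical_topics.to_pickle('hierarchical_topics.pkl')

# ...and reload it later to regenerate the exact same figure
hierarchical_topics = pd.read_pickle('hierarchical_topics.pkl')
fig = mmr_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)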

MaartenGr commented 8 months ago

Why would .visualize_hierarchy be different if hierarchical_topics is the same, since hierarchical_topics is passed to .visualize_hierarchy?

They are different code bases, so randomness can appear in either function. There has also been randomness in visualization functions before.

Ok, I checked, and it is .hierarchical_topics that's giving different results. I saved out the results and read them back in, and when I re-ran .visualize_hierarchy, I got the same visualization.

That's good to know! Looking through the code of .hierarchical_topics (assuming you are using v0.16 of BERTopic), I don't see anything that would explain this.

Does this also happen if you run it with 20NewsGroups? Could you create a self-contained reproducible example? That way, I can more easily find the issue.

serenalotreck commented 8 months ago

Here is the code with 20NewsGroups:

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

def fit_reduce_model(rep_model, docs):
    """
    Defines all component models internally besides the representation model, which is the only one that changes.
    Pre-calculates embeddings, fits model, and performs outlier reduction.

    parameters:
        rep_model, class instance from bertopic.representation: representation model
        docs, list of str: documents to model

    returns:
        topic_model, BERTopic model: fitted model with outliers reduced
    """
    # Define all component models
    print('Defining component models...')
    sentence_model = SentenceTransformer('allenai/scibert_scivocab_cased')
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
    ## Using default HDBSCAN model, no definition needed
    vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=10)
    representation_model = rep_model

    # Pre-calculate embeddings
    print('Calculating embeddings...')
    embeddings = sentence_model.encode(docs, show_progress_bar=True)
    # Reduce the embeddings up front so we can iterate quickly later on
    # (with n_components=5 this gives 5D, not 2D; reduced_embeddings is not used again below)
    reduced_embeddings = umap_model.fit_transform(embeddings)

    # Fit the model
    print('Fitting model...')
    topic_model = BERTopic(
        embedding_model=sentence_model,
        umap_model=umap_model,
        representation_model=representation_model,
        vectorizer_model=vectorizer_model,
    )
    topics, probs = topic_model.fit_transform(docs, embeddings)

    # Reduce outliers
    print('Reducing outliers...')
    new_topics = topic_model.reduce_outliers(docs, topics, strategy='embeddings', threshold=0.1) # This method ends up reducing all outliers even with this threshold
    topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model, representation_model=representation_model)

    return topic_model

representation_model = MaximalMarginalRelevance(diversity=0.3)
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]

model = fit_reduce_model(representation_model, docs)

hierarchical_topics = model.hierarchical_topics(docs)
hierarchical_topics.head()

fig = model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.show()

It looks like neither of the two functions introduces randomness for 20NewsGroups (see screenshots below), which is super odd to me, because the only difference between this code and what I did previously is the input docs, and I wouldn't expect that to matter.


Running it the first time:

[screenshot]

Running just the visualization again:

[screenshot]

Running both the hierarchical topic generation and the visualization again:

[screenshot]
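A quicker check than eyeballing the dendrograms is to diff the DataFrames directly; a small sketch along those lines:

# Build the hierarchy twice on the same fitted model and compare
ht1 = model.hierarchical_topics(docs)
ht2 = model.hierarchical_topics(docs)
print(ht1.equals(ht2))  # True if the hierarchy itself is reproducible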

MaartenGr commented 7 months ago

I wouldn't expect the input documents to have this kind of influence; I suspect it stems from a difference in your code or your environment. Did you make sure that the environments you used for your own data and for the 20NewsGroups run are exactly the same? As in, same Python version and same dependency versions (even numpy, numba, pandas, etc.)?
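A quick way to verify that from inside each notebook kernel is to print the versions directly; a minimal sketch (extend the module list as needed):

import sys
import bertopic, hdbscan, numba, numpy, pandas, sklearn, umap

# Print the interpreter version and the version of each package in the stack
print(sys.version)
for mod in (bertopic, hdbscan, numba, numpy, pandas, sklearn, umap):
    print(mod.__name__, mod.__version__)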

serenalotreck commented 7 months ago

Yes, they're the same! I launched a Jupyter notebook instance using the same kernel, which I made from a conda environment, and I haven't changed any packages in that environment between noticing the issue and trying to reproduce it.

MaartenGr commented 7 months ago

In that case, I'm not entirely sure what is happening here. The data should not influence whether something is reproducible; it should not introduce any stochasticity or randomness unless the data itself is random.

serenalotreck commented 7 months ago

Very weird... For now, I'm going to stick with reading the hierarchical topics back in when I need to regenerate the figure; I'll let you know if I figure anything else out. Thanks for your help!