Open · serenalotreck opened 8 months ago

I've set the random seed when I fit my topic model, and I'm getting reproducible results. I'm using the following:

However, when I run the following, I get varied results:

I don't see a way in the docs to set a random seed for `hierarchical_topics`; let me know if I've overlooked something!
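For context, a minimal sketch of this kind of setup (not the poster's actual code; it assumes the usual pattern of pinning UMAP's `random_state` so the fit itself is reproducible):

```python
# Minimal sketch (not the poster's actual code): the BERTopic fit is typically
# made reproducible by pinning UMAP's random_state, but .hierarchical_topics()
# does not document an equivalent seed parameter.
from bertopic import BERTopic
from umap import UMAP
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=42)   # seed for the fit
topic_model = BERTopic(umap_model=umap_model)
topics, probs = topic_model.fit_transform(docs)       # reproducible across runs

# The calls in question: building and visualizing the topic hierarchy
hierarchical_topics = topic_model.hierarchical_topics(docs)
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
```

In a setup like this, repeated fits give the same topics, while the docs do not list an equivalent seed parameter for `.hierarchical_topics`.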
Just to be sure, do you get varied results out of `.hierarchical_topics` or out of `.visualize_hierarchy`? They are different code bases, and your code suggests you get varied results from `.visualize_hierarchy` and not `.hierarchical_topics`.
Why would `.visualize_hierarchy` be different if `hierarchical_topics` is the same, since `hierarchical_topics` is passed to `.visualize_hierarchy`? I can go check, but if that is the case, I'd like a way to set a random seed for `.visualize_hierarchy`!
Ok, I checked, and it is `.hierarchical_topics` that's giving different results. I saved out the results and read them back in, and when I ran `.visualize_hierarchy`, I got the same visualization.
> Why would `.visualize_hierarchy` be different if `hierarchical_topics` is the same, since `hierarchical_topics` is passed to `.visualize_hierarchy`?
They are different code bases, so any randomness can appear in either function. There has also been randomness in visualization functions before.
> Ok, I checked, and it is `.hierarchical_topics` that's giving different results. I saved out the results and read them back in, and when I ran `.visualize_hierarchy`, I got the same visualization.
That's good to know! Looking through the code of `.hierarchical_topics` (assuming you are using v0.16 of BERTopic), I don't see anything that would explain this. Does this also happen if you run it with 20NewsGroups? Could you create a self-contained reproducible example? That way, I can more easily find the issue.
Here is the code with 20NewsGroups:
```python
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
import openai
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer


def fit_reduce_model(rep_model, docs):
    """
    Defines all component models internally besides the representation model,
    which is the only one that changes. Pre-calculates embeddings, fits the
    model, and performs outlier reduction.

    parameters:
        rep_model, class instance from bertopic.representation: representation model
        docs, list of str: documents to model

    returns:
        topic_model, BERTopic model: fitted model with outliers reduced
    """
    # Define all component models
    print('Defining component models...')
    sentence_model = SentenceTransformer('allenai/scibert_scivocab_cased')
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric='cosine', random_state=42)
    ## Using default HDBSCAN model, no definition needed
    vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=10)
    representation_model = rep_model

    # Pre-calculate embeddings
    print('Calculating embeddings...')
    embeddings = sentence_model.encode(docs, show_progress_bar=True)
    # Pre-fit UMAP on the embeddings (n_components=5 here; these reduced
    # embeddings are not used further below)
    reduced_embeddings = umap_model.fit_transform(embeddings)

    # Fit the model
    print('Fitting model...')
    topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model,
                           representation_model=representation_model,
                           vectorizer_model=vectorizer_model)
    topics, probs = topic_model.fit_transform(docs, embeddings)

    # Reduce outliers
    print('Reducing outliers...')
    # This method ends up reducing all outliers even with this threshold
    new_topics = topic_model.reduce_outliers(docs, topics, strategy='embeddings', threshold=0.1)
    topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model,
                              representation_model=representation_model)

    return topic_model


representation_model = MaximalMarginalRelevance(diversity=0.3)
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]
model = fit_reduce_model(representation_model, docs)

hierarchical_topics = model.hierarchical_topics(docs)
hierarchical_topics.head()

fig = model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.show()
```
It looks like neither of the two functions introduces randomness for 20NewsGroups (see screenshots below), which is super odd to me, because the only difference between this code and what I did previously is the input docs, and I wouldn't expect that to matter.
Running it the first time:
Running just the visualization again:
Running both the hierarchical topic generation and the visualization again:
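Beyond comparing the figures, one way to check this directly is to build the hierarchy twice on the same fitted model and compare the outputs. A sketch, reusing `model` and `docs` from the code above; the column names assume the DataFrame layout documented for v0.16:

```python
# Build the hierarchy twice on the same fitted model; if the structural columns
# match, .hierarchical_topics is deterministic for this model and data.
h1 = model.hierarchical_topics(docs)
h2 = model.hierarchical_topics(docs)

cols = ["Parent_ID", "Child_Left_ID", "Child_Right_ID", "Distance"]
print(h1[cols].equals(h2[cols]))
```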
I wouldn't expect the input documents to have this kind of influence, and I expect that it stems from a difference in either your code or your environment. Did you make sure that the environments you used for your own data and for 20NewsGroups are exactly the same? As in, the same Python version and dependency versions (even numpy, numba, pandas, etc.)?
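For what it's worth, a quick way to compare the two environments is to dump the versions in each; the package list below is only a guess at the dependencies that matter here:

```python
# Print Python and package versions so the two environments can be compared
# side by side; adjust the list to whatever is installed.
import sys
from importlib.metadata import version, PackageNotFoundError

packages = ["bertopic", "umap-learn", "hdbscan", "scikit-learn",
            "sentence-transformers", "numpy", "numba", "pandas", "scipy"]

print("python", sys.version)
for pkg in packages:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```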
Yes, they're the same! I launched a Jupyter notebook instance using the same kernel that I made from a conda environment, and I haven't changed any packages in the conda environment between noticing the issue and trying to reproduce it.
In that case, I'm not entirely sure what is happening here. The data should not influence whether something is reproducible or not; it should not introduce any stochasticity or randomness unless the data itself is random.
Very weird... I'm going to stick with reading in the hierarchical topics when I need to regenerate the figure for now, and I'll let you know if I figure anything else out. Thanks for your help!
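For reference, the workaround described above might look roughly like this (a sketch reusing `model` and `docs` from the earlier code; pickle is used so the list-valued `Topics` column round-trips unchanged, and the file name is arbitrary):

```python
import pandas as pd

# Compute the hierarchy once and persist it
hierarchical_topics = model.hierarchical_topics(docs)
hierarchical_topics.to_pickle("hierarchical_topics.pkl")

# Later: reload the saved hierarchy and regenerate the figure without
# recomputing it (and therefore without any run-to-run variation)
hierarchical_topics = pd.read_pickle("hierarchical_topics.pkl")
fig = model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.show()
```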