MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
5.96k stars 743 forks

Different number of topics for different training runs on the same dataset #1817

Open abdullahfurquan opened 6 months ago

abdullahfurquan commented 6 months ago

Hi,

I am facing an issue: if I train BERTopic on the same dataset multiple times, I get a different number of topics each time.

Following the discussion in https://github.com/MaartenGr/BERTopic/issues/461, I tried the two approaches below, but neither resolved the issue. In both cases, running the code multiple times gives a different number of topics.

I am running the program on an Amazon SageMaker notebook instance.

docs: this is my list of documents used for training.

(1)
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from bertopic import BERTopic

vectorizer_model = CountVectorizer(ngram_range=(2, 3), stop_words='english')
umap_model = UMAP(random_state=42)
topic_model = BERTopic(vectorizer_model=vectorizer_model, umap_model=umap_model)
topics, probabilities = topic_model.fit_transform(docs)

(2)
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from bertopic import BERTopic

vectorizer_model = CountVectorizer(ngram_range=(2, 3), stop_words='english')
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(vectorizer_model=vectorizer_model, umap_model=umap_model)
topics, probabilities = topic_model.fit_transform(docs)

Thank you!

MaartenGr commented 6 months ago

Which version of BERTopic are you using? Also, did you make sure that when you run them multiple times they are in exactly the same environment? Lastly, could you also try it without vectorizer_model?

abdullahfurquan commented 6 months ago

Hi,

(1) BERTopic version:
bertopic.__version__: '0.16.0'

(2) I am using Amazon SageMaker and running multiple iterations in the same .ipynb file, so I think the environment remains the same.

(3) Even without vectorizer_model, I get a different number of topics for different training runs.

Below is my entire code, in case it helps. I do a few data preprocessing steps (melting, filling missing values, etc.), but these are identical across runs. 'docs' is the list of documents, which is the final data used for training:

(A) Using: umap_model = UMAP(random_state=42)
run 1: topic count 1622
run 2: topic count 1621

Code:

pip install bertopic

import pandas as pd
from bertopic import BERTopic
import bertopic
from umap import UMAP

bertopic.__version__

# Load the data
file_path = 'fs4.txt000'  # Replace with your file path
data = pd.read_csv(file_path, sep="\t")  # Adjust delimiter if necessary
data = pd.melt(data, id_vars='asin', value_vars=['item_name', 'bullet_point1', 'bullet_point2', 'bullet_point3', 'bullet_point4', 'bullet_point5'])

data.shape
data.head()

data.isna().sum()
data['value'] = data['value'].fillna(" ")
data.isna().sum()

# Prepare the list of documents
docs = data['value'].tolist()
docs[:10]

# UMAP initialisation
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# Run 1
topic_model1 = BERTopic(umap_model=umap_model)
topics1, probabilities1 = topic_model1.fit_transform(docs)

topic_info1 = topic_model1.get_topic_info()
topic_info1 = pd.DataFrame(topic_info1)
topic_info1

# Run 2
topic_model2 = BERTopic(umap_model=umap_model)
topics2, probabilities2 = topic_model2.fit_transform(docs)

topic_info2 = topic_model2.get_topic_info()
topic_info2 = pd.DataFrame(topic_info2)
topic_info2

(B) Using: umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
run 1: topic count 1638
run 2: topic count 1636

Here everything remains the same as in (A), except that in place of umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

I have used umap_model = UMAP(random_state=42)
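
To see how far apart two runs actually are, it can help to compare the label lists directly rather than only the topic counts. The following is a minimal sketch and not part of BERTopic: `count_topics`, `same_partition`, and `compare_runs` are hypothetical helper names, and the toy lists stand in for `topics1` and `topics2` from the code above. It assumes the usual convention that the label -1 marks outlier documents.

```python
from typing import Dict, Sequence


def count_topics(topics: Sequence[int]) -> int:
    """Number of distinct topics, ignoring the -1 outlier label."""
    return len({t for t in topics if t != -1})


def same_partition(a: Sequence[int], b: Sequence[int]) -> bool:
    """True if a and b group documents identically, allowing renumbered topic ids."""
    if len(a) != len(b):
        return False
    forward: Dict[int, int] = {}
    backward: Dict[int, int] = {}
    for x, y in zip(a, b):
        # Each label in a must map to exactly one label in b, and vice versa.
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True


def compare_runs(a: Sequence[int], b: Sequence[int]) -> dict:
    """Summarise how two lists of topic labels differ."""
    return {
        "topics_run_1": count_topics(a),
        "topics_run_2": count_topics(b),
        "docs_with_different_label": sum(1 for x, y in zip(a, b) if x != y),
        "same_up_to_renumbering": same_partition(a, b),
    }


# Toy labels standing in for topics1 / topics2 from two fit_transform calls.
run_1 = [0, 0, 1, 1, -1, 2]
run_2 = [1, 1, 0, 0, -1, 2]
print(compare_runs(run_1, run_2))
```

If `same_up_to_renumbering` is true, the two runs found the same clusters and only the numbering differs; if the topic counts differ, as in the runs reported here, the clusterings themselves differ.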

MaartenGr commented 6 months ago

Hmmm, it should be reproducible generally with this. I just checked the UMAP repository and it seems a change was recently merged that fixes this. Could you try installing UMAP from that commit/PR and see whether that fix solves your issue?

abdullahfurquan commented 6 months ago

> Hmmm, it should be reproducible generally with this. I just checked the UMAP repository and it seems a change was recently merged that fixes this. Could you try installing UMAP from that commit/PR and see whether that fix solves your issue?

I am using Amazon SageMaker and am not sure how to do that. Are there any instructions on how to install directly from that commit/PR?

I tried the following from the SageMaker terminal, but I get an error in step (b):
(a) git clone https://github.com/lmcinnes/umap.git
(b) sudo python setup.py install

MaartenGr commented 6 months ago

You can find more about installing from a commit here. A small tip though, ChatGPT/Google can also help you with these kinds of installation questions!

abdullahfurquan commented 6 months ago

I followed the steps below, but my results are still not reproducible:

(1) I downloaded the zip file of the master branch of the repo https://github.com/lmcinnes/umap/tree/master. It is saved as umap-master.zip.

(2) pip install umap-master.zip

(3) I ran the code below to locate the umap_.py file and manually checked whether the change from https://github.com/lmcinnes/umap/pull/1081 (self.n_jobs = 1) is present. I can confirm that it is.
[One odd thing: when I installed the package the usual way, i.e. pip install umap-learn, the change (self.n_jobs = 1) was not present in umap_.py.]

import umap
import os

base_dir = umap.__file__
file_path = os.path.join(os.path.dirname(base_dir), 'umap_.py')
print(file_path)

/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/umap/umap_.py
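
The manual check in step (3) can also be scripted, so that it is easy to repeat after every reinstall. This is a minimal sketch using only the standard library; `module_source_contains` is a hypothetical helper name, and the search string `self.n_jobs = 1` is taken from the description of the PR above, so it may not match the exact line in every UMAP version.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def module_source_contains(module_name: str, needle: str) -> Optional[bool]:
    """Return whether the module's .py source contains `needle`.

    Returns None if the module (or its source file) cannot be located.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when the parent package is not installed.
        return None
    if spec is None or not spec.origin or not spec.origin.endswith(".py"):
        return None
    return needle in Path(spec.origin).read_text(encoding="utf-8")


# Prints True/False when umap-learn is installed, None otherwise.
print(module_source_contains("umap.umap_", "self.n_jobs = 1"))
```

Running this in the same notebook kernel that trains the model confirms that the kernel is really importing the patched copy, rather than another installation elsewhere on the path.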

MaartenGr commented 6 months ago

@abdullahfurquan I can't seem to reproduce the issue if I use the latest commit from UMAP's main branch.

I installed it as follows:

pip install git+https://github.com/lmcinnes/umap.git@ebe5051cf21e778beb9f473ac348e749e0e21d12
pip install bertopic

Then I ran the following, which gives no errors:

import numpy as np
from umap import UMAP
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Get docs
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

# Run model 1
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model)
topics_1, probs_1 = topic_model_1.fit_transform(docs)

# Run model 2
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_2 = BERTopic(umap_model=umap_model)
topics_2, probs_2 = topic_model_2.fit_transform(docs)

# Check if the two models are equal by comparing the resulting topics and probabilities
assert topics_1 == topics_2
assert np.array_equal(probs_1, probs_2)

scarlett-k-nhs commented 5 months ago

Hi Maarten, I am having the same problem, but this code and the fix above did not resolve it. I am working in a VM: when I run the same parameters and random state in one session, the results are the same. If I then sign off for the night and sign back into my VM the next morning, the results differ.

MaartenGr commented 5 months ago

@scarlett-k-nhs Do you perhaps have a reproducible example? As of now, I cannot seem to reproduce the issue as shown in my message above. I cannot find the source of the issue if I cannot reproduce it.

mtaylor57 commented 5 months ago

Hi @MaartenGr, thanks for your response. I am replying on behalf of @scarlett-k-nhs. Reproducing this issue has been quite difficult even for me, because it seems to happen randomly and without an obvious cause. As Scarlett said, we have had problems reproducing the same topics from day to day. I am working in a VM which connects to a remote Linux compute on Azure Machine Learning Studio. I am working in the same notebook, in the same environment each day, but one day I will get 15 topics and the next only 4 (for example).

Am I correct in saying that the only source of randomisation is UMAP (which should be prevented by setting random_state) or are there other sources of randomness that I am missing?

A couple of working theories I have are:

I would really appreciate any thoughts you have on this. The code is below, but I think it may be difficult to reproduce since reproducibility itself seems to be the problem.

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
from bertopic.representation import KeyBERTInspired

import pandas as pd
import numpy as np

dataset = ...  # get dataset from remote datastore
my_data = dataset.to_pandas_dataframe()
my_data = my_data.iloc[820:]  # get the unlabeled data only
my_data = my_data[['Comment ID','Comment Title','Comment Text']]

docs = my_data["Comment Text"].values.tolist()

umap_model = UMAP(random_state=42,n_components=4)
hdbscan_model = HDBSCAN(min_cluster_size=100,prediction_data=True)
english_stopwords = stopwords.words('english')
my_stopwords = english_stopwords #+some custom stop words
print(my_stopwords)
vectorizer_model = CountVectorizer(stop_words=my_stopwords)
keybert_model = KeyBERTInspired()
model = BERTopic(umap_model=umap_model,hdbscan_model=hdbscan_model,vectorizer_model=vectorizer_model,representation_model=keybert_model)
topics, probs = model.fit_transform(documents=docs)

MaartenGr commented 5 months ago

@mtaylor57

Could you share the version of all dependencies in your environment as well as the Python version? Also, did you try to install UMAP from its main branch as suggested above?
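
One way to gather this information in a single step is sketched below, using only the standard library. `collect_versions` is a hypothetical helper, and the `PACKAGES` tuple is an assumption about which distributions are relevant to a BERTopic pipeline; adjust it to your environment.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable

# Distribution names as published on PyPI (an assumed, non-exhaustive list).
PACKAGES = (
    "bertopic",
    "umap-learn",
    "hdbscan",
    "pynndescent",
    "numba",
    "numpy",
    "scikit-learn",
    "sentence-transformers",
    "torch",
    "transformers",
)


def collect_versions(packages: Iterable[str] = PACKAGES) -> Dict[str, str]:
    """Return the Python version plus the installed version of each package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in collect_versions().items():
    print(f"{name}=={version}")
```

Running it inside the notebook kernel, rather than in a separate terminal, reports the versions the model is actually using.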

> The code is below but I think it may be difficult to reproduce since reproducibility itself seems to be the problem.

That's the thing. If I have some code that, when run multiple times, gives a different output each time, then I at least understand under what conditions the issue appears. Moreover, it allows me to try out some things, run it a couple of times, and check whether the output is stable.

For instance, the code I shared demonstrates that there is some stability if the exact same pipeline is run twice.

mtaylor57 commented 5 months ago

Hi @MaartenGr, my package versions of bertopic and its direct dependencies are:

bertopic==0.16.0
numpy==1.23.5
pandas==2.0.3
plotly==5.19.0
scikit-learn==1.3.2
sentence-transformers==2.3.1
tqdm==4.66.2
umap-learn==0.5.5
numba==0.58.1

Let me know if you need the versions of any others.

I have not yet tried installing umap from the main branch, I will give it a go.

To clarify, my results seem reproducible during a single day but once I've shut down and restarted the next day that's when I've been experiencing the problem.
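
Since the documents are loaded from a remote datastore at the start of each session, one thing that can be ruled out cheaply is whether `docs` is identical from one day to the next. The sketch below is an assumption about how one might check that; it is not something suggested in the thread. `fingerprint_docs` is a hypothetical helper, and the two-item list stands in for the real `docs`.

```python
import hashlib
from typing import Iterable


def fingerprint_docs(docs: Iterable[str]) -> str:
    """SHA-256 over all documents, sensitive to both content and order."""
    digest = hashlib.sha256()
    for doc in docs:
        encoded = doc.encode("utf-8")
        # Length prefix so that ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()


docs = ["first comment", "second comment"]  # stand-in for the real list
print(fingerprint_docs(docs))
```

If the printed hash differs between sessions, the input changed (for example in row order), and the model's output would be expected to change with it regardless of `random_state`.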

MaartenGr commented 5 months ago

@mtaylor57 Can you share all dependencies? For instance, I am missing HDBSCAN.

> I have not yet tried installing umap from the main branch, I will give it a go.

Thanks, looking forward to seeing whether that helps.

> To clarify, my results seem reproducible during a single day but once I've shut down and restarted the next day that's when I've been experiencing the problem.

When you shut down and restart the environment, could it be that there is some update to the environment between them? For instance, do you need to re-install packages or could it be that the server updates some packages?
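
A rough way to answer this question is to record the installed packages in one session and compare them in the next. The following is a sketch under stated assumptions, using only the standard library: `snapshot_environment`, `diff_snapshots`, and `compare_with_saved` are hypothetical helper names, and the snapshot file must live somewhere that survives a restart of the compute.

```python
import json
from importlib import metadata
from pathlib import Path
from typing import Dict, Tuple


def snapshot_environment() -> Dict[str, str]:
    """Map every installed distribution name (lower-cased) to its version."""
    snapshot: Dict[str, str] = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        version = meta.get("Version") if meta is not None else None
        if name and version:
            snapshot[name.lower()] = version
    return snapshot


def diff_snapshots(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {package: (old_version, new_version)} for every package that changed."""
    changed: Dict[str, Tuple[str, str]] = {}
    for name in sorted(set(old) | set(new)):
        before = old.get(name, "absent")
        after = new.get(name, "absent")
        if before != after:
            changed[name] = (before, after)
    return changed


def compare_with_saved(path: Path) -> Dict[str, Tuple[str, str]]:
    """Diff the current environment against the snapshot at `path`, then update it."""
    current = snapshot_environment()
    previous = json.loads(path.read_text()) if path.exists() else current
    path.write_text(json.dumps(current, indent=2, sort_keys=True))
    return diff_snapshots(previous, current)
```

Calling `compare_with_saved(Path("env_snapshot.json"))` at the top of the notebook each day (the file name is only an example) returns an empty dict when nothing changed, and otherwise lists each package whose version differs from the previous session.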