MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Zero topic distributions for some documents using approximate_distribution() #2150

Open Connorwz opened 2 months ago

Connorwz commented 2 months ago

Have you searched existing issues? 🔎

Describe the bug

Dear creators of BERTopic, thanks for your work; this package is amazing and I have been using it for a long time. However, I found that some documents (whether or not they were used to train the model) have zero topic distributions across all topics created by BERTopic after applying the approximate_distribution() function to them. In other words, some rows of the topic distribution matrix produced by approximate_distribution() sum to 0. The code below does three things: (1) builds a simple BERTopic model with PCA and KMeans (from cuML) as the dimensionality reduction and clustering techniques; (2) defines a splitting function to split documents and pre-calculated embeddings; (3) fits the model on the training data and computes topic distributions for both the training and testing sets.

If more information is needed, please let me know. Thanks!

Reproduction

import numpy as np
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from cuml.decomposition import PCA
from cuml.cluster import KMeans

def pk(num_cluster):
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    umap_model = PCA(n_components=10)
    hdbscan_model = KMeans(n_clusters=num_cluster)
    vectorizer_model = CountVectorizer()
    topic_model = BERTopic(embedding_model=embedding_model, umap_model=umap_model,
                           hdbscan_model=hdbscan_model, vectorizer_model=vectorizer_model,
                           calculate_probabilities=False, verbose=True)
    return topic_model
def tr_te_split(documents,df,embeddings, i=1):
    indices = np.arange(len(documents))
    tr_ind, te_ind = train_test_split(indices, test_size=0.2, shuffle= True, random_state=i)
    tr_df = df.iloc[tr_ind,:]
    te_df = df.iloc[te_ind,:]
    tr_documents = [documents[ind] for ind in tr_ind]
    te_documents = [documents[ind] for ind in te_ind]
    tr_embeddings = embeddings[tr_ind,:]
    return tr_df,te_df,tr_documents,te_documents,tr_embeddings
def check_zero_exposure(arr):
    # Return 1 if any document's topic distribution sums to zero, else 0.
    return 1 if 0 in arr.sum(axis=1) else 0
zero_exposures = {}
for year in year_list:
    df = pd.read_csv(df_folder+f"/contem_{year}_senti.csv")
    documents = df.documents.tolist()
    embeddings = np.load(embeddings_folder+f"/contem_{year}_senti_embeddings.npy")
    tr_df, te_df, tr_documents,te_documents,tr_embeddings = tr_te_split(documents,df,embeddings)
    tr_df.reset_index(drop=True,inplace=True)
    te_df.reset_index(drop=True,inplace=True)
    topic_model = pk(cluster_num)
    topic_model.fit(tr_documents,tr_embeddings)
    tr_topic_dist, _ = topic_model.approximate_distribution(tr_documents)
    te_topic_dist, _ = topic_model.approximate_distribution(te_documents)
    zero_exposure = [check_zero_exposure(tr_topic_dist),check_zero_exposure(te_topic_dist)]
    zero_exposures[year] = zero_exposure

# zero_exposures
# {2014: [1, 1],
#  2015: [1, 1],
#  2016: [1, 1],
#  2017: [1, 1],
#  2018: [1, 1],
#  2019: [1, 1],
#  2020: [1, 1],
#  2021: [1, 1],
#  2022: [1, 1],
#  2023: [1, 1]}
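To pinpoint which documents are affected rather than just flagging that some exist, the all-zero rows can be located directly (a small hypothetical helper, not part of the original code; `dist` stands in for a matrix returned by approximate_distribution()):

```python
import numpy as np

def zero_rows(dist):
    """Return the indices of documents whose topic distribution sums to zero."""
    return np.flatnonzero(dist.sum(axis=1) == 0)

# Toy distribution matrix: the second document has no topic mass at all.
dist = np.array([[0.7, 0.3],
                 [0.0, 0.0],
                 [0.2, 0.8]])
print(zero_rows(dist))  # -> [1]
```

Inspecting the flagged documents (e.g. very short or off-topic texts) can hint at why they fall below the similarity threshold.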

BERTopic Version

0.16.2

MaartenGr commented 2 months ago

Have you tried looking at some of the hyperparameters of approximate_distribution? Since there are similarity metrics/values involved, it might help to look at whether you can reduce the minimum similarity necessary. You can find more about some of them here.

Connorwz commented 2 months ago

Thanks for your reply! However, may I ask why the minimum similarity affects my problem, where some documents have zero probability for every cluster/topic created by the model?

MaartenGr commented 2 months ago

Sure! You need a minimum similarity to decide which subset of topics is most related to your document. It allows you to filter the most related topics. By lowering the minimum similarity, you will get more topics related to the document (although their similarity values will not change).
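The filtering effect can be sketched with a toy example (plain NumPy, not BERTopic's actual implementation; the similarity values are made up): lowering the threshold keeps more topics for a document, while the similarity values themselves stay the same.

```python
import numpy as np

# Hypothetical document-to-topic similarities for one document.
sims = np.array([0.05, 0.08, 0.30])

def kept_topics(sims, min_similarity):
    """Indices of topics whose similarity meets the threshold."""
    return np.flatnonzero(sims >= min_similarity)

print(kept_topics(sims, 0.1))   # -> [2]
print(kept_topics(sims, 0.01))  # -> [0 1 2]
```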

Connorwz commented 2 months ago

Thanks for your explanations! So there is a mechanism within approximate_distribution() such that, if the similarities between a document and all topics fall below the minimum similarity, it assigns zero probability to all of them. Also, are probabilities calculated as the weighted similarities of those topics whose similarity with the document is above the minimum similarity?

MaartenGr commented 2 months ago

So there is a mechanism within approximate_distribution() such that, if the similarities between a document and all topics fall below the minimum similarity, it assigns zero probability to all of them.

That's correct!

Also, are probabilities calculated as the weighted similarities of those topics whose similarity with the document is above the minimum similarity?

Yes! In practice, it calculates all the similarities and then simply ignores those that do not exceed the threshold, but the result is the same.
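The two-step behavior described in this exchange can be mimicked for a single document with plain NumPy (a sketch of the thresholding idea, not BERTopic's actual code): similarities below the threshold are zeroed out, the survivors are normalized into probabilities, and if nothing survives the row stays all zeros, which is exactly the symptom reported in this issue.

```python
import numpy as np

def approximate_row(sims, min_similarity=0.1):
    """Zero out similarities below the threshold, then normalize the rest.

    If no topic similarity meets the threshold, the row remains all zeros.
    """
    kept = np.where(sims >= min_similarity, sims, 0.0)
    total = kept.sum()
    return kept / total if total > 0 else kept

print(approximate_row(np.array([0.2, 0.05, 0.3])))   # roughly [0.4, 0.0, 0.6]
print(approximate_row(np.array([0.05, 0.02, 0.08])))  # all zeros
```

Under this sketch, passing a lower min_similarity is what turns an all-zero row back into a proper distribution.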