MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.03k stars, 756 forks

merge_topics ambiguous output #979

Closed pragalbh-dev closed 1 year ago

pragalbh-dev commented 1 year ago

I am passing a list of 147 lists of topics to merge_topics. It should merge the topics within each list, so the updated model should have 147 topics; instead, the revised model has 100 topics. My data is hierarchical text. I was trying to get the topics for level 4 of the hierarchy, then merge the children of each level-3 class to get the topics for level 3, and so on. Because the output size is inconsistent, I am unable to map the merged topics to their parent classes.


# topic_model_sup is a supervised topic model
import numpy as np
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# main_data_sup is a DataFrame defined elsewhere (not shown)
docs = main_data_sup['description'].to_list()
y = main_data_sup['dummy_y'].to_list()
embeddings = np.array(main_data_sup['embeddings'].to_list())

embeddings_model = SentenceTransformer('all-roberta-large-v1')
vectorizer_model = CountVectorizer(stop_words="english")

# Not used in the supervised setup:
# umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
# hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

# Skip dimensionality reduction and use a classifier in place of clustering
empty_dimensionality_model = BaseDimensionalityReduction()
clf = LogisticRegression()
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Create a fully supervised BERTopic instance
topic_model_sup = BERTopic(
    n_gram_range=(1, 2),
    umap_model=empty_dimensionality_model,
    hdbscan_model=clf,
    ctfidf_model=ctfidf_model,
    verbose=True,
    vectorizer_model=vectorizer_model,
    diversity=0.2,
    embedding_model=embeddings_model,
)
topic_model_sup = topic_model_sup.fit(docs, embeddings=embeddings, y=y)

# Map each topic ID to its class; dummy_to_class_map is defined elsewhere (not shown)
mappings = topic_model_sup.topic_mapper_.get_mappings()
mappings = {value: dummy_to_class_map[key] for key, value in mappings.items()}

topic_to_industry_lev3_map = mappings

# Invert the mapping: class -> topic ID
industry_lev3_to_topic_map = {v: k for k, v in topic_to_industry_lev3_map.items()}

topic_model_prev = topic_model_sup
lev = 2

class_to_lev_topic_id = industry_lev3_to_topic_map

# For each class at level `lev`, collect its child classes at level `lev + 1`
new_class_old_classes_map = main_data_sup.groupby(f'level{lev}')[f'level{lev + 1}'].unique()
new_class_old_topics_map = new_class_old_classes_map.apply(lambda x: [class_to_lev_topic_id[i] for i in x])

# topics_to_merge is a list of lists: the topic IDs of the original model are
# grouped so that each list contains sibling classes in the hierarchy
topics_to_merge = new_class_old_topics_map.to_list()

topic_model_prev.merge_topics(docs, topics_to_merge)

MaartenGr commented 1 year ago

The way you shared your code makes it quite difficult to read. Could you perhaps use code blocks in your post, as described here?

Also, could you show what is exactly in topics_to_merge?
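
One way to answer that without pasting all 147 lists is to print a short summary. The sketch below is plain Python on toy data; `summarize_groups` is a hypothetical helper name, not part of BERTopic, and it assumes `topics_to_merge` is a list of lists of integer topic IDs, as in the code above.

```python
from collections import Counter


def summarize_groups(topics_to_merge, preview=3):
    """Summarize a list of topic-ID groups (hypothetical helper, not a BERTopic API)."""
    sizes = [len(group) for group in topics_to_merge]
    # int() also normalizes NumPy integer types to plain Python ints
    unique_ids = {int(t) for group in topics_to_merge for t in group}
    return {
        "n_groups": len(topics_to_merge),
        "n_unique_topic_ids": len(unique_ids),
        "group_size_counts": dict(Counter(sizes)),
        "preview": [[int(t) for t in group] for group in topics_to_merge[:preview]],
    }


# Toy data standing in for the real topics_to_merge
example = [[0, 3], [1], [2, 4, 5]]
summary = summarize_groups(example)
print(summary)
```

The number of groups, the number of distinct topic IDs, and the distribution of group sizes are usually enough to spot an unexpected shape in the input.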

pragalbh-dev commented 1 year ago

I have edited the issue accordingly.

MaartenGr commented 1 year ago

It is difficult to say what exactly is going on, but there might be something wrong with how the topics_to_merge variable is being created. Did you make sure that each set of topics contains only topics that are not found in any other set of topics?
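
That exclusivity check can be sketched in a few lines of plain Python. `find_overlapping_topics` is a hypothetical helper name, not a BERTopic function, and the two lists are toy data rather than the values from this issue.

```python
from collections import defaultdict


def find_overlapping_topics(topics_to_merge):
    """Return {topic_id: [group indices]} for topic IDs that appear in more than one group."""
    groups_per_topic = defaultdict(set)
    for idx, group in enumerate(topics_to_merge):
        for topic in group:
            groups_per_topic[int(topic)].add(idx)
    return {
        topic: sorted(indices)
        for topic, indices in groups_per_topic.items()
        if len(indices) > 1
    }


disjoint = [[0, 3], [1], [2, 4]]
overlapping = [[0, 3], [3, 1], [2, 4]]

print(find_overlapping_topics(disjoint))     # {}
print(find_overlapping_topics(overlapping))  # {3: [0, 1]}
```

An empty result means no topic ID is shared between groups; any entry points at the groups that would compete for the same topic.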

pragalbh-dev commented 1 year ago

Yes, the topics are mutually exclusive.

MaartenGr commented 1 year ago

Then I am not entirely sure what is happening. Could you perhaps create a reproducible example of your error?
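
Before building a full reproduction, it may help to compute how many topics should remain if every group collapses into a single topic. The sketch below is library-free and uses toy data. `expected_topic_count` is a hypothetical helper; it assumes the groups are disjoint and that topics not listed in any group are left unchanged. With a fitted model, the current topic IDs could presumably be taken from `set(topic_model.topics_)`.

```python
def expected_topic_count(all_topic_ids, topics_to_merge):
    """Return (expected number of topics after merging, group IDs missing from the model).

    Hypothetical helper: assumes disjoint groups and that each group
    collapses into exactly one topic.
    """
    all_ids = {int(t) for t in all_topic_ids}
    merged_away = 0
    unknown = set()
    for group in topics_to_merge:
        ids = {int(t) for t in group}
        unknown |= ids - all_ids          # IDs in the group that the model does not have
        present = ids & all_ids
        merged_away += max(len(present) - 1, 0)  # a group of k topics removes k - 1 topics
    return len(all_ids) - merged_away, sorted(unknown)


# Toy data: six topics merged into three groups
all_ids = range(6)
groups = [[0, 3], [1], [2, 4, 5]]
count, unknown = expected_topic_count(all_ids, groups)
print(count, unknown)  # 3 []
```

A non-empty second value would mean that some IDs in the groups do not exist among the model's current topics, which is one thing worth ruling out when the observed count differs from the expected one.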

pragalbh-dev commented 1 year ago

Sure

On Sun, Feb 12, 2023, 7:41 PM Maarten Grootendorst @.***> wrote:

Then I am not entirely sure what is happening. Could you perhaps create a reproducible example of your error?

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!