MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

model.visualize_hierarchical_documents now shows whole topics #1502

Open smbslt3 opened 1 year ago

smbslt3 commented 1 year ago

image image image

The topic model produced 40 topics. Topics 10 and 13 appear in the topic data frame and in the topic distance visualization, and the hierarchical clustering plot also shows both of them.

image

However, when I run topic_model.visualize_hierarchical_documents(), seemingly random topics go missing. Whenever I change the number of topics while tuning the hyperparameters, 1-2 topics are lost each time. At first I suspected that these topics had too few documents to be visualized, but topics with even fewer documents than topics 10 and 13 were visualized.

I followed the steps on this documentation page: https://maartengr.github.io/BERTopic/api/plotting/hierarchical_documents.html#bertopic.plotting._hierarchical_documents.visualize_hierarchical_documents In the example on that page, no topics are missing.

What could be the problem, and how can I fix it? I'm running bertopic 0.15.0.

MaartenGr commented 1 year ago

I am not entirely sure what is happening. Could you share your full code? Although the images are helpful, it is unclear where in the process they are generated and what the training process looks like.

smbslt3 commented 1 year ago

Here is the code.


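from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
import plotly.io as pio

# total_docs, RANDOM_STATE and vectorizer are defined elsewhere (the vectorizer is shown later in this thread)

# 1. Set up the embedding space
sentence_model = SentenceTransformer("jhgan/ko-sroberta-multitask", device='cuda:0')    # embedding model
embeddings = sentence_model.encode(total_docs, show_progress_bar=True, batch_size=256)   # compute embeddings

# 2. Train BERTopic and extract hierarchical topics
umap_model = UMAP(random_state=RANDOM_STATE)   # fixed seed for reproducible results

topic_model = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer,
    # nr_topics="auto",   # detect the number of topics automatically; set this first, inspect the topic distribution in the visualizations below, then reduce the number of topics
    nr_topics=40,   # limit the total number of topics
    top_n_words=10,   # number of top words to use per topic
    calculate_probabilities=True,
    umap_model=umap_model,   # UMAP with a fixed random state
)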

topics, probabilities = topic_model.fit_transform( total_docs )

---

topic_model.visualize_topics()    # visualize topics

---

hierarchical_topics = topic_model.hierarchical_topics(total_docs)
fig_raw_hirachi = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

pio.write_image(fig_raw_hirachi, 'fig_raw_hirachi.png',scale=4)

fig_raw_hirachi

---

reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
fig_raw_dist = topic_model.visualize_hierarchical_documents(total_docs, hierarchical_topics, reduced_embeddings=reduced_embeddings)

pio.write_image(fig_raw_dist, 'fig_raw_dist_reduced.png',scale=4)

fig_raw_dist

After running it multiple times, I figured out that topic_model.visualize_hierarchical_documents(docs, hierarchical_topics) automatically merges the two closest topics (the first two nodes to be joined in the hierarchical clustering, i.e. the row at index 0 of hierarchical_topics).

What puzzles me is that I passed the same hierarchical_topics to both topic_model.visualize_hierarchy() and topic_model.visualize_hierarchical_documents(), but the problem only occurs in the latter.

Even when changing the number of topics (by setting different nr_topics or using topic_model.merge_topics()), the problem I've mentioned above always occurs in the same way.

MaartenGr commented 1 year ago

I am not entirely sure what is happening here, but it might be worthwhile to not set nr_topics at all and instead control the number of topics directly with min_topic_size. There does indeed seem to be a problem with merging topics, and this should make it a bit easier to track down.

smbslt3 commented 1 year ago

image

  1. If you look closely at the capture, you can see what is happening. There are 19 topics in topic_model.get_topic_info(), numbered -1 through 17.

image

  2. In topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics), all topics except -1 are visualized normally. In this case, topics 0 and 10 are the two closest (first clustered) topics.

image

  3. Use topic_model.visualize_documents(total_docs, reduced_embeddings=reduced_embeddings) to visualize without hierarchical clustering. Here, 18 topics, numbered 0 to 17, are visualized normally.

image

  4. In the output of topic_model.visualize_hierarchical_documents(total_docs, hierarchical_topics, reduced_embeddings=reduced_embeddings), topics 0 and 10 disappear and a new topic 18 is created. These are the two topics that were clustered first in step 2.

MY Question: Is topic_model.visualize_hierarchical_documents() unable to visualize the distribution of non-clustered vectors at level 0?

By the way, what is the reason for recommending adjusting min_topic_size rather than directly specifying nr_topics?

MaartenGr commented 1 year ago

> MY Question: Is topic_model.visualize_hierarchical_documents() unable to visualize the distribution of non-clustered vectors at level 0?

It should be able to visualize them, and I am not sure why it skips over those topics, especially if you did not manually change them after fitting.

> By the way, what is the reason for recommending adjusting min_topic_size rather than directly specifying nr_topics?

This might be the reason why it shows a problem. With nr_topics, the topics are reduced after they are found which might have influenced the hierarchical topics. In contrast, min_topic_size directly influences the number of topics created without needing additional aggregation.

Whenever issues like these appear, it is preferred to minimize the additional parameters to see where the issue originates.

One additional thing I noticed from your code. The UMAP model should be something like this instead:

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

The default settings are not particularly good out of the box for this kind of analysis.

Also, what is the vectorizer model that you created? It

MaartenGr commented 1 year ago

Aaah, I think I know what the issue is here! The level 0 that you created already shows a few topics merged, which explains the ones that are "missing". They are not actually missing; they seem to have been merged into new topics. I think you can circumvent this by setting a larger nr_levels when visualizing the topics.

smbslt3 commented 1 year ago

That's what I just mentioned above. Here is a comparison of the two distribution visualizations.

image

image

After I changed the code to topic_model.visualize_hierarchical_documents(total_docs, hierarchical_topics, reduced_embeddings=reduced_embeddings, nr_levels=15), the result is still the same (topics 0 and 10 are still merged at level 0).

The vectorizer is shown below. I made a custom vectorizer so that I could use a Korean tokenizer.


from sklearn.feature_extraction.text import CountVectorizer

# kiwi (the Korean tagger) and stopwords are defined elsewhere

class CustomTokenizer:

    def __init__(self, tagger):
        self.tagger = tagger

    def __call__(self, sent):
        # sent = sent[:1000000]
        # Keep only nouns/predicates (tags starting with N or V) and foreign words (SL), excluding stopwords
        word_tokens = [t[0] for t in self.tagger.tokenize(sent, stopwords=stopwords) if t[1][0] in 'NV' or 'SL' in t[1]]
        # word_tokens = [t[0] for t in self.tagger.tokenize(sent) if t.tag[:2] in ['NNG', 'NNP', 'VV', 'VA', 'SL' ] ]  # alternative: filter on an explicit tag list
        # Drop single-character tokens
        result = [word for word in word_tokens if len(word) > 1]
        return result

custom_tokenizer = CustomTokenizer(kiwi)
# custom_tokenizer = lambda x: [w for w in x.split() if len(w)>1]

vectorizer = CountVectorizer(tokenizer=custom_tokenizer,
                             max_features=3000, min_df=5,
                            )

Additionally, I would appreciate your advice about UMAP. Thanks.

MaartenGr commented 1 year ago

Then it might indeed be a bug. There might be something going on with how the "levels" are selected. I remember that it was resolved in a previous version, but that does not seem to be the case. Although I do not expect it to help, it might be worthwhile to use level_scale="log" and see if that changes things when using .visualize_hierarchical_documents.
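
A small sketch may help show why a larger nr_levels does not necessarily bring back the unmerged topics. It is not BERTopic's actual level-selection code: pick_thresholds is a hypothetical helper, and it assumes only that every level's distance threshold is drawn from the merge distances that occur in hierarchical_topics. Under that assumption, the lowest threshold can never fall below the smallest merge distance, so the closest pair of topics is merged at every level.

```python
def pick_thresholds(distances, nr_levels):
    """Pick up to nr_levels thresholds from the observed merge distances.

    Hypothetical helper (an assumption, not BERTopic code): the sorted
    distances are split into nearly equal chunks and the largest value of
    each chunk becomes one threshold.
    """
    ordered = sorted(distances)
    if not ordered:
        return []
    # There cannot be more distinct levels than there are merges.
    nr_levels = max(1, min(nr_levels, len(ordered)))
    thresholds = []
    for level in range(1, nr_levels + 1):
        end = (level * len(ordered)) // nr_levels  # exclusive end of this chunk
        thresholds.append(ordered[end - 1])
    return thresholds


# Made-up merge distances for a toy hierarchy with four merges.
merge_distances = [0.9, 0.2, 0.5, 0.7]

for nr_levels in (2, 4, 15):
    thresholds = pick_thresholds(merge_distances, nr_levels)
    # The lowest threshold is never below the smallest merge distance.
    print(nr_levels, thresholds, min(thresholds) >= min(merge_distances))
```

Whatever value of nr_levels is passed, the check in the last column stays True, which matches the observation that nr_levels=15 still merged topics 0 and 10.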

smbslt3 commented 1 year ago

Okay, I understand now that this is an unexpected bug. Thanks for developing and maintaining the BERTopic project. I hope this bug is not difficult to solve :)

MaartenGr commented 1 year ago

I just looked at it again but I am not sure what is triggering the bug since I cannot reproduce it. If somebody else runs into this issue, please share!

Virginie74 commented 1 year ago

Thanks a lot for your work on BERTopic. It is a great package. I really appreciate the modularity and the quality of the documentation. I am also running into this issue. My clustering ends up with 50 topics numbered from 0 to 49. When using .visualize_hierarchical_documents, the topics included in parent topic 50 are merged together and don't show up in the figure. It is as if level 0 and level 1 are the same, whereas level 0 should correspond to the topics created in the pipeline.

MaartenGr commented 1 year ago

@Virginie74 Do you by chance have a reproducible example? I am not sure where in the code it is going wrong, so making it reproducible would help in finding out what should be fixed.

Virginie74 commented 1 year ago

I can't show you any figure, but it looks like the one above. I tried looking at the code, and I think the issue appears around line 167:

for index, max_distance in enumerate(max_distances):

    # Get topics below `max_distance`
    mapping = {topic: topic for topic in df.topic.unique()}
    selection = hierarchical_topics.loc[hierarchical_topics.Distance <= max_distance, :]
    selection.Parent_ID = selection.Parent_ID.astype(int)
    selection = selection.sort_values("Parent_ID")

    for row in selection.iterrows():
        for topic in row[1].Topics:
            mapping[topic] = row[1].Parent_ID

    # Make sure the mappings are mapped 1:1
    mappings = [True for _ in mapping]
    while any(mappings):
        for i, (key, value) in enumerate(mapping.items()):
            if value in mapping.keys() and key != value:
                mapping[key] = mapping[value]
            else:
                mappings[i] = False

    # Create new column
    df[f"level_{index+1}"] = df.topic.map(mapping)
    df[f"level_{index+1}"] = df[f"level_{index+1}"].astype(int)

This is where the level columns are created, but no column is created for level 0. I have not found a solution yet.

Virginie74 commented 1 year ago

I took a closer look at the code. This allowed me to realize that there was no bug as such and that it was a problem of figure interpretation. At level 0, I expected to see all the topics created by the pipeline. But, if I've understood the code correctly, level 0 shows the first level of the hierarchy: topics whose distance is below the first max_distances threshold are merged into their parent topic. Did I understand it correctly?

Thanks again for the great work done with BERTopic!

MaartenGr commented 1 year ago

Ah yes, that should indeed be correct! I believe I did try to show it first without hierarchy by leveraging the distance structure (i.e., by setting the first distance to 1); however, that does not seem to be a proper solution.
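
The behaviour described in the last few comments can be reproduced with a minimal, dependency-free sketch of the mapping step from the loop quoted earlier in this thread. The function name level_mapping and the toy hierarchy are hypothetical, and the sketch assumes, as in hierarchical_topics, that each merge lists all original topics under its parent and that later merges get larger parent IDs.

```python
def level_mapping(topics, merges, max_distance):
    """Map each original topic to the topic shown at one level.

    Every merge whose distance is <= max_distance is applied, mirroring
    the `Distance <= max_distance` selection in the quoted loop.
    """
    mapping = {topic: topic for topic in topics}
    selected = [m for m in merges if m["distance"] <= max_distance]
    # Apply merges in ascending parent ID so larger parents overwrite smaller ones.
    for merge in sorted(selected, key=lambda m: m["parent_id"]):
        for topic in merge["topics"]:
            mapping[topic] = merge["parent_id"]
    return mapping


# Hypothetical toy hierarchy: four topics (0-3) and three merges.
topics = [0, 1, 2, 3]
merges = [
    {"parent_id": 4, "topics": [0, 2], "distance": 0.2},
    {"parent_id": 5, "topics": [1, 3], "distance": 0.5},
    {"parent_id": 6, "topics": [0, 1, 2, 3], "distance": 0.9},
]

# The lowest level uses the smallest threshold, which already equals the
# distance of the first merge, so topics 0 and 2 are replaced by parent 4.
lowest_level = level_mapping(topics, merges, max_distance=0.2)
print(lowest_level)  # {0: 4, 1: 1, 2: 4, 3: 3}
```

In this toy case topics 0 and 2 vanish from the lowest level and topic 4 appears in their place, in the same way that topics 0 and 10 were replaced by topic 18 in the report above.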