smbslt3 opened this issue 1 year ago
I am not entirely sure what is happening. Could you share your full code? Although the images are helpful, it is unclear where in the process they are generated and what the training process looks like.
Here is the code.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
import plotly.io as pio

# 1. set embedding space
sentence_model = SentenceTransformer("jhgan/ko-sroberta-multitask", device='cuda:0') # set embedding model
embeddings = sentence_model.encode(total_docs, show_progress_bar=True, batch_size=256) # build embeddings

# 2. Train BERTopic and extract hierarchical topics
umap_model = UMAP(random_state=RANDOM_STATE) # fix the random state so results are reproducible

topic_model = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer,
    # nr_topics="auto", # detect the number of topics automatically, inspect the distribution in the visualizations below, then reduce the topic count
    nr_topics=40, # cap the total number of topics
    top_n_words=10, # top n words to use per topic
    calculate_probabilities=True,
    umap_model=umap_model, # UMAP with fixed random state
)
topics, probabilities = topic_model.fit_transform(total_docs)
---
topic_model.visualize_topics() # visualize the topics
---
hierarchical_topics = topic_model.hierarchical_topics(total_docs)
fig_raw_hirachi = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
pio.write_image(fig_raw_hirachi, 'fig_raw_hirachi.png', scale=4)
fig_raw_hirachi
---
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
fig_raw_dist = topic_model.visualize_hierarchical_documents(total_docs, hierarchical_topics, reduced_embeddings=reduced_embeddings)
pio.write_image(fig_raw_dist, 'fig_raw_dist_reduced.png', scale=4)
fig_raw_dist
After running it multiple times, I figured out that topic_model.visualize_hierarchical_documents(docs, hierarchical_topics) automatically merges the two closest topics (the first two nodes of the hierarchical clustering, i.e., index 0 of hierarchical_topics). What puzzles me is that I passed the same hierarchical_topics to both topic_model.visualize_hierarchy() and topic_model.visualize_hierarchical_documents(), yet the problem only occurs in the latter.
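For reference, the pair that gets merged first can be read off directly from hierarchical_topics (a small check, assuming it is the DataFrame returned by .hierarchical_topics() with Topics, Parent_ID, and Distance columns, as used further below):
first_merge = hierarchical_topics.sort_values("Distance").iloc[0]
print(first_merge.Topics, "->", first_merge.Parent_ID) # the two closest topics and their merged parent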
Even when changing the number of topics (by setting a different nr_topics or using topic_model.merge_topics()), the problem I've mentioned above always occurs in the same way.
I am not entirely sure what is happening here, but it might be worthwhile not to set nr_topics at all and instead directly optimize the number of topics with min_topic_size. There indeed seems to be a problem with merging topics, and this should make it a bit easier.
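As a rough sketch of what that could look like (the min_topic_size value is only illustrative and would need tuning on your data):
topic_model = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer,
    umap_model=umap_model,
    min_topic_size=50, # illustrative; larger values yield fewer, larger topics
    top_n_words=10,
    calculate_probabilities=True,
)
topics, probabilities = topic_model.fit_transform(total_docs)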
1. topic_model.get_topic_info() returns topics numbered -1 through 17.
2. With topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics), the 17 topics except -1 are visualized normally. In this case, topics 0 and 10 are the two closest (first clustered) topics.
3. I used topic_model.visualize_documents(total_docs, reduced_embeddings=reduced_embeddings) to visualize without hierarchical clustering. Here, 18 topics, numbered 0 to 17, are visualized normally.
4. With topic_model.visualize_hierarchical_documents(total_docs, hierarchical_topics, reduced_embeddings=reduced_embeddings), topics 0 and 10 disappear and a new topic 18 is created. These are the first two topics to cluster in step 2.
MY Question: Is topic_model.visualize_hierarchical_documents() unable to visualize the distribution of non-clustered vectors at level 0?
By the way, what is the reason for recommending adjusting min_topic_size rather than directly specifying nr_topics?
MY Question: Is topic_model.visualize_hierarchical_documents() unable to visualize the distribution of non-clustered vectors at level 0?
It should be able to visualize them, and I am not sure why it skips over those topics, especially if you did not manually change them after fitting.
By the way, what is the reason for recommending adjusting min_topic_size rather than directly specifying nr_topics?
This might be the reason why it shows a problem. With nr_topics, the topics are reduced after they are found, which might have influenced the hierarchical topics. In contrast, min_topic_size directly influences the number of topics created without needing additional aggregation.
Whenever issues like these appear, it is preferred to minimize the additional parameters to see where the issue originates.
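For reference, the reduction that nr_topics triggers can also be run explicitly after fitting via .reduce_topics(), which makes it easier to inspect the hierarchy before and after the merge (a sketch; the value 40 mirrors the setting above):
topics, probabilities = topic_model.fit_transform(total_docs) # fit without nr_topics set
topic_model.reduce_topics(total_docs, nr_topics=40)           # reduce afterwards, once inspected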
One additional thing I noticed from your code. The UMAP model should be something like this instead:
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
The default settings are not particularly good out of the box for this kind of analysis.
Also, what is the vectorizer model that you created?
Aaah, I think I know what the issue is here! The level 0 that you created already shows a few topics merged, which explains the ones that are "missing". They are not actually missing but seem to be merged into new topics. I think you can circumvent this by setting a larger nr_levels when visualizing the topics.
That's what I just mentioned above. Here is the comparison of the two distribution visualizations. After changing the code to topic_model.visualize_hierarchical_documents(total_docs, hierarchical_topics, reduced_embeddings=reduced_embeddings, nr_levels=15), the result is still the same (topics 0 and 10 are still merged at level 0).
The vectorizer is this; I made a custom vectorizer to utilize a Korean tokenizer:
from sklearn.feature_extraction.text import CountVectorizer
# `kiwi` (a kiwipiepy Kiwi instance) and `stopwords` are defined elsewhere

class CustomTokenizer:
    def __init__(self, tagger):
        self.tagger = tagger

    def __call__(self, sent):
        # sent = sent[:1000000]
        # keep only nouns/verbs (tags starting with 'N'/'V') and foreign words ('SL'), excluding stopwords
        word_tokens = [t[0] for t in self.tagger.tokenize(sent, stopwords=stopwords)
                       if t[1][0] in 'NV' or 'SL' in t[1]]
        # word_tokens = [t[0] for t in self.tagger.tokenize(sent) if t.tag[:2] in ['NNG', 'NNP', 'VV', 'VA', 'SL']] # alternative: filter on explicit tags
        result = [word for word in word_tokens if len(word) > 1] # drop single-character tokens
        return result

custom_tokenizer = CustomTokenizer(kiwi)
# custom_tokenizer = lambda x: [w for w in x.split() if len(w) > 1]

vectorizer = CountVectorizer(tokenizer=custom_tokenizer,
                             max_features=3000, min_df=5)
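A quick sanity check of the tokenizer could look like this (the sample sentence is arbitrary; the exact output depends on the Kiwi model and stopword list):
print(custom_tokenizer("BERTopic은 한국어 문서의 토픽 모델링에도 사용할 수 있습니다"))
# should return only noun/verb/foreign-word tokens longer than one character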
Additionally, I would appreciate your advice about UMAP. Thanks.
Then it might indeed be a bug. There might be something going on with how the "levels" are selected. I remember that it was resolved in a previous version, but that does not seem to be the case. Although I do not expect it to help, it might be worthwhile to use level_scale="log" and see if that changes things when using .visualize_hierarchical_documents.
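That is, the same call as before with only the scale changed:
fig = topic_model.visualize_hierarchical_documents(
    total_docs, hierarchical_topics,
    reduced_embeddings=reduced_embeddings,
    level_scale="log",
)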
Okay, I fully understand that it's an unexpected bug. Thanks for developing and maintaining the BERTopic project. I hope this bug is not difficult to solve :)
I just looked at it again but I am not sure what is triggering the bug since I cannot reproduce it. If somebody else runs into this issue, please share!
Thanks a lot for your work on BERTopic. It is a great package; I really appreciate the modularity and the quality of the documentation. I am also running into this issue. My clustering ends up with 50 topics numbered from 0 to 49. When using .visualize_hierarchical_documents, the topics included in the parent topic 50 are merged together and don't show up in the figure. It is as if level 0 and level 1 were the same, whereas level 0 should correspond to the topics created in the pipeline.
@Virginie74 Do you by chance have a reproducible example? I am not sure where in the code it is going wrong, so making it reproducible would help in finding out what should be fixed.
I can't show you any figure, but it looks like the one above. I tried to look at the code and I think the issue appears around line 167:
for index, max_distance in enumerate(max_distances):

    # Get topics below `max_distance`
    mapping = {topic: topic for topic in df.topic.unique()}
    selection = hierarchical_topics.loc[hierarchical_topics.Distance <= max_distance, :]
    selection.Parent_ID = selection.Parent_ID.astype(int)
    selection = selection.sort_values("Parent_ID")

    for row in selection.iterrows():
        for topic in row[1].Topics:
            mapping[topic] = row[1].Parent_ID

    # Make sure the mappings are mapped 1:1
    mappings = [True for _ in mapping]
    while any(mappings):
        for i, (key, value) in enumerate(mapping.items()):
            if value in mapping.keys() and key != value:
                mapping[key] = mapping[value]
            else:
                mappings[i] = False

    # Create new column
    df[f"level_{index+1}"] = df.topic.map(mapping)
    df[f"level_{index+1}"] = df[f"level_{index+1}"].astype(int)
It is at this point that the level columns are created, but no column is created for level 0. I have not found a solution yet.
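One direction that could be explored (only a sketch; df and its topic column come from the snippet above, and whether the downstream plotting code would pick up such a column is untested) is to also store the unmerged assignment before the loop:
df["level_0"] = df.topic.astype(int) # hypothetical: expose the raw topic assignment as level 0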
I took a closer look at the code. This allowed me to realize that there was no bug as such and that it was a problem of figure interpretation. At level 0, I expected to see all the topics created by the pipeline. But, if I've understood the code correctly, at level 0 we see the first level of the hierarchy: topics whose distance is less than the first max_distances threshold are assembled into their parent topic. Did I understand it correctly?
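That reading can be checked directly: every topic whose merge distance falls below the first threshold is already replaced by its parent at the lowest plotted level (a sketch; max_distances is the list iterated over in the snippet above, assumed to be ascending):
merged_at_first_level = hierarchical_topics.loc[hierarchical_topics.Distance <= max_distances[0]]
print(merged_at_first_level[["Topics", "Parent_ID"]]) # these topics appear only via their parents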
Thanks again for the great work done on BERTopic!
Ah yes, that should indeed be correct! I believe I did try to show it first without the hierarchy by leveraging the distance structure (i.e., by setting the first distance to 1); however, that does not seem to be a proper solution.
As a result of the topic modeling, I got 40 topics, and in the topic data frame I can identify topics 10 and 13. You can also see them in the topic distance visualization, and the hierarchical topic clustering analysis still shows these two topics.
However, when I run topic_model.visualize_hierarchical_documents(), I lose random topics. Whenever I change the number of topics while adjusting the hyperparameters, I lose 1-2 topics each time. At first I suspected that the number of documents in a topic was too small to be visualized, but even topics with fewer documents than topics 10 and 13 were visualized.
I followed the steps in the following article: https://maartengr.github.io/BERTopic/api/plotting/hierarchical_documents.html#bertopic.plotting._hierarchical_documents.visualize_hierarchical_documents In that article, I don't see any missing topics.
What could be the problem and how can I fix it? I'm running BERTopic 0.15.0.