I am working with finding topics in search queries that are very short; sometimes the inputs are just single words. I tried using paraphrase-multilingual-MiniLM-L12-v2 as the embedding model and set all the other models to their default settings. I don't seem to get meaningful results, and I get a lot of numbers in the output. Is there a better way to deal with such inputs? Can you suggest or guide me on this? @MaartenGr
It depends. Could you share your full code as well as some of the output that you are getting? Also, can you share some representative documents for the topics that do not get meaningful results?
Yeah sure, I am using the cuML packages. Here is the code:
topic_model = BERTopic(embedding_model=embedding_model,
                       vectorizer_model=vectorizer_model,
                       verbose=True,
                       umap_model=cuml_model.umap(),
                       hdbscan_model=cuml_model.hdbscan(),
                       calculate_probabilities=False)
A sample of the data:
283642 swiss post solutions
78455 promotional culture is everywhere
372053 helmsley group
169729 dazn
217358 Christopher Stanton died cbbc
65490 ets 2 dlc
296677 tiny troopers global ops
373001 echo dot
401798 eigene bilder verkaufen
53848 billie jean cupcakke
4368 laura robson
285977 Nisa
274859 manchester china town
396951 attitude vs perception
393703 jorvik fm
435928 tallboy cupboard grey
423077 baumischabfall reinicken dorf
208625 how to hold a pc mouse
316395 flavita lychee
70607 королева элизавета ltnb
Name: activity, dtype: object
This is just a sample of the data; please let me know if you need anything else from my side.
Could you also share an overview of the topics that were created? Based on what I see here, I can imagine the model will struggle to find sensible topics. Many of the documents in your data do not really make sense on their own, so creating a good semantic representation will be difficult. For example, documents like "Nisa" or "jorvik fm" are exceedingly specific, and I am struggling to understand some of the documents without more context.
Also, I am not sure without seeing the full code, but it appears that you are using HDBSCAN and UMAP without the default parameters:

    umap_model=cuml_model.umap(),
    hdbscan_model=cuml_model.hdbscan(),

Instead, I would advise using the default UMAP and HDBSCAN settings:
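For instance, something along these lines (the parameter values below are an assumption that mirrors the cuML example in the BERTopic documentation and its usual UMAP/HDBSCAN defaults; treat them as a starting point rather than fixed values):

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated UMAP and HDBSCAN with settings close to BERTopic's defaults
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

# Pass the models to BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)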
Sure, this is the whole model config I used. I am not setting low_memory because I have good GPUs with plenty of RAM for inference. I just checked your comment on the default args for HDBSCAN; do you think the ones I use are not good? Can you guide me on this? (I need a way to get the frequencies of each word in the tf matrix, so I had to write a class of my own for the vectorizer.)
from typing import Tuple
from sklearn.feature_extraction.text import CountVectorizer
from cuml.manifold import UMAP
from cuml.preprocessing import normalize
from cuml.cluster import HDBSCAN


class Vectorizer_Model(CountVectorizer):
    """CountVectorizer that keeps the term-frequency matrix around so that
    per-word frequencies can be retrieved after fitting."""

    def __init__(self):
        super().__init__()
        self.tf_matrix = None

    def get_frequencies(self):
        # Sum the document-term matrix over all documents and map each
        # vocabulary term to its total count.
        output = {}
        tf_matrix = self.tf_matrix.toarray().sum(axis=0)
        vocab = self.vocabulary_
        for key, value in vocab.items():
            output[key] = tf_matrix[value]
        return output

    def fit_transform(self, documents):
        # Store the document-term matrix before handing it back.
        X = super().fit_transform(documents)
        self.tf_matrix = X
        return X
topic_model = BERTopic(embedding_model='paraphrase-multilingual-MiniLM-L12-v2',
                       vectorizer_model=vectorizer_model,
                       verbose=True,
                       umap_model=UMAP(n_components=5, n_neighbors=15, min_dist=0.0),
                       hdbscan_model=HDBSCAN(min_samples=10, gen_min_span_tree=True),
                       calculate_probabilities=False)
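To illustrate what I need from the vectorizer, this is roughly how I use get_frequencies on its own (the documents below are just made-up placeholders, not my real data):

# Standalone sketch of the custom vectorizer
vec = Vectorizer_Model()

example_docs = ["echo dot", "echo dot deals", "tallboy cupboard grey"]
vec.fit_transform(example_docs)

# Prints the total count of every vocabulary term across the documents,
# e.g. "echo" and "dot" appear twice, the other terms once.
print(vec.get_frequencies())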
And about the data: the reason you are not able to get more context is that these are search queries made by users. There is no additional context to them, but I want to understand the topics they delve into. That is the challenge for me. Do you think the data is the problem here? I can share a CSV file of all the search terms if you want. Also, since there is data in other languages, I am using the multilingual model you provide.
Sure, this is the whole model config I used. I am not setting low_memory because I have good GPUs with plenty of RAM for inference.
I am not sure whether I read the code correctly, but this seems to be quite different from what you shared before. There you initialized the UMAP model as cuml_model.umap() and the HDBSCAN model as cuml_model.hdbscan(). Was that pseudo-code you shared? If not, why is it different from what you shared above? Also, how did you create the vectorizer_model variable? Lastly, did you perform a straightforward topic_model.fit(docs) or was there any semi-supervised modeling involved? It is important to share all of the code, as the above is a bit unclear.
I just checked your comment on the default args for HDBSCAN; do you think the ones I use are not good? Can you guide me on this?
There are a few tweaks that are generally helpful, such as setting n_components higher than 2, typically between 5 and 10. Based on the code above, I would also advise using cosine similarity in UMAP due to the high dimensionality of the embeddings.
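For example, something like the following (shown here with the CPU umap-learn implementation for illustration; whether cuML's UMAP accepts the same metric argument may depend on your cuML version):

from umap import UMAP
from bertopic import BERTopic

# Reduce the embeddings to a handful of dimensions and compare them with
# cosine distance, which tends to work better for high-dimensional
# sentence embeddings than the default euclidean metric.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")

topic_model = BERTopic(embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
                       umap_model=umap_model,
                       calculate_probabilities=False,
                       verbose=True)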
And about the data: the reason you are not able to get more context is that these are search queries made by users. There is no additional context to them, but I want to understand the topics they delve into. That is the challenge for me. Do you think the data is the problem here? I can share a CSV file of all the search terms if you want. Also, since there is data in other languages, I am using the multilingual model you provide.
This is indeed quite difficult. Especially with such short queries that do not always make sense, I can imagine that the model struggles to find sensible topics, so I do think the data is the main issue here. However, some topics may be more valuable than others, so it could be worthwhile to create some sort of method or procedure for checking whether a topic is good enough, for example by requiring that it contains a sufficient number of meaningful words, thereby assuming that enough context is available. All in all, I think it is reasonable to assume that topics cannot be extracted from a sizeable part of the data and to continue the analysis with that in mind.
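As a very rough sketch of such a check (the thresholds and the definition of "good enough" below are just assumptions to adapt to your data):

# Keep only topics that are large enough and whose top words are mostly
# alphabetic, to filter out the number-heavy topics mentioned earlier.
def is_good_topic(topic_model, topic_id, min_size=50, min_alpha_words=5):
    if topic_id == -1:  # -1 is the outlier topic in BERTopic
        return False
    info = topic_model.get_topic_info()
    size = info.loc[info.Topic == topic_id, "Count"].iloc[0]
    words = [word for word, _ in topic_model.get_topic(topic_id)]
    alpha_words = [w for w in words if w.isalpha()]
    return size >= min_size and len(alpha_words) >= min_alpha_words

good_topics = [t for t in topic_model.get_topics() if is_good_topic(topic_model, t)]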
Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!