MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Model choices for short phrases (search queries) #1154

Closed gattaloukik123 closed 1 year ago

gattaloukik123 commented 1 year ago

I am working on finding topics from search queries that are very short, and sometimes the inputs are just single words. I tried using paraphrase-multilingual-MiniLM-L12-v2 as the embedding model and left all the other models at their default settings. I don't seem to get meaningful results, and I get a lot of numbers in the output. Is there a better way to deal with such inputs? Can you suggest or guide me on this? @MaartenGr
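For reference, a minimal version of the setup described above might look roughly like this (a sketch only, not the exact code used; the tiny `queries` list is a placeholder taken from the sample data shared further down):

from bertopic import BERTopic

# Placeholder sample of short search queries
queries = ["swiss post solutions", "echo dot", "laura robson"]

# Multilingual sentence-transformers model; everything else left at BERTopic's defaults
topic_model = BERTopic(embedding_model="paraphrase-multilingual-MiniLM-L12-v2", verbose=True)

topics, probs = topic_model.fit_transform(queries)
topic_model.get_topic_info()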

MaartenGr commented 1 year ago

It depends. Could you share your full code as well as some of the output that you are getting? Also, can you share some representative documents for the topics that do not get meaningful results?

gattaloukik123 commented 1 year ago

Yeah sure, I am using the cuML packages.

The code:

topic_model = BERTopic(embedding_model=embedding_model,
                       vectorizer_model=vectorizer_model,
                       verbose=True,
                       umap_model=cuml_model.umap(),
                       hdbscan_model=cuml_model.hdbscan(),
                       calculate_probabilities=False)

a sample of the data:

283642                 swiss post solutions
78455     promotional culture is everywhere
372053                       helmsley group
169729                                 dazn
217358        Christopher Stanton died cbbc
65490                             ets 2 dlc
296677             tiny troopers global ops
373001                             echo dot
401798              eigene bilder verkaufen
53848                  billie jean cupcakke
4368                           laura robson
285977                                 Nisa
274859                manchester china town
396951               attitude vs perception
393703                            jorvik fm
435928                tallboy cupboard grey
423077        baumischabfall reinicken dorf
208625               how to hold a pc mouse
316395                       flavita lychee
70607               королева элизавета ltnb
Name: activity, dtype: object

This is just a sample of the data; please let me know if you need anything else from my side.

MaartenGr commented 1 year ago

Could you also share an overview of the topics that were created? Based on what I see here, I can imagine the model will struggle to find sensible topics. Many of the documents in your data do not really make sense, so creating a good semantic representation will be difficult. For example, documents like Nisa or jorvik fm are exceedingly specific and I am struggling to understand some of the documents without more context.

Also, I cannot be sure without seeing the full code, but it appears that you are using UMAP and HDBSCAN without their default parameters:

umap_model=cuml_model.umap(),
hdbscan_model=cuml_model.hdbscan(),

Instead, I would advise using the default UMAP and HDBSCAN settings:

https://github.com/MaartenGr/BERTopic/blob/8fc11a2458d55adc5be4d4d9255baeeb485c35d5/bertopic/_bertopic.py#L210-L214

https://github.com/MaartenGr/BERTopic/blob/8fc11a2458d55adc5be4d4d9255baeeb485c35d5/bertopic/_bertopic.py#L218-L221
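For reference, the linked defaults correspond roughly to the following (a sketch using the CPU umap-learn and hdbscan packages; the cuML classes accept largely the same arguments):

from umap import UMAP
from hdbscan import HDBSCAN

# BERTopic's default dimensionality reduction
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", low_memory=False)

# BERTopic's default clustering
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)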

gattaloukik123 commented 1 year ago

Sure, this is the whole model config I used. I am not setting the low_memory flag because I do have good GPUs with plenty of RAM for inference. I just checked your comment on the default args for HDBSCAN; do you think the ones I use are not good? Can you guide me on this? (I need a way to get the frequency of each word in the tf matrix, so I had to write my own class for the vectorizer.)

from typing import Tuple
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from cuml.manifold import UMAP
from cuml.preprocessing import normalize
from cuml.cluster import HDBSCAN

class Vectorizer_Model(CountVectorizer):
    def __init__(self):
        super().__init__()
        self.tf_matrix = None

    def get_frequencies(self):
        # Map each vocabulary term to its total count across all documents
        output = {}

        tf_matrix = self.tf_matrix.toarray().sum(axis=0)
        vocab = self.vocabulary_

        for key, value in vocab.items():
            output[key] = tf_matrix[value]

        return output

    def fit_transform(self, documents):
        # Keep a reference to the term-frequency matrix so it can be inspected later
        X = super().fit_transform(documents)
        self.tf_matrix = X

        return X

topic_model = BERTopic(embedding_model='paraphrase-multilingual-MiniLM-L12-v2',
                       vectorizer_model=vectorizer_model,
                       verbose=True,
                       umap_model=UMAP(n_components=5, n_neighbors=15, min_dist=0.0),
                       hdbscan_model=HDBSCAN(min_samples=10, gen_min_span_tree=True),
                       calculate_probabilities=False)

And about the data: the reason you are not able to get more context is that these are search queries made by users. There is no additional context to them, but I want to understand the topics they fall into. This is the challenge for me. Do you think the data is the problem here? I can share a csv file of all the search terms if you want. Also, since there is data in other languages, I am using the multilingual model you provide.

MaartenGr commented 1 year ago

Sure, this is the whole model config I used. I am not setting the low_memory flag because I do have good GPUs with plenty of RAM for inference.

I am not sure whether I read the code correctly, but this seems to be quite different from what you shared before. There you initialized the UMAP model as cuml_model.umap() and the HDBSCAN model as cuml_model.hdbscan(). Was that pseudo-code you shared? If not, why is it different from what you shared above? Also, how did you create the vectorizer_model variable? Lastly, did you perform a straightforward topic_model.fit(docs) or was there any semi-supervised modeling involved? It is important to share all of the code, as the above is a bit unclear.

I just checked your comment on the default args for HDBSCAN; do you think the ones I use are not good? Can you guide me on this?

There are a few tweaks that are generally helpful, such as setting n_components higher than 2, somewhere between 5 and 10. Based on the code above, I would also advise using cosine similarity as the UMAP metric due to the high dimensionality of the embeddings.
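For example, the cuML UMAP call from the code above could be adjusted along these lines (a sketch; whether metric="cosine" is supported depends on the cuML version, and if it is not, the CPU umap.UMAP accepts the same arguments):

from cuml.manifold import UMAP

# Reduce to 5 dimensions and compare the sentence embeddings with cosine similarity
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")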

And about the data: the reason you are not able to get more context is that these are search queries made by users. There is no additional context to them, but I want to understand the topics they fall into. This is the challenge for me. Do you think the data is the problem here? I can share a csv file of all the search terms if you want. Also, since there is data in other languages, I am using the multilingual model you provide.

This is indeed quite difficult. Especially with such short queries that do not always make sense, I can imagine that the model struggles to find sensible topics. I do think the data is the issue here. However, some topics may be more valuable than others, so it could be worthwhile to create some sort of method/procedure for checking whether a topic is good enough, for example whether its documents contain a sufficient number of words, thereby assuming that enough context is available. All in all, I think it is fair to assume that topics cannot be extracted from a large part of the data and to continue the analysis with that in mind.
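A minimal sketch of one variant of that idea, filtering out queries with too few tokens before fitting (the threshold of 3 tokens and the helper name are illustrative assumptions, and `queries` and `topic_model` are reused from the sketches above):

# Keep only queries long enough to carry some context; the threshold is illustrative
def has_enough_context(query: str, min_tokens: int = 3) -> bool:
    return len(query.split()) >= min_tokens

filtered_queries = [q for q in queries if has_enough_context(q)]

# Fit on the filtered queries and accept that the discarded part of the data is not topic-able
topics, probs = topic_model.fit_transform(filtered_queries)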

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!