BERTopic n-gram words are not adjacent to each other

navidNickaan commented 11 months ago

After setting the ngram_range=(2,2), the trained BERTopic model generates topics with 2-gram phrases such as Topic_1: {"Model Router", "Network Setup", etc}, but the individual words of each 2-gram are not adjacent to each other within the document and they are far away form each other. It seems that the BERTopic model is not considering 2-gram at all. Is there any way to make sure that the individual words in the 2-gram phrases of each topic are not far away from each other within the related documents? I don't want BERTopic considers "Modem Router" as a 2-gram if there is no sentence in the whole document having "Modem" and "Router" words next to each other

MaartenGr commented 11 months ago

Could you share your full code? ngram_rang is dependent on more than just that single parameter. For example, the settings of the CountVectorizer might influence how they are being handled.

navidNickaan commented 11 months ago

I have performed some preprocessing on the documents, keeping only the 'ADJ', 'VERB', and 'NOUN' parts of each sentence while dropping the rest. Initially, each document represented a small sentence. To make the documents more informative and reduce their quantity , I combined every 5 sentences together ( reduced to approximately 3500 documents).

Here are the configurations :

cvm = CountVectorizer(stop_words="english", ngram_range=(2, 2)) umap_model = UMAP(n_neighbors=30, n_components=5, min_dist=0.05, random_state=325) ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True) representation_model = MaximalMarginalRelevance(diversity=0.2) model = BERTopic(verbose=True, top_n_words=5, umap_model=umap_model, nr_topics=5, ctfidf_model=ctfidf_model, vectorizer_model=cvm, representation_model=representation_model, min_topic_size=30 )

I have set the model to extract 5 topics, with each topic containing 5 words (in my case, five 2-grams). However, I am encountering three issues:

1- After training the model and generating topics, I noticed that when examining documents tagged with, for example, topic_1, the majority of them do not include any of the five 2-grams generated by the model. Only a few documents contain the 2-grams associated with topic_1. For example, the following document has been linked to topic_1 while none of the 2-grams are visible in the text: "menu status network address type address"

2- The model has generated topic_1 with the following 2-grams: "power cable," "internet cable," "modem router," "router modem," and "internet connection." I observed that "modem router" and "router modem" essentially represent the same concept, but with different word order. Is there any way to address and resolve this?

3- The main issue lies in the fact that documents linked to topic_1, have the individual words of 2-grams located far apart from each other. This misalignment contradicts the concept of n-grams. Here is one of the documents linked to topic_1: "light device power light internet make cable connect internet back" As you see, none of those 2-grams are visible in the text. "internet" and "cable" OR "power" and "cable" are not 2-grams Is there a possible solution or approach to address this issue?

MaartenGr commented 11 months ago

1- After training the model and generating topics, I noticed that when examining documents tagged with, for example, topic_1, the majority of them do not include any of the five 2-grams generated by the model. Only a few documents contain the 2-grams associated with topic_1. For example, the following document has been linked to topic_1 while none of the 2-grams are visible in the text: "menu status network address type address"

That is because you are performing additional preprocessing before extracting the n-grams. For example, the stopwords will be removed first in the countvectorizer before creating the ngrams.

2- The model has generated topic_1 with the following 2-grams: "power cable," "internet cable," "modem router," "router modem," and "internet connection." I observed that "modem router" and "router modem" essentially represent the same concept, but with different word order. Is there any way to address and resolve this?

You could increase de diversity parameter in MMR.

3- The main issue lies in the fact that documents linked to topic_1, have the individual words of 2-grams located far apart from each other. This misalignment contradicts the concept of n-grams. Here is one of the documents linked to topic_1: "light device power light internet make cable connect internet back" As you see, none of those 2-grams are visible in the text. "internet" and "cable" OR "power" and "cable" are not 2-grams Is there a possible solution or approach to address this issue?

If you want them to be near one another, you would have to skip preprocessing steps, like stopword removal before extracting the ngrams.

nkhajehzadeh commented 11 months ago

Thanks Maarten,

As I explained, I have done my own data pre-processing and I have dropped all the stop words before using BERTopic. I have used the list of stopwords provided by PyPi and nltk. Regarding your comment:

That is because you are performing additional preprocessing before extracting the n-grams. For example, the stopwords will be removed first in the countvectorizer before creating the ngrams.

I fail to see how excluding the stop words would resolve this issue. The only instance where I used preprocessing is in CountVectorizer(stop_words="english", ngram_range=(2, 2)). If I remove the parameter stop_words="english", there should be no change since none of the documents contain any stop words. Unless there is something I am overlooking?

Please let me know if there are any further suggestions or clarifications.

MaartenGr commented 11 months ago

What I meant was that whatever you do during the CountVectorizer step will influence how the ngrams are being handled. It can preprocess the data and therefore can change the structure of the documents. I would indeed not set stopwords in the CountVectorizer if it does not do anything. Other than that, as long as there is no additional processing in the CountVectorizer words should be adjacent.

MaartenGr / BERTopic

BERTopic n-gram words are not adjacent to each other #1671