Hello @amirmohammadkz!
You would need to clean the documents in advance. If you look here, Kitty allows you to input a list of stopwords to remove.
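Something like this (rough sketch; `documents` is a placeholder for your list of raw texts, and the `stopwords_list` keyword name is an assumption, so double-check the `train` signature of the version you have installed):

```python
# Rough sketch: pass your own stopword list when training Kitty.
# NOTE: the stopwords_list keyword is an assumption; verify it against Kitty.train.
import nltk
from nltk.corpus import stopwords
from contextualized_topic_models.models.kitty_classifier import Kitty

nltk.download("stopwords")
stop_words = stopwords.words("english")

kt = Kitty()
kt.train(documents, topics=5, stopwords_list=stop_words)  # documents: list of raw texts
```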
You can use gensim's Phrases to create the bigrams, but the bigrammed text is then going to be fed to the embedding model, and this might be suboptimal. However, it is definitely worth a try. You could also manually train a ZeroShotTM model (with your custom pre-processing) and then initialize a Kitty object with it.
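For the bigram part, a quick sketch with gensim (again, `documents` is a placeholder for your corpus):

```python
# Sketch: learn frequent bigrams with gensim and re-join tokens for downstream use.
from gensim.models.phrases import Phrases, Phraser

tokenized = [doc.split() for doc in documents]
bigram = Phraser(Phrases(tokenized, min_count=5, threshold=10))
bigrammed = [" ".join(bigram[tokens]) for tokens in tokenized]
# e.g. "get up early" -> "get_up early" once the pair is frequent enough
```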
Let me know if this helps!
Hello @vinid,
Thanks for answering my question.
You would need to clean the documents in advance. If you look here, Kitty allows you to input a list of stopwords to remove.
Oh, I see. Based on the Kitty example, since you had downloaded the NLTK stopwords there, I assumed the model uses them itself.
You can use gensim's Phrases to create the bigrams, but the bigrammed text is then going to be fed to the embedding model, and this might be suboptimal. However, it is definitely worth a try. You could also manually train a ZeroShotTM model (with your custom pre-processing) and then initialize a Kitty object with it.
So, let me ask some questions for clarification. The Kitty model uses ZeroShotTM. Correct? So the result of running the Kitty example and the ZeroShotTM example would be the same (since the same random seed is set for both). Correct?
If I use the bi-grams, the SentenceBERT models will not recognize them as known tokens. Correct?
And by manually training a ZeroShotTM with custom preprocessing, I think you are suggesting that I implement something like WhiteSpacePreprocessing. Correct?
Thanks in advance for the further clarification.
Oh, I see. Based on the Kitty example, since you had downloaded the NLTK stopwords there, I assumed the model uses them itself.
Yeah, you are right, the doc is not updated. I need to fix this. If you look at the Colab, you will see that we pip install version 2.2.0 and not 2.2.1, which is the one that has this fix.
The Kitty model uses ZeroShotTM. Correct? So the result of running the Kitty example and the ZeroShotTM example would be the same (since the same random seed is set for both). Correct?
Yep, that's right! If you open the Kitty code, you'll see it's just a wrapper over ZeroShotTM (you can build a custom Kitty with a few lines of code).
If I use the bi-grams, the SentenceBERT models will not recognize them as known tokens. Correct?
Yes, unfortunately, that's the issue. This is why I suggest the "custom" thing.
And by manually training a ZeroShotTM with custom preprocessing, I think you are suggesting that I implement something like WhiteSpacePreprocessing.
Exactly, that's the best way to implement your own custom behavior.
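Something along these lines (rough sketch; `documents`, `my_stopwords`, and the SBERT model name are placeholders to adapt to your data):

```python
# Rough sketch of the "custom" route: your own preprocessing + a manually trained ZeroShotTM.
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

def custom_preprocess(documents, stop_words):
    # your own cleaning: lowercase, drop stopwords, keep bigrammed tokens like "get_up"
    cleaned = []
    for doc in documents:
        tokens = [t for t in doc.lower().split() if t not in stop_words]
        cleaned.append(" ".join(tokens))
    return cleaned

bow_texts = custom_preprocess(documents, my_stopwords)

tp = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")
# contextual embeddings come from the raw text, the bag-of-words from your cleaned text
training_dataset = tp.fit(text_for_contextual=documents, text_for_bow=bow_texts)

ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=768, n_components=5)
ctm.fit(training_dataset)
```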
Happy to accept a pull request to address this issue if you have time to write some code for this :)
Description
I want to optimize the number of topics, and to do that, I implemented the External Word Embeddings Topic Coherence (alpha) metric that you recommended (https://github.com/MilaNLProc/contextualized-topic-models/issues/93#issuecomment-949579725) and introduced in your paper. The scores in the picture are average cosine distances, so I pick the model with the lowest score.
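Roughly, the metric does the following (simplified sketch, not my exact code; the word2vec file and variable names are placeholders):

```python
# Simplified sketch: average pairwise cosine distance between each topic's top
# words in an external word2vec space, then averaged over topics (lower = better here).
import itertools
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def avg_cosine_distance(topics, topk=5):
    """topics: list of lists of top words, one list per topic / word class."""
    per_topic = []
    for words in topics:
        words = [w for w in words[:topk] if w in wv]
        pairs = list(itertools.combinations(words, 2))
        if pairs:
            per_topic.append(np.mean([1 - wv.similarity(a, b) for a, b in pairs]))
    return float(np.mean(per_topic))
```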
Regarding the result I got from the model, I have some problems:
Is there any way to omit words such as conjunctions, helping verbs, etc., which do not give us much information about the topic, in both the training and prediction phases? I know that after the training phase I can get more than 5 words per topic and filter out the unwanted word categories, but in that case the model still distinguishes topics based on the unwanted words.
In some cases, bi-grams are needed for the result to make sense. Suppose I get "get" in one of the word classes. It might be "get up", "get on", or another phrasal verb with a completely different meaning than the solitary "get". Is there any way to resolve this issue?
What I Did
This is the topic coherence calculator I implemented for Kitty's word-class results:
And this is what I got after testing [3,5,7,9,11] topics: