Hi, thanks for using TND!
You are correct that the skew parameter is beta_1 on the inside. As you increase beta_1, less noise is removed from topics. beta_1 is a non-negative integer, and in our experiments we tested values of 0, 9, 16, 25, 36, and 49. What is happening here is that you are essentially starting each word with a topic frequency of sqrt(beta_1) when you go to decide whether it is a noise word or not. I strongly advise against setting beta_1 to 0, because it will result in many topic words erroneously being identified as noise. We used beta_1 = 25 for most of our experiments on Twitter data and other short-form text data. If you find that there is still a lot of noise when using beta_1 = 25, try moving down to 16, and so forth.
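To make the effect of the head start concrete, here is a rough sketch in Java. It is not the sampling code from mallet-tnd. The class SkewHeadStart and the method topicShare are hypothetical, and the ratio it computes is a deliberately simplified stand-in for the real noise decision. The only part taken from the explanation above is that each word starts with a topic frequency of sqrt(beta_1).

```java
// Illustrative sketch only: a simplified stand-in for the noise decision,
// not the sampling code from mallet-tnd.
class SkewHeadStart {

    // Topic-frequency head start each word receives: sqrt(beta_1).
    static double topicHeadStart(int beta1) {
        if (beta1 < 0) {
            throw new IllegalArgumentException("beta_1 must be non-negative");
        }
        return Math.sqrt(beta1);
    }

    // Hypothetical simplification: share of the evidence pointing to
    // "topic word" once the head start is added to the observed topic count.
    // Returns 0.5 when there is no evidence either way.
    static double topicShare(int beta1, int topicCount, int noiseCount) {
        double topic = topicHeadStart(beta1) + topicCount;
        double total = topic + noiseCount;
        return total == 0.0 ? 0.5 : topic / total;
    }

    public static void main(String[] args) {
        // The beta_1 values mentioned above, applied to a made-up word that
        // was seen 2 times in topics and 3 times as noise.
        int[] tested = {0, 9, 16, 25, 36, 49};
        for (int beta1 : tested) {
            System.out.printf("beta_1=%d head start=%.0f topic share=%.2f%n",
                    beta1, topicHeadStart(beta1), topicShare(beta1, 2, 3));
        }
    }
}
```

In this toy version, a larger beta_1 pushes a rarely seen word towards the topic side, which matches the advice above: less noise is removed as beta_1 grows, and beta_1 = 0 gives words no head start at all.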
You can find the paper that explains the parameters here: https://www.churchill.io/papers/topic_noise_models.pdf
The function printTopNoise is the one that gets a nice list of noise words, but you can also access the full noise distribution and the frequencies by accessing the noiseDistribution variable directly.
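If you read the raw frequencies yourself, ranking them is straightforward. The sketch below assumes that the int[] is indexed by word id and that you have a parallel array of word strings. The class NoiseRanking, the method topNoise, and the vocabulary array are hypothetical stand-ins, not part of mallet-tnd, so adapt the lookup to however your model maps ids to words.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper: ranks words by their noise frequency.
class NoiseRanking {

    // Returns the k words with the highest noise counts, highest first.
    // Assumes noiseCounts[i] is the noise frequency of vocabulary[i].
    static List<String> topNoise(int[] noiseCounts, String[] vocabulary, int k) {
        if (noiseCounts.length != vocabulary.length) {
            throw new IllegalArgumentException("counts and vocabulary must have the same length");
        }
        return IntStream.range(0, noiseCounts.length)
                .boxed()
                // Sort word ids by descending noise count; ties keep id order.
                .sorted(Comparator.comparingInt((Integer i) -> noiseCounts[i]).reversed())
                .limit(Math.max(0, k))
                .map(i -> vocabulary[i])
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up counts standing in for a noiseDistribution array.
        int[] noiseCounts = {3, 10, 0, 7};
        String[] vocabulary = {"the", "lol", "model", "haha"};
        System.out.println(topNoise(noiseCounts, vocabulary, 2));
    }
}
```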
First of all, thank you for your Topic Noise Models! I am currently trying to train TND on Windows (to be able to combine the noise distribution with existing topic models). However, I had difficulties with the Python package, probably due to Windows. That's why I tried to use the Java programme directly. To do this, I downloaded the mallet-tnd project from GitHub and imported it into my IDE. So far I can now train TND.
However, I have questions about the parameters. First of all, which parameter of mallet-tnd corresponds to beta_1 (to determine whether a word is more of a topic word or a noise word)? I have seen that there is a constructor that takes the parameter "skew" which, if I understand it correctly, is related to beta_1. However, I am not sure how high I should set "skew". My data is very noisy (chats from messenger services).
Furthermore, could you please explain to me how I get the noise distribution? Your implementation of "ParallelTopicModel" has the variable "noiseDistribution" (an int[] array). When you call "get_noise_distribution()" in Python, which variable exactly are you accessing? Is it the output of the function "printTopNoise"?
I would be very grateful for an answer! Many thanks in advance and best regards!