GU-DataLab / topic-noise-models-source

Mallet implementations of Topic-Noise Models
Apache License 2.0

Standalone use of Mallet TND #3

Closed Jenny060399 closed 1 year ago

Jenny060399 commented 1 year ago

First of all, thank you for your Topic-Noise Models! I am currently trying to train TND on Windows (so that I can combine the noise distribution with existing topic models). However, I had difficulties with the Python package, probably due to Windows, so I tried to use the Java program directly. I downloaded the mallet-tnd project from GitHub, imported it into my IDE, and can now train TND. However, I have questions about the parameters.

First, which parameter of mallet-tnd corresponds to beta_1 (which determines whether a word is more of a topic word or a noise word)? I have seen that there is a constructor that takes a "skew" parameter, which, if I understand correctly, is related to beta_1. However, I am not sure how high I should set "skew". My data is very noisy (chats from messenger services).

Furthermore, could you please explain how I obtain the noise distribution? Your implementation of "ParallelTopicModel" has the variable "noiseDistribution" (an int[] array). When "get_noise_distribution()" is called in Python, which variable exactly is being accessed? Is it the output of the function "printTopNoise"?

I would be very grateful for an answer! Many thanks in advance and best regards!

rchurch4 commented 1 year ago

Hi, thanks for using TND!

You are correct that the skew parameter is beta_1 on the inside.
As you increase beta_1, less noise is removed from topics. beta_1 is a non-negative integer; in our experiments we tested values of 0, 9, 16, 25, 36, and 49. What is happening is that each word essentially starts with a topic frequency of sqrt(beta_1) when the model decides whether it is a noise word or not. I strongly advise against setting beta_1 to 0, because that will result in many topic words being erroneously identified as noise. We used beta_1 = 25 for most of our experiments on Twitter data and other short-form text. If you find that there is still a lot of noise with beta_1 = 25, try moving down to 16, and so on. You can find the paper that explains the parameters here: https://www.churchill.io/papers/topic_noise_models.pdf
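To make the sqrt(beta_1) smoothing concrete, here is a minimal illustrative sketch (not the actual Gibbs sampling code, which works token by token; the function name and the two-count interface are hypothetical). It shows how a larger beta_1 shrinks the probability that a word is assigned to the noise distribution:

```python
import math

def noise_probability(topic_count, noise_count, beta_1):
    # Illustrative only: the word's topic frequency is smoothed by sqrt(beta_1),
    # so a larger beta_1 biases the decision toward keeping it as a topic word.
    smoothed_topic = topic_count + math.sqrt(beta_1)
    return noise_count / (noise_count + smoothed_topic)

# A word seen 2 times in topics and 10 times in noise:
print(noise_probability(2, 10, 25))  # sqrt(25) = 5 boosts topic mass: 10/17 ~ 0.59
print(noise_probability(2, 10, 0))   # no smoothing: 10/12 ~ 0.83, more likely noise
```

This is why beta_1 = 0 tends to over-assign words to noise: rare topic words get no smoothing at all.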

printTopNoise prints a nicely formatted list of the top noise words, but you can also access the full noise distribution, with frequencies, by reading the noiseDistribution variable directly.
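If it helps, here is a small sketch of how you might turn such a frequency array into a ranked noise list yourself. This assumes (my assumption, not confirmed in the thread) that noiseDistribution is index-aligned with the model's vocabulary; the helper name and the toy data are made up for illustration:

```python
def top_noise_words(vocabulary, noise_distribution, n=3):
    # Pair each word with its noise frequency and return the n most frequent,
    # roughly mirroring what a "top noise words" printout reports.
    pairs = sorted(zip(vocabulary, noise_distribution),
                   key=lambda p: p[1], reverse=True)
    return pairs[:n]

vocab = ["lol", "coffee", "election", "haha", "vote"]
noise = [120, 5, 2, 95, 1]  # made-up noise frequencies, index-aligned with vocab
print(top_noise_words(vocab, noise))  # [('lol', 120), ('haha', 95), ('coffee', 5)]
```

In the Java code you would do the analogous thing: loop over noiseDistribution and look each index up in the model's vocabulary to recover the word.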