Hctor99 closed this issue 2 years ago.
Hello! What kind of results are you getting?
This is the Colab script we used to run the experiments (it will break because the version is no longer supported, but you can take the already preprocessed 20ng files and some parameters from there).
Thank you! I'll give it a try with the parameters you used :)
Hello again!
Just as a quick update: I tried rerunning the original Colab script (adapting it to the newest version of CTM), but my results are still not good. Here is what I did: https://colab.research.google.com/drive/1pZFANqVn_Xfj7K1aeKLtuBea-h_0lv-E?usp=sharing
I ran the models for only 20 epochs, since I found that running them longer doesn't help at all. Using Palmetto, we obtained an NPMI score of 0.03, whereas LDA obtains 0.05-0.08 on the same data.
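For reference, the Palmetto scores come from something like this (a sketch assuming the palmettopy wrapper and its public Wikipedia-based endpoint; the words below are just a placeholder topic):

```python
# Sketch: query Palmetto's public service for the NPMI coherence
# of one topic's top words (assumes the palmettopy package).
from palmettopy.palmetto import Palmetto

palmetto = Palmetto()  # defaults to the public Wikipedia-based endpoint

topic_words = ["space", "nasa", "orbit", "launch", "shuttle"]  # placeholder topic
score = palmetto.get_coherence(topic_words, coherence_type="npmi")
print(score)
```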
Any help would be greatly appreciated :)
Just ran a quick experiment with 20 and 25 topics here, and the coherence more or less resembles what we got in the paper. Note that we compute the coherence on the pre-processed corpus.
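For reference, the internal coherence can be computed on the training corpus roughly like this (a minimal sketch with toy placeholders for the documents and topics; the repo's evaluation code may differ in details):

```python
# Sketch: internal NPMI coherence with gensim, computed on the same
# pre-processed corpus the topic model was trained on.
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Toy stand-ins for the real data: tokenized training documents and
# the top words of each topic (e.g. from ctm.get_topic_lists(10)).
texts = [
    ["space", "nasa", "launch", "orbit"],
    ["game", "team", "season", "player"],
    ["space", "orbit", "shuttle", "nasa"],
]
topics = [["space", "nasa", "orbit"], ["game", "team", "player"]]

dictionary = Dictionary(texts)
npmi = CoherenceModel(
    topics=topics,
    texts=texts,
    dictionary=dictionary,
    coherence="c_npmi",
    topn=3,
)
print(npmi.get_coherence())  # NPMI averaged over topics, in [-1, 1]
```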
A few follow-up questions to better understand where the mismatch might be:
Hello again, sorry for the late response! I did indeed get the same results as you when using the script, but not with Palmetto, which uses the Wikipedia corpus as its reference. Measured this way, the NPMI score for CTM is weaker than for LDA (0.028 vs. 0.051 for 100 topics). The same holds for UCI, UMass, and C_p (we do obtain similar scores for C_v and C_a). It seems that CTM doesn't perform as well as LDA on external coherence metrics. Would you have some last suggestions to improve the overall performance of CTM?
Thank you so much for taking your time to answer all my questions! :)
Hello, I think the difference in performance might be related to the different preprocessing of the training dataset and of the Wikipedia corpus used by Palmetto. According to this https://github.com/dice-group/Palmetto/issues/33, if a word is not present in Palmetto-Wikipedia's vocabulary, Palmetto returns 0 for it. My guess is that LDA returns topics whose words are not present in Wikipedia's vocabulary; since a 0 is higher than the negative NPMI a rare word pair would otherwise get (recall that NPMI ranges from -1 to 1), this can inflate the coherence value.
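For completeness, this is the standard NPMI definition behind that range, with the probabilities estimated from the reference corpus (Wikipedia, in Palmetto's case):

```latex
\mathrm{NPMI}(w_i, w_j)
  = \frac{\mathrm{PMI}(w_i, w_j)}{-\log P(w_i, w_j)}
  = \frac{\log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)}
```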
In our previous experiments, we used pre-trained embeddings to compute a word-embedding-based external coherence (see the paper here), which is much more efficient than NPMI coherence computed on an external corpus. An alternative is to use Palmetto on a Wikipedia dump pre-processed in the same way as the training dataset; I did that a few years ago, and I didn't find it easy :)
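As a rough illustration of the word-embedding-based coherence idea (a sketch, not the exact metric from the paper; the embedding model and topics below are placeholders):

```python
# Sketch: a word-embedding-based external coherence as the average
# pairwise cosine similarity of each topic's top words, using
# pre-trained embeddings instead of an external corpus.
from itertools import combinations

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # any pre-trained embedding works


def embedding_coherence(topic_words, wv):
    # Average cosine similarity over word pairs with known vectors.
    pairs = [
        wv.similarity(w1, w2)
        for w1, w2 in combinations(topic_words, 2)
        if w1 in wv and w2 in wv
    ]
    return sum(pairs) / len(pairs) if pairs else 0.0


topics = [["space", "nasa", "orbit"], ["game", "team", "player"]]  # placeholders
scores = [embedding_coherence(t, wv) for t in topics]
print(sum(scores) / len(scores))  # mean coherence across topics
```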
Hope this helps!
Silvia
Hi Silvia,
Thank you for your answer!
I did the exact same preprocessing and used the same vocabulary for both LDA and CTM. Also, since the NPMI scores are positive for both models (0.028 and 0.051), wouldn't a word missing from Wikipedia decrease rather than increase the NPMI value? I'd then expect CTM to obtain a lower NPMI if it were producing words unknown to Wikipedia. But is there any reason to expect CTM (or LDA) to produce more “rare” words, unknown to Wikipedia, if I'm using the same vocabulary?
Hello @Hctor99! :)
I think the point is which preprocessing was run on the Wikipedia corpus used by Palmetto.
I still believe that the external coherence measure based on word embeddings is faster and more direct to use.
Sounds good, thank you so much for all your help!
Description
Hello!
I've been trying to replicate your results on the 20 Newsgroups dataset, but I keep getting suboptimal results.
Here's the preprocessing I did. Following your paper, I removed punctuation, digits, and nltk stop words:
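In outline (a minimal sketch; the exact tokenizer, regex, and 20NG loading details are approximate):

```python
# Sketch of the preprocessing: lowercase, strip punctuation and digits,
# then drop nltk English stop words.
import string

import nltk
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data


def preprocess(doc):
    doc = doc.lower()
    doc = doc.translate(str.maketrans("", "", string.punctuation + string.digits))
    return " ".join(w for w in doc.split() if w not in stop_words)


preprocessed_docs = [preprocess(d) for d in docs]
```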
And the rest of the code:
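Again in outline, following the standard contextualized-topic-models workflow (the SBERT checkpoint and hyperparameters here are approximate):

```python
# Sketch of the training code with the contextualized-topic-models API.
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# BoW input: the preprocessed docs; contextual input: the raw docs.
tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v1")
training_dataset = tp.fit(text_for_contextual=docs, text_for_bow=preprocessed_docs)

ctm = CombinedTM(
    bow_size=len(tp.vocab),
    contextual_size=768,
    n_components=20,  # number of topics
    num_epochs=20,
)
ctm.fit(training_dataset)
print(ctm.get_topic_lists(10))  # top-10 words per topic
```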
In general, my results are quite low; in particular, they're lower than the ones we obtained for LDA with the same preprocessing. We also tried running the models for 100 epochs and didn't notice any difference. I also tried using bert-base-uncased instead of SBERT, but the results were low as well.
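For the encoder swap, only the model name changes (assuming TopicModelDataPreparation accepts any sentence-transformers or HuggingFace checkpoint name, which sentence-transformers loads with mean pooling):

```python
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# Hypothetical swap: a plain HuggingFace checkpoint instead of an SBERT one.
tp = TopicModelDataPreparation("bert-base-uncased")
```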
Do you have any idea of what I could be doing wrong?
Thanks so much for your help in advance! :)