Closed A11en0 closed 2 years ago
Hello! :)
A few things:
The table shows the results averaged over different numbers of topics (you are using only n_components = 20; we tested 25, 50, 75, 100, and 150).
Consider using the parameters described in the paper (e.g., the embedding model should be different).
The GoogleNews dataset comes already pre-processed.
We used this Colab notebook to compute the results: https://colab.research.google.com/drive/1a7VSmHX7q_WTVnb-Tums2rRFhmGfVt2Z?usp=sharing. It will probably break because the package is now at version 2.2.2, but you should still be able to get all the parameters from it.
Hope this helps but let me know if you need more details :)
Thanks for your quick reply!
Hi!
topics=25:  -0.014539099943703226
topics=50:   0.11776184049289495
topics=75:   0.1501614001548399
topics=100:  0.18277287376999105
topics=150:  0.1902683488799876
NPMI coherence for 25 topics was very low. You are probably going to see improvements when you increase the number of topics.
If you sum those values and divide by 5, you get approximately 0.125.
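That average can be checked with a quick stdlib-only snippet (the values are the per-topic-count NPMI scores listed above):

```python
# NPMI coherence scores from the runs above, one per topics setting
# (25, 50, 75, 100, 150).
scores = [
    -0.014539099943703226,
    0.11776184049289495,
    0.1501614001548399,
    0.18277287376999105,
    0.1902683488799876,
]

# Averaging over the five topic counts reproduces the ~0.125 figure.
average = sum(scores) / len(scores)
print(round(average, 3))  # 0.125
```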
Yes! Note that in the Colab we also use a different hidden-layer setup.
You are using WhiteSpacePreprocessing in the code you shared, which automatically applies some preprocessing. We use the already pre-processed dataset (we directly wget it from the original repository).
OK, I'm trying to use the code you provided and run it 30 times. But something strange happens: one line raises an error where these two parameters are reversed. How does it work correctly for you?
Yes, in version 2.0.0 (see here) we swapped those two parameters. You can pip install an older version, or you can swap those two arguments :)
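One way to make such a call robust regardless of the installed version is to pass both sizes as keyword arguments rather than positionally. A minimal sketch with a hypothetical stand-in function (the names below are illustrative, not the real library signature):

```python
# Hypothetical stand-in for a constructor whose first two positional
# parameters changed order between releases. These names are
# illustrative only, not the actual library API.
def make_model(bow_size, contextual_size, n_components=50):
    return {
        "bow_size": bow_size,
        "contextual_size": contextual_size,
        "n_components": n_components,
    }

# Keyword arguments bind by name, so this call keeps working even if
# the positional order of the first two parameters is swapped upstream.
model = make_model(bow_size=2000, contextual_size=768, n_components=50)
print(model["bow_size"], model["contextual_size"])  # 2000 768
```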
let me know if it does not work, I can update the colab notebook to a more recent version
Thanks, I'm running my code now and the other issues are fixed, but GPU utilization never reaches 100%, and I don't know why.
Here are my results with the number of topics set to 50, averaged over 30 runs.
Can you share the entire script you are using?
I just ran 10 iterations and the average is ~0.11 (close to the one in the paper). You can probably see the entire run in the Colab.
Happy to take a look at your code if you can share it :)
Thanks for your careful testing. I used a different data loader and preparation from yours; perhaps the problem is there. I'll check it again later. But before that, I need to build my own model and then deal with the slightly different results. Anyway, thank you very much. Great work!
Thanks a lot :) :)
let me know if you need help with the replication (I'll close the issue for now, but feel free to open a new one!)
Description
I can't reproduce the performance on the GoogleNews dataset: my NPMI score is about -0.05, but the paper 'Pre-training is a Hot Topic' reports 0.12.
What I Did
Here is my code with hyperparameters:
I set num_epochs=100, n_components=20, batch_size=256, and left the others at their defaults.
Results