MilaNLProc / contextualized-topic-models

A python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License
1.2k stars 146 forks

Can't reproduce the performance on the dataset GoogleNews #95

Closed A11en0 closed 2 years ago

A11en0 commented 2 years ago

Description

I can't reproduce the performance on the GoogleNews dataset: my test NPMI score is about -0.05, but the paper "Pre-training is a Hot Topic" reports 0.12.

What I Did

Here is my code with hyperparameters:

from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

documents = [line.strip() for line in open(data_dir, encoding="utf-8")]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v1")
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

text = [doc.split() for doc in preprocessed_documents]

ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, num_epochs=100, n_components=20, batch_size=256)
ctm.fit(training_dataset)  # train the model

topics = ctm.get_topic_lists(15)

topic_diversity, npmi_score, cv_score, umass_score, uci_score, rbo_score = \
    evaluate(topic=topics, text=text, topk=10)   # evaluate() is a function I wrote myself to score the topics

I set num_epochs=100, n_components=20, and batch_size=256, leaving everything else at its default.
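For reference, a minimal standalone NPMI implementation along the lines of the `evaluate` helper mentioned above might look like this. It is a sketch, not the scorer the paper used: it uses document-level co-occurrence probabilities and scores never-co-occurring pairs as -1 by convention, so absolute values can differ from other NPMI implementations (e.g., sliding-window ones).

```python
import math
from itertools import combinations

def npmi_coherence(topics, texts, topk=10):
    """Average NPMI over all word pairs in each topic's top-k words,
    using document-level co-occurrence probabilities."""
    docs = [set(doc) for doc in texts]
    n = len(docs)
    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w1, w2 in combinations(topic[:topk], 2):
            p1 = sum(w1 in d for d in docs) / n
            p2 = sum(w2 in d for d in docs) / n
            p12 = sum(w1 in d and w2 in d for d in docs) / n
            if p12 == 0.0:
                pair_scores.append(-1.0)  # convention: pairs that never co-occur score -1
            elif p12 == 1.0:
                pair_scores.append(1.0)   # limit case: pair appears in every document
            else:
                pair_scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
        topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores)
```

Scores range from -1 (words never co-occur) to 1 (words always co-occur), which makes differences in the never-co-occur convention a common source of discrepancies between evaluation scripts.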

Results

(screenshots of the evaluation scores)

vinid commented 2 years ago

Hello! :)

A few things:

We used this colab to compute the results: https://colab.research.google.com/drive/1a7VSmHX7q_WTVnb-Tums2rRFhmGfVt2Z?usp=sharing. It will probably break because the package is now at version 2.2.2, but you should be able to get all the parameters from it.

Hope this helps but let me know if you need more details :)

A11en0 commented 2 years ago

Thanks for your quick reply!

  1. Yes, it's the average score in the paper, but I guess it won't differ too much in a single run.
  2. I used the default settings of the code, the same as your colab code except for the embedding model; I'll try another one.
  3. I also preprocessed as you do: the preprocessed text to build the BoW and the non-preprocessed text to build the sentence embeddings.
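The two-view setup described in point 3 can be sketched as below. The helper is illustrative, not part of the contextualized-topic-models API: it just keeps the raw text (fed to the sentence embedder) and the preprocessed text (fed to the BoW) aligned pairwise, dropping documents whose preprocessed form came out empty, which is one easy way the two views can silently diverge.

```python
def align_views(raw_docs, preprocessed_docs):
    """Pair each raw document with its preprocessed counterpart,
    dropping documents whose preprocessed form is empty so the
    contextual and BoW views stay aligned. Illustrative helper only."""
    kept = [(r, p) for r, p in zip(raw_docs, preprocessed_docs) if p.strip()]
    text_for_contextual = [r for r, _ in kept]
    text_for_bow = [p for _, p in kept]
    return text_for_contextual, text_for_bow
```

The two lists can then be passed as `text_for_contextual` and `text_for_bow` to `TopicModelDataPreparation.fit`.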
vinid commented 2 years ago

Hi!

  1. Sorry, I probably wasn't clear: our average is the average of 30 runs for each n_components in [25, 50, 75, 100, 150]. If you look at the end of the colab you'll see the results for each n_components:
c 25 -0.014539099943703226
c 50 0.11776184049289495
c 75 0.1501614001548399
c 100 0.18277287376999105
c 150 0.1902683488799876

NPMI coherence for 25 topics was very low; you will probably see improvements as you increase the number of topics.

If you sum those values and divide by 5 you get something like 0.125.

  2. Yes! Note that in the colab we also use a different hidden-layer setup.

  3. You are using WhiteSpacePreprocessing in the code you shared, which automatically applies some preprocessing. We instead use the already-preprocessed dataset (we wget it directly from the original repository).
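Averaging the five per-n_components scores quoted above reproduces the reported figure; a quick check:

```python
# NPMI by number of topics, copied from the colab output above.
npmi_by_components = {
    25: -0.014539099943703226,
    50: 0.11776184049289495,
    75: 0.1501614001548399,
    100: 0.18277287376999105,
    150: 0.1902683488799876,
}

average_npmi = sum(npmi_by_components.values()) / len(npmi_by_components)
print(round(average_npmi, 4))  # → 0.1253
```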

A11en0 commented 2 years ago

OK, I'm trying to use the code you provided and run it 30 times. But something strange happened: one line raises an error. How does it work for you?

(screenshot of the error)

It looks like these two parameters are reversed.

vinid commented 2 years ago

Yes, in version 2.0.0 (see here) we swapped those two parameters. You can pip install an older version, or you can swap the two arguments :)

Let me know if it does not work; I can update the colab notebook to a more recent version.
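Since the positional order of those two constructor arguments changed in 2.0.0, passing them as keywords makes a script robust to the reordering. A sketch with a stand-in function (the real signature is `CombinedTM`'s, which depends on the installed version):

```python
# Stand-in for the constructor, for illustration only: with keyword
# arguments, the call stays correct even if the positional order of
# bow_size and contextual_size differs between package versions.
def make_ctm_config(bow_size, contextual_size, n_components=50, num_epochs=100):
    return {"bow_size": bow_size, "contextual_size": contextual_size,
            "n_components": n_components, "num_epochs": num_epochs}

cfg = make_ctm_config(bow_size=2000, contextual_size=768)
```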

A11en0 commented 2 years ago

Thanks, I'm running my code and everything else is fine now, but GPU utilization won't reach 100%, and I don't know why.

(screenshot of GPU utilization)

A11en0 commented 2 years ago

(screenshot of the scores)

My results with the topic number set to 50, over 30 runs.

vinid commented 2 years ago

Can you share the entire script you are using?

vinid commented 2 years ago

I just ran 10 iterations and the average is ~0.11 (close to the value in the paper). You can probably see the entire run in the colab.

Happy to take a look at your code if you can share it :)
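The multi-run protocol discussed in this thread can be sketched as follows; `train_and_score` is a placeholder for a callable that fits CombinedTM once (ideally with a fresh seed) and returns its NPMI score.

```python
from statistics import mean, stdev

def average_over_runs(train_and_score, n_runs=30):
    """Repeat a train-and-evaluate callable n_runs times and report the
    mean and standard deviation of the scores, mirroring the averaging
    protocol used for the paper's numbers."""
    scores = [train_and_score(seed) for seed in range(n_runs)]
    return mean(scores), stdev(scores)

# Example with a stand-in scorer (real use: fit CombinedTM, compute NPMI):
m, s = average_over_runs(lambda seed: 0.11 + 0.001 * (seed % 3), n_runs=6)
```

Reporting the standard deviation alongside the mean makes it easier to tell whether a single low run (like the -0.05 above) is an outlier or a systematic gap.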

A11en0 commented 2 years ago

Thanks for your careful testing. I used a different data loader and preparation from yours; perhaps the problem is there. I'll check it again later, but first I need to build my own model and then deal with the slightly different results. Anyway, thank you very much. Great work!

vinid commented 2 years ago

Thanks a lot :) :)

let me know if you need help with the replication (I'll close the issue for now, but feel free to open a new one!)