MilaNLProc / contextualized-topic-models

A python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License
1.2k stars 146 forks

Can't reproduce the performance on the dataset GoogleNews #95

Closed A11en0 closed 2 years ago

A11en0 commented 2 years ago

Description

I can't reproduce the performance on the GoogleNews dataset: my test NPMI score is about -0.05, but the paper "Pre-training is a Hot Topic" reports 0.12.

What I Did

Here is my code with hyperparameters:

from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

documents = [line.strip() for line in open(data_dir, encoding="utf-8")]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v1")
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

text = [doc.split() for doc in preprocessed_documents]

ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, num_epochs=100, n_components=20, batch_size=256)
ctm.fit(training_dataset)  # train the model

topics = ctm.get_topic_lists(15)

topic_diversity, npmi_score, cv_score, umass_score, uci_score, rbo_score = \
    evaluate(topic=topics, text=text, topk=10)   # evaluate() is a function I wrote myself to score the topics

I set num_epochs=100, n_components=20, and batch_size=256, leaving everything else at its default.
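For reference, a minimal standalone NPMI implementation along the lines of the `evaluate` helper mentioned above might look like this. It is a sketch, not the scorer the paper used: it uses document-level co-occurrence probabilities and scores never-co-occurring pairs as -1 by convention, so absolute values can differ from other NPMI implementations (e.g., sliding-window ones).

```python
import math
from itertools import combinations

def npmi_coherence(topics, texts, topk=10):
    """Average NPMI over all word pairs in each topic's top-k words,
    using document-level co-occurrence probabilities."""
    docs = [set(doc) for doc in texts]
    n = len(docs)
    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w1, w2 in combinations(topic[:topk], 2):
            p1 = sum(w1 in d for d in docs) / n
            p2 = sum(w2 in d for d in docs) / n
            p12 = sum(w1 in d and w2 in d for d in docs) / n
            if p12 == 0.0:
                pair_scores.append(-1.0)  # convention: pairs that never co-occur score -1
            elif p12 == 1.0:
                pair_scores.append(1.0)   # limit case: pair appears in every document
            else:
                pair_scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
        topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores)
```

Scores range from -1 (words never co-occur) to 1 (words always co-occur), which makes differences in the never-co-occur convention a common source of discrepancies between evaluation scripts.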

Results

(screenshots of the evaluation scores)

vinid commented 2 years ago

Hello! :)

A few things:

We used this colab to compute the results: https://colab.research.google.com/drive/1a7VSmHX7q_WTVnb-Tums2rRFhmGfVt2Z?usp=sharing. It will probably break because the package is now at version 2.2.2, but you should be able to get all the parameters from it.

Hope this helps but let me know if you need more details :)

A11en0 commented 2 years ago

Thanks for your quick reply!

  1. Yes, it's the average score in the paper, but I guess it won't differ too much in a single run.
  2. I used the default settings of the code, the same as your colab code except for the embedding model; I'll try another one.
  3. I also preprocessed as you do: the preprocessed text to build the BoW and the non-preprocessed text to build the sentence embeddings.
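The two-view setup described in point 3 can be sketched as below. The helper is illustrative, not part of the contextualized-topic-models API: it just keeps the raw text (fed to the sentence embedder) and the preprocessed text (fed to the BoW) aligned pairwise, dropping documents whose preprocessed form came out empty, which is one easy way the two views can silently diverge.

```python
def align_views(raw_docs, preprocessed_docs):
    """Pair each raw document with its preprocessed counterpart,
    dropping documents whose preprocessed form is empty so the
    contextual and BoW views stay aligned. Illustrative helper only."""
    kept = [(r, p) for r, p in zip(raw_docs, preprocessed_docs) if p.strip()]
    text_for_contextual = [r for r, _ in kept]
    text_for_bow = [p for _, p in kept]
    return text_for_contextual, text_for_bow
```

The two lists can then be passed as `text_for_contextual` and `text_for_bow` to `TopicModelDataPreparation.fit`.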
vinid commented 2 years ago

Hi!

  1. Sorry, I probably wasn't clear: our average is the average of 30 runs for each n_components in [25, 50, 75, 100, 150]. If you look at the end of the colab you'll see the results for each n_components:
c 25 -0.014539099943703226
c 50 0.11776184049289495
c 75 0.1501614001548399
c 100 0.18277287376999105
c 150 0.1902683488799876

NPMI coherence for 25 topics was very low; you will probably see improvements as you increase the number of topics.

If you sum those values and divide by 5 you get something like 0.125.

  2. Yes! Note that in the colab we also use a different hidden-layer setup.

  3. You are using WhiteSpacePreprocessing in the code you shared, which automatically applies some preprocessing. We instead use the already-preprocessed dataset (we wget it directly from the original repository).
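Averaging the five per-n_components scores quoted above reproduces the reported figure; a quick check:

```python
# NPMI by number of topics, copied from the colab output above.
npmi_by_components = {
    25: -0.014539099943703226,
    50: 0.11776184049289495,
    75: 0.1501614001548399,
    100: 0.18277287376999105,
    150: 0.1902683488799876,
}

average_npmi = sum(npmi_by_components.values()) / len(npmi_by_components)
print(round(average_npmi, 4))  # → 0.1253
```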

A11en0 commented 2 years ago

OK, I'm trying to use the code you provided and run it 30 times. But something strange happened: one line raises an error. How does it work for you?

(screenshot of the error)

It looks like these two parameters are reversed.

vinid commented 2 years ago

Yes, in version 2.0.0 (see here) we swapped those two parameters. You can pip install an older version, or you can swap the two arguments :)

Let me know if it does not work; I can update the colab notebook to a more recent version.
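Since the positional order of those two constructor arguments changed in 2.0.0, passing them as keywords makes a script robust to the reordering. A sketch with a stand-in function (the real signature is `CombinedTM`'s, which depends on the installed version):

```python
# Stand-in for the constructor, for illustration only: with keyword
# arguments, the call stays correct even if the positional order of
# bow_size and contextual_size differs between package versions.
def make_ctm_config(bow_size, contextual_size, n_components=50, num_epochs=100):
    return {"bow_size": bow_size, "contextual_size": contextual_size,
            "n_components": n_components, "num_epochs": num_epochs}

cfg = make_ctm_config(bow_size=2000, contextual_size=768)
```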

A11en0 commented 2 years ago

Thanks, I'm running my code and everything else is fine now, but GPU utilization won't reach 100%, and I don't know why.

(screenshot of GPU utilization)

A11en0 commented 2 years ago

(screenshot of the scores)

My results with the topic number set to 50, over 30 runs.

vinid commented 2 years ago

Can you share the entire script you are using?

vinid commented 2 years ago

I just ran 10 iterations and the average is ~0.11 (close to the value in the paper). You can probably see the entire run in the colab.

Happy to take a look at your code if you can share it :)
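The multi-run protocol discussed in this thread can be sketched as follows; `train_and_score` is a placeholder for a callable that fits CombinedTM once (ideally with a fresh seed) and returns its NPMI score.

```python
from statistics import mean, stdev

def average_over_runs(train_and_score, n_runs=30):
    """Repeat a train-and-evaluate callable n_runs times and report the
    mean and standard deviation of the scores, mirroring the averaging
    protocol used for the paper's numbers."""
    scores = [train_and_score(seed) for seed in range(n_runs)]
    return mean(scores), stdev(scores)

# Example with a stand-in scorer (real use: fit CombinedTM, compute NPMI):
m, s = average_over_runs(lambda seed: 0.11 + 0.001 * (seed % 3), n_runs=6)
```

Reporting the standard deviation alongside the mean makes it easier to tell whether a single low run (like the -0.05 above) is an outlier or a systematic gap.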

A11en0 commented 2 years ago

Thanks for your careful testing. I used a different data loader and preparation from yours; perhaps the problem is there. I'll check it again later, but first I need to build my own model and then deal with the slightly different results. Anyway, thank you very much. Great work!

vinid commented 2 years ago

Thanks a lot :) :)

let me know if you need help with the replication (I'll close the issue for now, but feel free to open a new one!)