MilaNLProc / contextualized-topic-models

A Python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License

Getting suboptimal results for 20NG #111

Closed. Hctor99 closed this issue 2 years ago.

Hctor99 commented 2 years ago

Description

Hello!

I've been trying to replicate your results using the 20 News Group dataset but I keep getting suboptimal results.

Here's the preprocessing I did. Following your paper, I removed punctuation, digits, and NLTK stop words:

import re
from nltk.corpus import stopwords as stop_words
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords

text_file = "20N_Unprocessed.txt"  # EDIT THIS WITH THE FILE YOU UPLOAD
documents = [line.strip() for line in open(text_file, encoding="ISO-8859-1")]
# Lowercase and keep only letters and spaces (drops punctuation and digits)
documents = [re.sub(r'[^a-z ]+', '', line.lower()) for line in documents]
stopwords = list(stop_words.words("english"))
sp = WhiteSpacePreprocessingStopwords(documents, stopwords_list=stopwords)
preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()

And the rest of the code:

from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# SBERT embeddings from the raw text, bag-of-words from the preprocessed text
tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v1")
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=100, batch_size=200, num_epochs=20)
ctm.fit(training_dataset)  # run the model
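For completeness, this is how I extract the topics that I then score (if I'm reading the API right, get_topic_lists returns the top-k words per topic):

# Top-10 words for each of the 100 topics; these word lists are what I feed to the coherence metrics
topics = ctm.get_topic_lists(10)
print(topics[0])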

In general, my coherence results are really low; in particular, they're lower than the ones we obtained for LDA with the same preprocessing. We also tried training for 100 epochs and didn't notice any difference. I also tried bert-base-uncased instead of SBERT, but the results were low as well.

Do you have any idea of what I could be doing wrong?

Thanks so much for your help in advance! :)

vinid commented 2 years ago

Hello! What kind of results are you getting?

This is the Colab script we used to run the experiments (it will break because the CTM version it relies on is no longer supported, but you can take the already-preprocessed 20NG files and some parameters from there).

Hctor99 commented 2 years ago

Thank you! I'll give it a try with the parameters you used :)

Hctor99 commented 2 years ago

Hello again!

Just as a quick update: I tried rerunning the original Colab script (adapting it to the newest version of CTM), but my results are still not good. Here is what I did: https://colab.research.google.com/drive/1pZFANqVn_Xfj7K1aeKLtuBea-h_0lv-E?usp=sharing

I ran the models for only 20 epochs, since I found that training for longer doesn't help at all. Using Palmetto, we obtained an NPMI score of 0.03, whereas LDA obtains 0.05-0.08 on the same data.

Any help would be greatly appreciated :)

vinid commented 2 years ago

Just ran a quick experiment with 20 and 25 topics here, and the coherence more or less resembles what we got in the paper. Note that we compute the coherence on the pre-processed corpus.
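In case it helps, this is roughly how we score it (a sketch; I'm assuming the current CoherenceNPMI interface and that preprocessed_documents is the list from your preprocessing step):

from contextualized_topic_models.evaluation.measures import CoherenceNPMI

# NPMI must be computed on the tokenized *pre-processed* corpus,
# i.e., the same vocabulary the bag-of-words side was trained on
texts = [doc.split() for doc in preprocessed_documents]

npmi = CoherenceNPMI(texts=texts, topics=ctm.get_topic_lists(10))
print(npmi.score())  # average NPMI over all topics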

A few follow-up questions to better understand where the mismatch might be:

Hctor99 commented 2 years ago

Hello again, sorry for the late response! I did indeed get the same results as you when using the script! But not with Palmetto, which computes coherence against the Wikipedia corpus. Evaluated that way, the NPMI score for CTM is weaker than LDA's (0.028 vs. 0.051 for 100 topics), and the same holds for UCI, UMass, and C_P (we do obtain similar scores for C_V and C_A). It seems that CTM doesn't perform as well as LDA on external coherence metrics. Would you have some last suggestions to improve the overall performance of CTM?

Thank you so much for taking your time to answer all my questions! :)

silviatti commented 2 years ago

Hello, I think the difference in performance might be related to the different preprocessing applied to the training dataset and to the Wikipedia corpus used by Palmetto. According to https://github.com/dice-group/Palmetto/issues/33, if a word is not present in Palmetto-Wikipedia's vocabulary, Palmetto returns 0 for it. My guess is that LDA returns topics containing words that are not present in Wikipedia's vocabulary, and the zeros reported for those pairs increase the average NPMI coherence (recall that NPMI ranges from -1 to 1, so a reported 0 can replace what would otherwise have been a negative score).
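To make the arithmetic concrete, here is a toy example with made-up numbers:

# Made-up per-pair NPMI values for one topic, just to illustrate the effect
in_vocab_pairs = [-0.15, -0.05, 0.02, 0.04]   # pairs Palmetto can actually score
oov_pairs_true = [-0.30, -0.20]               # what the OOV pairs might "really" be worth
oov_pairs_reported = [0.0, 0.0]               # what Palmetto reports for them instead

reported = sum(in_vocab_pairs + oov_pairs_reported) / 6   # about -0.023
true = sum(in_vocab_pairs + oov_pairs_true) / 6           # about -0.107

print(reported > true)  # True: zeroing OOV pairs inflates the average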

In our previous experiments, we used pre-trained embeddings to compute a word-embedding-based external coherence (see the paper here), which is definitely more efficient than NPMI coherence computed on an external corpus. An alternative is to use Palmetto on a Wikipedia dump pre-processed in the same way as the training dataset; I did that a few years ago, and I didn't find it easy :)
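If you want to try the word-embedding route, here is a minimal sketch of the idea (not our exact evaluation code; I'm using gensim's downloader with a generic word2vec model, and the metric is just the average pairwise cosine similarity among each topic's top words):

import itertools
import numpy as np
import gensim.downloader as api

# Any pre-trained embeddings work; this one downloads ~1.6 GB on first use
wv = api.load("word2vec-google-news-300")

def embedding_coherence(topics, topk=10):
    """Average pairwise cosine similarity of each topic's top-k words."""
    scores = []
    for topic in topics:
        words = [w for w in topic[:topk] if w in wv]  # skip OOV words
        sims = [wv.similarity(w1, w2) for w1, w2 in itertools.combinations(words, 2)]
        if sims:
            scores.append(np.mean(sims))
    return float(np.mean(scores))

print(embedding_coherence(ctm.get_topic_lists(10)))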

Hope this helps!

Silvia

Hctor99 commented 2 years ago

Hi Silvia,

Thank you for your answer!

I did the exact same preprocessing and used the same vocabulary for both LDA and CTM. Also, since the NPMI scores are positive for both models (0.028 and 0.051), wouldn't it be the case that a word not in Wikipedia would decrease rather than increase the value of NPMI? Then I'd imagine that CTM would obtain a lower NPMI if it were producing words unknown to Wikipedia. But is there any reason to expect CTM (or LDA) to produce more "rare" words, unknown to Wikipedia, if I'm using the same vocabulary?

vinid commented 2 years ago

Hello @Hctor99! :)

I think the point was about which preprocessing was run on the Wikipedia corpus used by Palmetto.

I still believe that the external coherence measure computed with word embeddings is faster and more direct to use.

Hctor99 commented 2 years ago

Sounds good, thank you so much for all your help!