MaartenGr / BERTopic_evaluation

Code and experiments for *BERTopic: Neural topic modeling with a class-based TF-IDF procedure*
MIT License

replacing new line characters #10

Open ptear opened 1 year ago

ptear commented 1 year ago

Hi Maarten,

I was just wondering why a different procedure is followed for replacing \n characters in the UN dataset versus the Trump dataset: https://github.com/MaartenGr/BERTopic_evaluation/blob/main/evaluation/data.py#L227.

I guess it has something to do with the longer length of the UN documents, which come from debates as opposed to short-form tweets. But what benefit does indicating new paragraphs with \p have compared to just a space?
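To make the two procedures being asked about concrete, here is a toy sketch (not the repository's actual code; the "\p" paragraph marker and function names are illustrative) contrasting collapsing every newline to a space against first tagging paragraph breaks:

```python
import re


def clean_tweets(doc: str) -> str:
    # Short documents: collapse all newlines/whitespace to single spaces.
    return re.sub(r"\s+", " ", doc).strip()


def clean_speeches(doc: str) -> str:
    # Long documents: mark paragraph boundaries with a "\p" token before
    # collapsing the remaining whitespace, so paragraph structure survives.
    doc = doc.replace("\n\n", " \\p ")  # hypothetical paragraph marker
    return re.sub(r"\s+", " ", doc).strip()


speech = "Paragraph one.\n\nParagraph two.\nStill paragraph two."
print(clean_tweets(speech))    # paragraph boundary is lost
print(clean_speeches(speech))  # paragraph boundary is kept as \p
```

The difference only matters downstream if something (e.g. a sentence/paragraph splitter feeding an embedding model) later uses the marker; with a plain space, that information is unrecoverable.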

Thanks for your efforts on BERTopic.

MaartenGr commented 1 year ago

It has been a while since I created that specific code, but I remember there were issues parsing that specific dataset, which required removing the \n characters. It might also be related to the length of the documents, since sentence-transformers was used as the backend here.

I should note, though, that BERTopic has improved considerably since this was written. Using BERTopic together with MMR, KeyBERTInspired, or PartOfSpeech generally improves coherence scores quite a bit. So if you are looking to reproduce the results, it might be interesting to see what happens when you use one or more of those representation models.

Using a generative LLM is especially interesting/fun, but that does not allow for evaluation with coherence-like measures.
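For readers unfamiliar with the MMR representation mentioned above: maximal marginal relevance greedily picks topic words by trading off relevance to the topic against similarity to words already chosen. This is a toy sketch with made-up relevance/similarity scores, not BERTopic's actual implementation:

```python
def mmr(candidates, relevance, similarity, top_n=3, diversity=0.5):
    # Start with the single most relevant candidate word.
    selected = [max(candidates, key=lambda w: relevance[w])]
    remaining = [w for w in candidates if w != selected[0]]
    while remaining and len(selected) < top_n:
        # Relevance pulls a word in; similarity to already-picked words
        # pushes it out, weighted by the diversity parameter.
        best = max(
            remaining,
            key=lambda w: (1 - diversity) * relevance[w]
            - diversity * max(similarity(w, s) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "car" and "automobile" are near-duplicates, so MMR should
# prefer the less relevant but more diverse "price" over "automobile".
relevance = {"car": 0.9, "automobile": 0.85, "vehicle": 0.8, "price": 0.4}
pairs = {
    frozenset(("car", "automobile")): 0.95,
    frozenset(("car", "vehicle")): 0.80,
    frozenset(("automobile", "vehicle")): 0.85,
    frozenset(("car", "price")): 0.20,
    frozenset(("automobile", "price")): 0.20,
    frozenset(("vehicle", "price")): 0.30,
}
similarity = lambda a, b: pairs[frozenset((a, b))]

print(mmr(list(relevance), relevance, similarity))
# → ['car', 'price', 'vehicle']
```

In BERTopic itself you would not write this by hand; the representation models are plugged in via the `representation_model` argument when constructing the model.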