MaartenGr / BERTopic_evaluation

Code and experiments for *BERTopic: Neural topic modeling with a class-based TF-IDF procedure*
MIT License

replacing new line characters #10

Open ptear opened 1 year ago

ptear commented 1 year ago

Hi Maarten,

I was just wondering why a different procedure is followed for replacing \n characters in the UN dataset versus the Trump dataset: https://github.com/MaartenGr/BERTopic_evaluation/blob/main/evaluation/data.py#L227.

I guess it has something to do with the longer length of the UN documents, which come from debates as opposed to short-form tweets. But what benefit does indicating new paragraphs with \p have compared to just a space?
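To make the two procedures being asked about concrete, here is a toy sketch (not the repository's actual code; the "\p" paragraph marker and function names are illustrative) contrasting collapsing every newline to a space against first tagging paragraph breaks:

```python
import re


def clean_tweets(doc: str) -> str:
    # Short documents: collapse all newlines/whitespace to single spaces.
    return re.sub(r"\s+", " ", doc).strip()


def clean_speeches(doc: str) -> str:
    # Long documents: mark paragraph boundaries with a "\p" token before
    # collapsing the remaining whitespace, so paragraph structure survives.
    doc = doc.replace("\n\n", " \\p ")  # hypothetical paragraph marker
    return re.sub(r"\s+", " ", doc).strip()


speech = "Paragraph one.\n\nParagraph two.\nStill paragraph two."
print(clean_tweets(speech))    # paragraph boundary is lost
print(clean_speeches(speech))  # paragraph boundary is kept as \p
```

The difference only matters downstream if something (e.g. a sentence/paragraph splitter feeding an embedding model) later uses the marker; with a plain space, that information is unrecoverable.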

Thanks for your efforts on BERTopic.

MaartenGr commented 1 year ago

It has been a while since I created that specific code, but I remember there were issues parsing that specific dataset, which required removing the \n characters. It might also be related to the length of the documents, since sentence-transformers was used as the backend here.

I should note, though, that BERTopic has improved considerably since this was written. Using BERTopic together with MMR, KeyBERTInspired, or PartOfSpeech generally improves coherence scores quite a bit. So if you are looking to reproduce the results, it might be interesting to see what happens when you use one or more of those representation models.

Using a generative LLM is especially interesting/fun, but that does not allow for evaluation with coherence-like measures.
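For readers unfamiliar with the MMR representation mentioned above: maximal marginal relevance greedily picks topic words by trading off relevance to the topic against similarity to words already chosen. This is a toy sketch with made-up relevance/similarity scores, not BERTopic's actual implementation:

```python
def mmr(candidates, relevance, similarity, top_n=3, diversity=0.5):
    # Start with the single most relevant candidate word.
    selected = [max(candidates, key=lambda w: relevance[w])]
    remaining = [w for w in candidates if w != selected[0]]
    while remaining and len(selected) < top_n:
        # Relevance pulls a word in; similarity to already-picked words
        # pushes it out, weighted by the diversity parameter.
        best = max(
            remaining,
            key=lambda w: (1 - diversity) * relevance[w]
            - diversity * max(similarity(w, s) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "car" and "automobile" are near-duplicates, so MMR should
# prefer the less relevant but more diverse "price" over "automobile".
relevance = {"car": 0.9, "automobile": 0.85, "vehicle": 0.8, "price": 0.4}
pairs = {
    frozenset(("car", "automobile")): 0.95,
    frozenset(("car", "vehicle")): 0.80,
    frozenset(("automobile", "vehicle")): 0.85,
    frozenset(("car", "price")): 0.20,
    frozenset(("automobile", "price")): 0.20,
    frozenset(("vehicle", "price")): 0.30,
}
similarity = lambda a, b: pairs[frozenset((a, b))]

print(mmr(list(relevance), relevance, similarity))
# → ['car', 'price', 'vehicle']
```

In BERTopic itself you would not write this by hand; the representation models are plugged in via the `representation_model` argument when constructing the model.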