MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.19k stars, 765 forks

Fix regex matching being used in PartOfSpeech representation model #2138

Closed woranov closed 2 months ago

woranov commented 2 months ago

What does this PR do?

By default, pandas' `str.contains` interprets its argument as a regex pattern (`regex=True`). This causes the PartOfSpeech representation model to raise an error when a topic contains text that is not a valid regex, e.g. unbalanced parentheses. This PR fixes that issue.
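A minimal sketch of the failure mode in isolation, using plain pandas rather than BERTopic's actual code. The series contents and the `keyword` value are made up for illustration, and both workarounds shown are general pandas/`re` options; the thread does not state which one the PR applies.

```python
import re

import pandas as pd

# Hypothetical documents and keyword; object dtype keeps pandas on the
# Python `re` code path.
docs = pd.Series(
    ["the model (bert is fast", "topics are interpretable", "c-tf-idf weights"],
    dtype=object,
)
keyword = "(bert"  # unbalanced parenthesis, so not a valid regex

# Default behaviour: the keyword is compiled as a regex and fails.
try:
    docs.str.contains(keyword)
    raised = False
except Exception:  # re.error on the object-dtype path
    raised = True

# Option 1: treat the keyword as a literal substring.
literal_mask = docs.str.contains(keyword, regex=False)

# Option 2: keep regex matching but escape the keyword first.
escaped_mask = docs.str.contains(re.escape(keyword))
```

Both options match the keyword literally; `regex=False` is the simpler choice when no regex features are needed.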

Fixes #2153.

MaartenGr commented 2 months ago

Thank you for the PR. I didn't see an open issue attached, which is generally the procedure when opening a PR. Could you add one?

With respect to the suggested change, do you have an example of what it fixes? I'm not sure I understand in which specific scenarios the issue would happen.

woranov commented 2 months ago

Apologies! Indeed there are some preconditions to be met for the error to arise.

Opened an issue with a reproduction example in #2153.

MaartenGr commented 2 months ago

Thanks for the update! This seems good to me, let's merge 😄