MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Issue when using n_gram_range other than (1,1) #5

Closed: ColinFerguson closed this issue 3 years ago

ColinFerguson commented 3 years ago

Hi, really nice work with this package, it's very useful.

Model initialization takes the argument n_gram_range, but I don't think it gets used. Should line 241, referenced here, be count = CountVectorizer(ngram_range=n_gram_range, stop_words="english").fit(documents)?

https://github.com/MaartenGr/BERTopic/blob/9f7dca1103e1935f7a2779d1fa9e89db072c0c8a/bertopic/model.py#L241

It might be nice to make the stop_words argument configurable at initialization as well, so that the user can pass a corpus-specific set of stop words.
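To illustrate the proposal, here is a minimal sketch using scikit-learn's CountVectorizer directly. The helper name build_vocabulary and the sample documents are hypothetical and are not taken from the BERTopic code base; the sketch only shows how forwarding n_gram_range and stop_words changes the fitted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer


def build_vocabulary(documents, n_gram_range=(1, 1), stop_words="english"):
    """Hypothetical helper: fit a CountVectorizer, forwarding both settings.

    This is an illustration of the suggested fix, not BERTopic's actual code.
    """
    return CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit(documents)


# Made-up sample documents, for illustration only.
documents = [
    "topic models group similar documents",
    "topic models summarize large collections of documents",
]

# Default behaviour: unigrams only, built-in English stop words.
unigrams = build_vocabulary(documents)

# Forwarding n_gram_range=(1, 2) adds bigrams such as "topic models".
bigrams = build_vocabulary(documents, n_gram_range=(1, 2))

# A corpus-specific stop word list replaces the built-in English one.
custom = build_vocabulary(documents, stop_words=["topic", "models"])

print(sorted(unigrams.vocabulary_))
print(sorted(bigrams.vocabulary_))
print(sorted(custom.vocabulary_))
```

Note that CountVectorizer removes stop words before it builds n-grams, so a custom stop word list also changes which bigrams can appear.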

MaartenGr commented 3 years ago

You are correct! It was a stupid oversight on my part not to actually use n_gram_range. The same goes for the stop words.

MaartenGr commented 3 years ago

Master has the most up-to-date version. PyPI was updated to 0.2.3 to include the changes you proposed. Let me know if you find any other issues!

ColinFerguson commented 3 years ago

Great, thank you so much, @MaartenGr!