In a recent version of scikit-learn (v1.3, I believe), HDBSCAN was implemented with base functionality. Since scikit-learn is already a requirement of BERTopic, it makes sense to use that implementation instead of the original `hdbscan` package, as scikit-learn has more contributors. Moreover, this might alleviate the common installation issues related to HDBSCAN.
There are a couple of issues worth mentioning:
- Calculation of probabilities is, if I'm not mistaken, not implemented in scikit-learn's HDBSCAN
  - A solution would be to use cosine similarities as the default method of calculating probabilities
- The feature set is smaller than that of the original implementation
- Speed needs to be tested to determine whether the switch is worth it
- Accuracy, whatever that means in this context, might also need some exploration
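The cosine-similarity idea from the list above could look roughly like the sketch below. It is only one possible approach under my own assumptions: the function name `cosine_topic_probabilities` is hypothetical, cluster centroids are taken as the mean embedding per cluster, and the resulting scores are normalised similarities rather than calibrated probabilities.

```python
import numpy as np


def cosine_topic_probabilities(embeddings, labels):
    """Score every document against every cluster centroid.

    Hypothetical helper: returns (topic_ids, probs), where probs[i, j] is the
    cosine similarity between document i and the centroid of topic_ids[j],
    clipped at zero and normalised so that each row sums to 1.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # -1 marks outliers, which do not get a centroid of their own.
    topic_ids = np.unique(labels[labels != -1])
    if topic_ids.size == 0:
        return topic_ids, np.zeros((len(embeddings), 0))

    # Assumption: a cluster is represented by the mean of its members.
    centroids = np.vstack(
        [embeddings[labels == t].mean(axis=0) for t in topic_ids]
    )

    def _normalise(x):
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.clip(norms, 1e-12, None)

    # Cosine similarity is the dot product of L2-normalised vectors.
    sims = _normalise(embeddings) @ _normalise(centroids).T
    sims = np.clip(sims, 0.0, None)  # treat negative similarity as no affinity

    row_sums = sims.sum(axis=1, keepdims=True)
    probs = np.divide(sims, row_sums, out=np.zeros_like(sims), where=row_sums > 0)
    return topic_ids, probs


# Tiny usage example with hand-made 2-D "embeddings".
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
doc_labels = [0, 0, 1, 1, -1]
topic_ids, probs = cosine_topic_probabilities(docs, doc_labels)
print(topic_ids)
print(probs.round(3))
```

A nice side effect of this approach is that outliers (label -1) still receive a score for every topic, since only the centroids depend on the cluster assignments.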
For those reading this, I'm interested to hear what you all think about this suggested change!