Open ronirg opened 1 year ago
I had this question too. I think topic 0 is noise, but I'm not entirely sure. Maybe @ddangelov could weigh in. If you look closely, there are lots of other clusters that could also be categorized as "noise" based on their top words. In my pipeline I check each document against the top 5 words in topic_words for its topic: if it contains fewer than 2 of those words and its confidence is below 0.4, I call it an outlier. Then I look at the proportion of outliers in each cluster, and if a cluster is mostly outliers I call it a noise cluster. That works for my data. It might not work for yours.
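For anyone who wants to try the heuristic described above, here is a rough sketch in plain Python. It is not taken from the commenter's pipeline or from the library: the function names, the input layout (tokenized documents, one topic id and one confidence score per document, and a mapping from topic id to ranked words), and the reading of "mostly outliers" as "more than half" are all assumptions made for illustration.

```python
from collections import defaultdict


def flag_outliers(doc_tokens, doc_topics, doc_scores, topic_words,
                  top_n=5, min_hits=2, min_score=0.4):
    """Return one boolean per document: True if it looks like an outlier.

    A document is flagged when it contains fewer than `min_hits` of the
    top `top_n` words of its assigned topic AND its confidence score is
    below `min_score`. All names and defaults here are hypothetical.
    """
    flags = []
    for tokens, topic, score in zip(doc_tokens, doc_topics, doc_scores):
        top = set(topic_words[topic][:top_n])
        hits = len(top & set(tokens))
        flags.append(hits < min_hits and score < min_score)
    return flags


def noise_topics(doc_topics, flags, majority=0.5):
    """Return the sorted topic ids whose outlier share exceeds `majority`.

    "Mostly outliers" is interpreted as more than half; that cutoff is
    an assumption, not something stated in the thread.
    """
    totals = defaultdict(int)
    outliers = defaultdict(int)
    for topic, is_outlier in zip(doc_topics, flags):
        totals[topic] += 1
        outliers[topic] += int(is_outlier)
    return sorted(t for t in totals if outliers[t] / totals[t] > majority)
```

The thresholds are exposed as parameters because, as noted above, values that work for one dataset might not work for another.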
Hi, according to the paper, HDBSCAN assigns a label to each dense cluster of document vectors and assigns a noise label to all document vectors that are not in a dense cluster.
If a document is assigned the noise label, will it be in Topic -1 or Topic 0? I cannot find this in the documentation, and I don't get a Topic -1 in my experiments.
Thanks