ddangelov / Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.
BSD 3-Clause "New" or "Revised" License
2.94k stars 373 forks

Noise topic #346

Open ronirg opened 1 year ago

ronirg commented 1 year ago

Hi. According to the paper: HDBSCAN assigns a label to each dense cluster of document vectors and assigns a noise label to all document vectors that are not in a dense cluster.

If a document is assigned the noise label, will it be in Topic -1 or Topic 0? I cannot find this in the documentation, and I don't get a Topic -1 in my experiments.

Thanks

jacob-bayer commented 10 months ago

I had this question too. I think that topic 0 is noise, but I'm not entirely sure. Maybe @ddangelov could weigh in. I've found that if you look closely, there are lots of other clusters that could be categorized as "noise" as well, based on the top words. In my pipeline I check each document against the top 5 words in topic_words for its topic: if it contains fewer than 2 of those 5 words and its confidence is below 0.4, I call it an outlier. Then I look at the proportion of outliers in each cluster, and if it's mostly outliers I call it a noise cluster. That works for my data. It might not work for yours.
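
For anyone who wants to try the heuristic described above, here is a rough sketch in plain Python. It is not part of the Top2Vec API. The function names (`flag_outliers`, `noise_topics`), the whitespace tokenization, the toy data, and the reading of "mostly outliers" as more than 50% are all assumptions made for illustration. The per-document topic assignments and confidence scores are passed in as plain lists, however you obtain them from your model.

```python
from collections import defaultdict


def flag_outliers(docs, doc_topics, doc_scores, topic_words,
                  top_n=5, min_hits=2, min_score=0.4):
    """Return one bool per document: True if it looks like an outlier.

    A document is flagged when it contains fewer than `min_hits` of the
    top `top_n` words of its assigned topic AND its confidence score is
    below `min_score`. Tokenization is a naive lowercase whitespace split
    (an assumption; swap in your own tokenizer).
    """
    flags = []
    for text, topic, score in zip(docs, doc_topics, doc_scores):
        tokens = set(text.lower().split())
        hits = sum(1 for word in topic_words[topic][:top_n] if word in tokens)
        flags.append(hits < min_hits and score < min_score)
    return flags


def noise_topics(doc_topics, flags, threshold=0.5):
    """Return the set of topics whose share of outliers exceeds `threshold`."""
    totals = defaultdict(int)
    outliers = defaultdict(int)
    for topic, is_outlier in zip(doc_topics, flags):
        totals[topic] += 1
        outliers[topic] += int(is_outlier)
    return {t for t in totals if outliers[t] / totals[t] > threshold}


# Toy data, made up for illustration only.
docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stock market fell",
    "hello there",
    "random text",
]
doc_topics = [0, 0, 1, 1, 1]             # topic assigned to each document
doc_scores = [0.8, 0.7, 0.3, 0.2, 0.1]   # confidence of each assignment
topic_words = {
    0: ["cats", "dogs", "pets", "animals", "vet"],
    1: ["stock", "market", "shares", "trading", "bank"],
}

flags = flag_outliers(docs, doc_topics, doc_scores, topic_words)
print(flags)                            # [False, False, False, True, True]
print(noise_topics(doc_topics, flags))  # {1}
```

In the toy data, the third document has a low score but still contains 2 of its topic's top words, so it is not flagged. Both conditions have to hold. The thresholds (2 of 5 words, 0.4 confidence, 50% of the cluster) will likely need tuning per dataset.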