UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Clustering similar sentences #1386

Open saurabh0512 opened 2 years ago

saurabh0512 commented 2 years ago

Hi, I have a problem that requires grouping similar sentences together. Can I use clustering algorithms such as DBSCAN or HDBSCAN to cluster the embeddings?

nreimers commented 2 years ago

Yes, DBSCAN and HDBSCAN can be used.

Also have a look at: https://www.sbert.net/examples/applications/clustering/README.html
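
To make the idea concrete, here is a minimal sketch of DBSCAN run on cosine distance. It is not taken from the linked examples. The vectors are made-up 3-dimensional stand-ins for sentence embeddings, the `dbscan` helper is a hand-written illustration, and the `eps` and `min_samples` values are assumptions chosen to suit this toy data. In practice you would probably obtain the embeddings from `SentenceTransformer.encode` and pass them to a library implementation such as `sklearn.cluster.DBSCAN` with `metric="cosine"`.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(vectors, eps, min_samples):
    """Label each vector with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(vectors)
    # Each neighbourhood includes the point itself, as in the usual definition.
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # only core points expand the cluster
    return labels


# Toy stand-ins for sentence embeddings (NOT real model output).
embeddings = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.00, 0.90, 0.40],
    [0.10, 0.80, 0.50],
    [0.30, 0.30, 0.90],
]

labels = dbscan(embeddings, eps=0.1, min_samples=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

The last vector has no neighbour within `eps`, so it is labelled `-1` (noise). With real embeddings, `eps` usually needs tuning per model and dataset.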

saurabh0512 commented 2 years ago

Hi, thanks for the quick reply. I have two follow-up questions. Since the absolute value of a cosine similarity score cannot tell us anything about how similar two sentences are, how can we justify using clustering algorithms? Also, how can we bring the embeddings of lexically diverse sentences closer together? Right now, the clustering mostly groups sentences that have token overlap.

julianStreibel commented 2 years ago

Maybe take a look at https://www.sbert.net/docs/quickstart.html#comparing-sentence-similarities and https://www.sbert.net/examples/training/sts/README.html.
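
As a rough sketch of what comparing sentence similarities looks like, the snippet below computes the cosine similarity for every pair of sentences and ranks the pairs by score, so each score is read relative to the others. The sentences are hypothetical and the vectors are made-up placeholders, not real model output. With the library you would get the vectors from `SentenceTransformer.encode` instead.

```python
import math
from itertools import combinations


def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical sentences; the first two are paraphrases with little token overlap.
sentences = [
    "The cat sits outside",
    "A feline is resting in the yard",
    "The new movie is great",
]

# Placeholder vectors standing in for embeddings (assumed values, for illustration).
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.8, 0.5],
]

# Score every unordered pair and sort from most to least similar.
pairs = sorted(
    (
        (cos_sim(embeddings[i], embeddings[j]), i, j)
        for i, j in combinations(range(len(sentences)), 2)
    ),
    reverse=True,
)

for score, i, j in pairs:
    print(f"{score:.3f}  {sentences[i]!r} <-> {sentences[j]!r}")
```

Whether a real model places lexically different paraphrases this close together depends on how it was trained, which is what the second link (training on STS data) covers.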