UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Improving the clustering example by using normalized unit vectors. #320

Open PhilipMay opened 4 years ago

PhilipMay commented 4 years ago

I think the clustering example could be improved by normalizing the embeddings to unit vectors before doing the clustering. This is because you use the cosine distance for comparison and not the Euclidean distance.

What do you think?
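
For illustration, a minimal sketch of the suggested change, using sklearn's normalize to turn the embeddings into unit vectors before clustering (the model name, corpus, and cluster count are placeholders, not taken from the actual example):

```python
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A cheetah is running behind its prey.",
]

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
embeddings = model.encode(corpus)

# L2-normalize so every embedding has unit length; squared Euclidean distance on
# the unit sphere is then a monotonic function of cosine distance.
unit_embeddings = normalize(embeddings, norm="l2")

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(unit_embeddings)
print(labels)
```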

nreimers commented 4 years ago

Hi @PhilipMay, the examples are just examples, so for a real clustering application it might make sense to try different approaches.

If cosine similarity is used for clustering, I think you will not see a difference if you normalize vectors to unit length. You have cos_sim(u,v) == cos_sim(u / ||u||, v / ||v||)

Best Nils Reimers
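
As a quick sanity check of that invariance (a sketch, not Nils's code):

```python
import numpy as np

def cos_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
u, v = rng.normal(size=768), rng.normal(size=768)

# Rescaling either vector (e.g. normalizing to unit length) leaves cosine similarity unchanged.
assert np.isclose(cos_sim(u, v), cos_sim(u / np.linalg.norm(u), v / np.linalg.norm(v)))
```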

lefnire commented 4 years ago

Correct me if I'm wrong, but isn't KMeans a bad algo here altogether? KMeans' metric is Euclidean, and Euclidean distance performs poorly in high dimensions. One way to improve this is to run UMAP(embeddings) first to reduce dimensions while trying to preserve topology. But even then, these use cases lend themselves better to cosine (especially in the case of *-nli-stsb-trained models), so I'm not sure the topology can be preserved for a downstream Euclidean metric? So it seems what you really want here is cosine-based clustering.

I realize applications/clustering.py is a toy example, with no intention of optimizing; but I'm wondering if it just won't work whatsoever and will lead to frustrated copy-pasta users, and might be worth replacing (keeping it minimal) with something like KMedoids(metric='cosine') (sklearn_extra/kmedoids). Even just a two-line change in that file: the import statement, and kmeans => kmedoids(metric='cosine') (leave it to the users to figure out installing the package).
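
Roughly, the swap being proposed (a sketch with a random stand-in for the corpus embeddings; variable names are only a guess at the example's):

```python
# pip install scikit-learn-extra
import numpy as np
from sklearn_extra.cluster import KMedoids

# Stand-in for `corpus_embeddings = embedder.encode(corpus)` in applications/clustering.py
corpus_embeddings = np.random.randn(20, 768)

# The two-line change: swap KMeans for KMedoids and cluster on cosine distances.
clustering_model = KMedoids(n_clusters=5, metric="cosine", random_state=0)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)
```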

Again, I agree with keeping the toy a toy, no rabbit holes. Incidentally, I saw @nreimers used Agglomerative first, then switched to KMeans per a PR. Have you had more success with Agglomerative here? Something worth us trying?

Aside: I went nuts with a custom auto-encoder, with a joint loss on (1) MSE against the original embeddings; (2) cosine similarity to random other embeddings, to preserve cosine; (3) topic/cluster prediction based on the gensim LDA topic assignment of the original text. Then the whole trained auto-encoder is the clusterer, and its embeddings are used for similarity. Woof.
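
Very loosely, that joint loss might look something like this in PyTorch (layer sizes, loss weights, and topic count are made up, and the LDA topic ids are assumed to already exist from gensim):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAutoEncoder(nn.Module):
    def __init__(self, dim_in=768, dim_latent=64, n_topics=20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(), nn.Linear(256, dim_latent))
        self.decoder = nn.Sequential(nn.Linear(dim_latent, 256), nn.ReLU(), nn.Linear(256, dim_in))
        self.topic_head = nn.Linear(dim_latent, n_topics)  # predicts the LDA topic id

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z), self.topic_head(z)

def joint_loss(model, x, topic_ids):
    z, x_hat, topic_logits = model(x)
    # (1) reconstruct the original sentence embeddings
    loss_mse = F.mse_loss(x_hat, x)
    # (2) preserve cosine similarity to randomly paired other embeddings
    perm = torch.randperm(x.size(0))
    z_other, _, _ = model(x[perm])
    loss_cos = F.mse_loss(F.cosine_similarity(z, z_other), F.cosine_similarity(x, x[perm]))
    # (3) predict the gensim-LDA topic assignment of the original text
    loss_topic = F.cross_entropy(topic_logits, topic_ids)
    return loss_mse + loss_cos + loss_topic

# toy usage with random stand-ins for the embeddings and LDA topic ids
model = JointAutoEncoder()
x, topics = torch.randn(32, 768), torch.randint(0, 20, (32,))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer.zero_grad()
joint_loss(model, x, topics).backward()
optimizer.step()
```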

kevinmandich commented 4 years ago

@lefnire do you have example code for part (3) of your custom auto-encoder? That's something I've been meaning to try, but have not yet been able to implement.

lefnire commented 4 years ago

oh man, it's a doozie... I wouldn't use it if I were you. I'm actually preferring to use KMedoids(metric='cosine') (on a sample; too many rows won't fit in RAM). One sec, I'll put it up on a gist or such.

lefnire commented 4 years ago

Something like this autoencoder example. But again, don't use that unless you know you want to. I recommend instead the simpler kmedoids example.

kevinmandich commented 4 years ago

Good stuff! Thanks for sharing. I hope to give my own a shot this week.

lefnire commented 4 years ago

@kevinmandich I just discovered sklearn clusterers' linkage argument facepalm. Here's a simpler example than either of the above: agglomerative example
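
A sketch of that simpler agglomerative route (note: newer scikit-learn calls the distance parameter metric, older releases call it affinity):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

corpus_embeddings = np.random.randn(20, 768)  # stand-in for embedder.encode(corpus)

# linkage must not be "ward" when clustering on cosine distances
clustering_model = AgglomerativeClustering(n_clusters=5, metric="cosine", linkage="average")
labels = clustering_model.fit_predict(corpus_embeddings)
print(labels)
```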

pashok3d commented 4 years ago

Currently I'm working on a clustering task. I fine-tuned a custom BERT using my own STS-like and Contrast data sets (round-robin), and created a custom evaluator for measuring cluster quality with the Contrast data set. I grid-searched through a bunch of different clustering algorithms, and for my corpus KMeans performed the best. KMedoids performed similarly, and HDBSCAN showed decent results as well. I also want to try dimension reduction with UMAP, instead of PCA, prior to clustering, and to give Subspace Clustering and Spherical K-means a try. Using an auto-encoder, as @lefnire describes above, is a great idea to my mind, and I wonder: how large a corpus do you need to train it?
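
A rough sketch of the UMAP-then-cluster idea (assuming umap-learn is installed; component and cluster counts are arbitrary):

```python
# pip install umap-learn
import numpy as np
import umap
from sklearn.cluster import KMeans

corpus_embeddings = np.random.randn(200, 768)  # stand-in for the fine-tuned BERT embeddings

# Reduce to a low-dimensional space while approximately preserving the
# cosine-based neighbourhood structure, then cluster in that space.
reduced = umap.UMAP(n_components=10, metric="cosine", random_state=42).fit_transform(corpus_embeddings)
labels = KMeans(n_clusters=8, random_state=42, n_init=10).fit_predict(reduced)
print(labels)
```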

lefnire commented 4 years ago

@pashok3d my corpus for the AE is 75k docs. The AE works pretty well actually, and its encodings allow for pretty solid KMeans/Agg/whatever. Updated my AE above - pretty hairy, so good luck. I found that agglomerative(precomputed(cosine)) works almost as well as the AE + downstream clusterer for my case.
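
The agglomerative(precomputed(cosine)) variant would look roughly like this (a sketch, not the actual code; older scikit-learn spells the parameter affinity="precomputed"):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances

corpus_embeddings = np.random.randn(100, 768)  # stand-in for the encoded corpus

# Precompute the full cosine distance matrix and hand it to the clusterer.
distances = cosine_distances(corpus_embeddings)
clustering_model = AgglomerativeClustering(n_clusters=8, metric="precomputed", linkage="average")
labels = clustering_model.fit_predict(distances)
print(labels)
```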

lefnire commented 4 years ago

@kevinmandich I've got a cleaned-up repo for clustering, autoencoding, etc. at https://github.com/lefnire/ml-tools