Open · PhilipMay opened this issue 4 years ago
Hi @PhilipMay, the examples are just examples, so for a real clustering application it might make sense to try different approaches.
If cosine similarity is used for clustering, I think you will not see a difference if you normalize the vectors to unit length: cos_sim(u, v) == cos_sim(u / ||u||, v / ||v||).
Best Nils Reimers
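A quick numeric sanity check of that identity (a small numpy sketch, not from the thread):

```python
import numpy as np

def cos_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u, v = np.random.rand(768), np.random.rand(768)

# Normalizing to unit length does not change the cosine similarity.
print(np.isclose(cos_sim(u, v),
                 cos_sim(u / np.linalg.norm(u), v / np.linalg.norm(v))))  # True
```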
Correct me if I'm wrong, but isn't KMeans a bad algorithm here altogether? KMeans' metric is Euclidean, and Euclidean distance performs poorly in high dimensions. One way to improve this is UMAP(embeddings) first, to reduce dimensions while trying to preserve topology. But even then, these use-cases lend themselves better to cosine (especially in the case of *-nli-stsb-trained models); so I'm not sure the topology can be preserved for downstream Euclidean? It seems what you really want here is cosine-based clustering.
I realize applications/clustering.py is a toy example, with no intention of optimizing; but I'm wondering if it just won't work whatsoever and will lead to frustrated copy-paste users, and might be worth replacing (keeping it minimal) with something like KMedoids(metric='cosine') (sklearn_extra/kmedoids). It would even be just a two-line change in that file: the import statement & kmeans => KMedoids(metric='cosine') (leave it to the users to figure out installing the package).
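A minimal sketch of that swap, assuming sentence-transformers plus scikit-learn-extra (pip install scikit-learn-extra); the corpus and model name here are placeholders, not the repo's actual example:

```python
from sentence_transformers import SentenceTransformer
from sklearn_extra.cluster import KMedoids

corpus = ["A man is eating food.", "A man is eating a piece of bread.",
          "The girl is carrying a baby.", "A cheetah is running behind its prey."]

model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")  # placeholder model name
embeddings = model.encode(corpus)

# Same fit/predict API shape as KMeans, but medoid-based and with a cosine metric.
clusterer = KMedoids(n_clusters=2, metric="cosine")
labels = clusterer.fit_predict(embeddings)
print(labels)
```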
Again, I agree with keeping the toy a toy, no rabbit-holes. Incidentally, I saw @nreimers used Agglomerative first, then switched to KMeans per a PR. Have you had more success with Agglomerative here? Something worth us trying?
Aside: I went nuts with a custom auto-encoder with a joint loss on (1) MSE against the original embeddings; (2) cosine similarity to random other embeddings, to preserve cosine; (3) topic/cluster prediction based on the gensim LDA topic assignment of the original text. Then the whole trained auto-encoder is the clusterer, and its embeddings are used for similarity. Woof.
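For anyone curious, a rough sketch of what such a joint-loss auto-encoder could look like; this is a reconstruction from the description above (PyTorch), not @lefnire's actual code, and the topic labels are assumed to come from a separate gensim LDA pass:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAE(nn.Module):
    def __init__(self, dim=768, latent=64, n_topics=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, dim))
        self.topic_head = nn.Linear(latent, n_topics)

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z), self.topic_head(z)

def joint_loss(model, x, topic_labels):
    z, recon, topic_logits = model(x)
    # (1) reconstruct the original embeddings
    l_recon = F.mse_loss(recon, x)
    # (2) preserve pairwise cosine similarity against a random shuffle of the batch
    perm = torch.randperm(x.size(0))
    l_cos = F.mse_loss(F.cosine_similarity(z, z[perm]),
                       F.cosine_similarity(x, x[perm]))
    # (3) predict the LDA topic assigned to the original text
    l_topic = F.cross_entropy(topic_logits, topic_labels)
    return l_recon + l_cos + l_topic

# Toy usage: random stand-in embeddings and fake LDA topic labels.
x = torch.randn(32, 768)
topics = torch.randint(0, 20, (32,))
model = JointAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = joint_loss(model, x, topics)
loss.backward()
opt.step()
```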
@lefnire do you have example code for part (3) of your custom auto-encoder? That's something I've been meaning to try, but have not yet been able to implement.
Oh man, it's a doozy... I wouldn't use it if I were you. I'm actually preferring KMedoids(metric='cosine') (on a sample; too many rows won't fit in RAM). One sec, I'll put it up in a gist or such.
Something like this autoencoder example. But again, don't use that unless you know you want to. Instead, I recommend the simpler kmedoids example.
Good stuff! Thanks for sharing. I hope to give my own a shot this week.
@kevinmandich I just discovered sklearn clusterers' linkage argument (facepalm). Here's a simpler example than either of the above: agglomerative example
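For reference, a hedged sketch of what that could look like with scikit-learn's AgglomerativeClustering (random stand-in embeddings; note that sklearn versions before 1.2 name the parameter affinity= instead of metric=):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

embeddings = np.random.rand(100, 768)  # stand-in for sentence embeddings

# Cosine distance requires a non-'ward' linkage; 'average' is a common choice.
clusterer = AgglomerativeClustering(n_clusters=5, metric="cosine", linkage="average")
labels = clusterer.fit_predict(embeddings)
print(labels[:10])
```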
Currently I'm working on a clustering task. I fine-tuned a custom BERT model using my own STS-like and Contrast data sets (round-robin), and created a custom evaluator for judging cluster quality using my Contrast data set. I grid-searched through a bunch of different cluster algorithms, and for my corpus KMeans performed the best; KMedoids performed similarly, and HDBSCAN showed decent results as well. I also want to try dimensionality reduction with UMAP, instead of PCA, prior to clustering, and to give Subspace Clustering and Spherical K-means a try. Using an auto-encoder is a great idea in my opinion, and I wonder: how large a corpus do you need to train it?
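As a rough sketch of the UMAP-before-clustering idea (assuming the umap-learn and hdbscan packages; parameter values are illustrative only, not tuned):

```python
import numpy as np
import umap
import hdbscan

embeddings = np.random.rand(1000, 768).astype(np.float32)  # stand-in for sentence embeddings

# Reduce with a cosine metric so the low-dimensional space reflects cosine neighborhoods.
reduced = umap.UMAP(n_components=10, metric="cosine", random_state=42).fit_transform(embeddings)

# HDBSCAN picks the number of clusters itself and labels outliers as -1.
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)
print(np.unique(labels))
```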
@pashok3d my corpus for the AE is 75k docs. The AE works pretty well actually, and its encodings allow for pretty solid KMeans/Agg/whatever. I updated my AE above - pretty hairy, so good luck. I found that agglomerative(precomputed(cosine)) works almost as well as the AE + a downstream clusterer for my case.
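If it helps, agglomerative(precomputed(cosine)) presumably means something along these lines (a sketch, not the author's code):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances

embeddings = np.random.rand(100, 768)  # stand-in for sentence embeddings

dist = cosine_distances(embeddings)  # precomputed pairwise cosine-distance matrix
clusterer = AgglomerativeClustering(n_clusters=5, metric="precomputed", linkage="average")
labels = clusterer.fit_predict(dist)
```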
@kevinmandich I've got a cleaned-up repo for clustering, autoencoding, etc. at https://github.com/lefnire/ml-tools
I think the clustering example could be improved by normalizing the vectors to unit length before doing the clustering. This is because the comparison uses cosine distance and not Euclidean distance.
What do you think?
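For what it's worth, the suggested normalization is a one-liner; for unit-length vectors, squared Euclidean distance equals 2 * (1 - cosine similarity), so KMeans on normalized embeddings behaves much like cosine-based ("spherical") k-means. A sketch, not the repo's code:

```python
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.random.rand(100, 768)  # stand-in for sentence embeddings

# Normalize each embedding to unit length before clustering.
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

labels = KMeans(n_clusters=5, n_init=10).fit_predict(normalized)
print(labels[:10])
```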