UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Finetune for "clustering" when we don't have exact positive/negative pairs #2936

Open HenningDinero opened 1 month ago

HenningDinero commented 1 month ago

When using the triplet loss, we try to minimize the distance between each anchor-positive pair (a_i, p_i) while maximizing the distance between the anchor and the other examples' positives (a_i, p_j), j != i.
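(For reference, the standard triplet objective with margin m is roughly:

```math
\mathcal{L}(a, p, n) = \max\bigl(d(a, p) - d(a, n) + m,\ 0\bigr)
```

where d is the distance between embeddings and n is the negative.)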

I'm trying to solve the following: for given sets of texts t1 = ["text about banking", "text about finance", "text about money laundry"] and t2 = ["text about sport", "text about injuries", "text about running shoes"], create embeddings such that the texts in t1 are closer to each other than to any text in t2, i.e. create embeddings which are clustered.

As far as I can see this is not directly supported - but is there a way around it? I could take each text in t2 as a hard negative for each text in t1, but I can't figure out whether there is a better approach, because we would still get an anchor/negative pair for each remaining text in t1: if I set a_1 = "text about banking" and p_1 = "text about finance", then "text about money laundry" would end up as a negative for "text about banking", which it shouldn't be. A minimal sketch of that workaround is below.
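Something like this, assuming the classic `InputExample`/`model.fit` API and using only cross-cluster texts as negatives, so same-cluster texts are never paired as anchor/negative (the model name is just a placeholder):

```python
from itertools import permutations

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

t1 = ["text about banking", "text about finance", "text about money laundry"]
t2 = ["text about sport", "text about injuries", "text about running shoes"]

def make_triplets(cluster, other_cluster):
    # Every ordered (anchor, positive) pair within a cluster, combined
    # with every text from the other cluster as the negative. No text
    # from the anchor's own cluster ever appears as a negative.
    return [
        InputExample(texts=[anchor, positive, negative])
        for anchor, positive in permutations(cluster, 2)
        for negative in other_cluster
    ]

train_examples = make_triplets(t1, t2) + make_triplets(t2, t1)

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```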

Note, there is this example https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py which shows how to apply a model to create clusters - but I want to fine-tune the model based on the "clusters" themselves.

ir2718 commented 1 month ago

To me this sounds a lot like hierarchical classification, where hyperbolic embeddings are often used. Have a look at this. You can partially automate the process of creating labels by using an existing sentence transformer model together with hierarchical agglomerative clustering (and possibly relabel the mistakes manually). Since it seems you're working on some kind of topic modeling, check out BERTopic, as it does something similar but also includes dimensionality reduction. Does this fit your use case?
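A rough sketch of the label-automation step, assuming scikit-learn's `AgglomerativeClustering` (the model name and `distance_threshold` are placeholders you would tune):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

texts = [
    "text about banking", "text about finance", "text about money laundry",
    "text about sport", "text about injuries", "text about running shoes",
]

# Embed with an existing (not yet fine-tuned) model.
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model
embeddings = model.encode(texts, normalize_embeddings=True)

# Hierarchical agglomerative clustering; letting the distance threshold
# determine the number of clusters instead of fixing n_clusters.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1.0,  # placeholder, tune on your data
    metric="cosine",
    linkage="average",
)
labels = clustering.fit_predict(embeddings)

for text, label in zip(texts, labels):
    print(label, text)
```

The resulting labels (after manually correcting mistakes) could then be used to build the training pairs/triplets discussed above.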