HenningDinero opened 1 month ago
To me this sounds a lot like hierarchical classification, where hyperbolic embeddings are often used. Have a look at this. You can partially automate the process of creating labels using an existing sentence transformer model and hierarchical agglomerative clustering (and possibly relabel the mistakes manually). Since it seems you're working on some kind of topic modeling, check out BERTopic, as it does something similar but includes dimensionality reduction. Does this fit your use case?
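A minimal sketch of that labeling pipeline, assuming `sentence-transformers` and `scikit-learn` are installed (the model name and the distance threshold are placeholders you would tune; on scikit-learn versions before 1.2 the `metric` argument is called `affinity`):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

texts = [
    "text about banking", "text about finance", "text about money laundry",
    "text about sport", "text about injuries", "text about running shoes",
]

# Embed with any off-the-shelf model (the model name is just an example)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, normalize_embeddings=True)

# Hierarchical agglomerative clustering: with distance_threshold set,
# n_clusters must be None and the number of clusters is decided by the data.
# The threshold is a hyperparameter; inspect the result and relabel mistakes.
clusterer = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.5,
    metric="cosine",
    linkage="average",
)
labels = clusterer.fit_predict(embeddings)
print(list(zip(texts, labels)))
```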
When using the Triplet loss, we try to minimize the distance between each pair `(a_1, p_1)` while maximizing the distance between `(a_1, p_j)`, `j != 1`.
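(For context: the standard triplet objective is `max(d(a, p) - d(a, n) + margin, 0)` for anchor `a`, positive `p` and negative `n`; in `sentence_transformers.losses.TripletLoss` the distance `d` is Euclidean by default and `margin` is a hyperparameter, if I read the implementation correctly.)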
I'm trying to solve the following: for the given sets of texts

`t1 = ["text about banking", "text about finance", "text about money laundry"]`

and

`t2 = ["text about sport", "text about injuries", "text about running shoes"]`

create embeddings such that the embeddings for `t1` are closer to each other than to any in `t2`, i.e. create embeddings which are clustered.

As far as I can see that is not "directly supported" - but is there a way around this? I could take each text in `t2` as a hard negative for each text in `t1`, but I can't figure out whether there is a better approach, because we would still get an anchor/negative pair for each text in `t1`: if I set `a_1 = "text about banking"` and `p_1 = "text about finance"`, then `"text about money laundry"` would be a negative for `"text about banking"`, which it shouldn't be.
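For completeness, here is roughly how I imagine that workaround would look, building the triplets explicitly so that anchor/positive pairs come only from `t1` and negatives only from `t2` (untested sketch with the `InputExample`/`model.fit` API; the model name and hyperparameters are just placeholders):

```python
from itertools import permutations

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

t1 = ["text about banking", "text about finance", "text about money laundry"]
t2 = ["text about sport", "text about injuries", "text about running shoes"]

# Anchor/positive pairs come only from t1, negatives only from t2,
# so no t1 text ever appears as a negative for another t1 text.
train_examples = [
    InputExample(texts=[anchor, positive, negative])
    for anchor, positive in permutations(t1, 2)
    for negative in t2
]

model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```

This avoids the problem above because no `t1` text is ever used as a negative, but the number of triplets grows quadratically in the cluster size, which is why I'm asking whether there is a better approach.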
which it shouldn't be.Note, there is this example https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py which shows to apply a model to create clusters - I want to fine-tune the model based on "clusters"