Training embeddings on imbalance data

UKPLab / sentence-transformers

State-of-the-Art Text Embeddings

https://www.sbert.net

Apache License 2.0


Training embeddings on imbalance data #734

Open huhk-sysu opened 3 years ago

huhk-sysu commented 3 years ago

Hi,

I'm trying to train sentence embeddings, but my data is imbalanced: there are 3500 sentence pairs with a similarity score of 0.0, 2000 pairs with a score of 0.5, and only 250 pairs with a score of 1.0.

I know it's hard to train a good classifier on imbalanced data, but I wonder whether the imbalance would also hurt performance when I use the STS training style. Could you offer some suggestions?
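To make the imbalance concrete, here is a minimal sketch that computes each score's share of the dataset from the counts given above. The variable names are illustrative only and do not come from any library.

```python
from collections import Counter

# Pair counts per similarity score, as stated in the question.
score_counts = Counter({0.0: 3500, 0.5: 2000, 1.0: 250})

total_pairs = sum(score_counts.values())

# Fraction of the dataset that carries each score.
score_shares = {score: count / total_pairs for score, count in score_counts.items()}

for score, share in sorted(score_shares.items()):
    print(f"score {score:.1f}: {score_counts[score]:>4} pairs ({share:.1%})")

# Mean gold score across all pairs.
mean_score = sum(score * count for score, count in score_counts.items()) / total_pairs
print(f"mean score: {mean_score:.4f}")
```

With these counts, pairs scored 1.0 make up roughly 4% of the 5750 pairs.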

Thank you!

nreimers commented 3 years ago

This will not be an issue.
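For context, a rough sketch of what the "STS training style" objective looks like, assuming it refers to regressing the cosine similarity of the two sentence embeddings onto the gold score with a mean squared error (the approach taken by the `CosineSimilarityLoss` class in sentence-transformers). The code below is plain Python, not the library's implementation; the helper names and the toy 2-d vectors are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sts_regression_loss(pairs):
    """Mean squared error between cosine similarity and the gold score."""
    errors = [(cosine_similarity(u, v) - gold) ** 2 for u, v, gold in pairs]
    return sum(errors) / len(errors)

# Hypothetical embeddings, one pair per score level from the question.
toy_pairs = [
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # orthogonal vectors, gold score 0.0
    ([1.0, 0.0], [1.0, 1.0], 0.5),  # 45 degrees apart, gold score 0.5
    ([1.0, 1.0], [2.0, 2.0], 1.0),  # parallel vectors, gold score 1.0
]

loss = sts_regression_loss(toy_pairs)
print(f"loss: {loss:.4f}")
```

Each pair contributes a continuous regression error toward its own target score, so the scores are treated as points on a scale, not as separate classes.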