UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.21k stars 2.47k forks

Prevent incorrect column order issues #2791

Open tomaarsen opened 4 months ago

tomaarsen commented 4 months ago

Hello!

Sentence Transformers doesn't care much about the names of your columns (except "score" and "label"); it just takes the other columns, in order, as the inputs to your loss. If your dataset has its columns in the ["positive", "anchor"] order, you might get some rather "meh" results. See for example https://huggingface.co/MagnusSa/nb-sbert-base-utdanning-20/discussions/1
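To make the ordering behaviour concrete, here is a simplified illustration in plain Python. It is not the library's actual code: `loss_input_columns` and `LABEL_COLUMNS` are hypothetical names, and the sketch only assumes what is described above, namely that "label" and "score" are set aside and everything else is used in dataset order.

```python
# Hypothetical helper for illustration only; not part of the
# sentence-transformers API.
LABEL_COLUMNS = {"label", "score"}


def loss_input_columns(column_names):
    """Return the columns that would be used as loss inputs, in dataset order."""
    return [name for name in column_names if name not in LABEL_COLUMNS]


# The names are ignored, so a swapped dataset silently yields swapped inputs.
print(loss_input_columns(["positive", "anchor"]))  # ['positive', 'anchor']
print(loss_input_columns(["label", "anchor", "positive"]))  # ['anchor', 'positive']
```

In the first call the "positive" column would be treated as the first input, even though its name suggests it should be the second.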

MultipleNegativesRankingLoss optimizes "given the first column, find which sample is most likely the matching value in the second column". With the columns swapped as in that example, the model is instead trained to optimize "given the answer, what is the matching question?". Not ideal.

I'm interested in adding some warnings when we can automatically detect that this is happening, but I'm not sure what the best approach is. Perhaps we can read the column names and, if we recognize them but they're "out of order", give a warning? E.g. if someone uses an "anchor" column but it isn't the first non-label column? The same goes for "positive", "negative", "sentence1", "sentence2", "sentence_1", "sentence_2", "sentence_A", "sentence_B", etc.
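One possible shape for such a check, as a rough sketch rather than a proposed implementation: compare the relative order of recognized column names against a few known naming schemes and warn on a mismatch. Everything here is an assumption for illustration: the names `KNOWN_ORDERS`, `find_order_problems` and `warn_on_column_order` are hypothetical, the list of schemes is just the examples mentioned above, and the check only looks at relative order rather than "is the anchor the first non-label column".

```python
import warnings

# Columns that are treated as targets rather than loss inputs.
LABEL_COLUMNS = {"label", "score"}

# Assumed naming schemes, each listed in its expected order.
KNOWN_ORDERS = [
    ("anchor", "positive", "negative"),
    ("sentence1", "sentence2"),
    ("sentence_1", "sentence_2"),
    ("sentence_A", "sentence_B"),
]


def find_order_problems(column_names):
    """Return (found, expected) pairs for recognized columns that are out of order."""
    inputs = [name for name in column_names if name not in LABEL_COLUMNS]
    problems = []
    for scheme in KNOWN_ORDERS:
        found = [name for name in inputs if name in scheme]  # dataset order
        expected = [name for name in scheme if name in found]  # scheme order
        if found != expected:
            problems.append((found, expected))
    return problems


def warn_on_column_order(column_names):
    """Emit a UserWarning for each out-of-order group and return the problems."""
    problems = find_order_problems(column_names)
    for found, expected in problems:
        warnings.warn(
            f"Columns {found} look out of order; did you mean {expected}? "
            "Column order, not column name, determines the loss inputs.",
            UserWarning,
        )
    return problems


warn_on_column_order(["positive", "anchor"])  # warns
warn_on_column_order(["anchor", "positive", "label"])  # silent
```

Unrecognized column names are ignored entirely, so this would never warn on custom schemas; whether that is the right trade-off is part of the open question.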

milistu commented 3 months ago

Hi @tomaarsen,

I am interested in this issue, but I would need some more help. As I understand it, this dataset check would be implemented somewhere here? I suggest adding a class method that can be called to check the structure of the data that is passed in. Am I on the right path?

Thank you in advance,