Prevent incorrect column order issues

Hello!

Sentence Transformers mostly ignores the names of your columns (except "score" and "label") and simply passes the remaining columns, in order, as the inputs to your loss. If your dataset has its columns in the ["positive", "anchor"] order, you might get some rather "meh" results. See for example https://huggingface.co/MagnusSa/nb-sbert-base-utdanning-20/discussions/1

MultipleNegativesRankingLoss optimizes "given the first column, find which sample in the second column is the most likely match". With the columns in the wrong order, it instead trains for "given the answer, what is the matching question?". Not ideal.
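To make the direction of the objective concrete, here is a toy sketch in plain Python. It is not the library's implementation: `in_batch_ranking_loss` is a hypothetical helper that takes a made-up similarity matrix (row i = first-column sample i, column j = second-column sample j) and computes the in-batch cross-entropy with the diagonal as the correct match, leaving out embeddings and any similarity scaling. Swapping the two dataset columns corresponds to transposing that matrix.

```python
import math


def in_batch_ranking_loss(sim):
    """Mean cross-entropy where row i should rank column i highest.

    Hypothetical helper for illustration only; `sim` is a square list of
    lists of similarity scores.
    """
    total = 0.0
    for i, row in enumerate(sim):
        # -log softmax(row)[i] == log(sum(exp(row))) - row[i]
        log_norm = math.log(sum(math.exp(s) for s in row))
        total += log_norm - row[i]
    return total / len(sim)


def transpose(matrix):
    """Swap rows and columns, i.e. swap the roles of the two dataset columns."""
    return [list(col) for col in zip(*matrix)]


# Made-up scores: the first "anchor" is also fairly similar to the other positives.
sim = [
    [5.0, 4.0, 4.0],
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
]

anchor_first = in_batch_ranking_loss(sim)
positive_first = in_batch_ranking_loss(transpose(sim))

print(f"anchor -> positive loss: {anchor_first:.4f}")
print(f"positive -> anchor loss: {positive_first:.4f}")
```

The two values differ, which is the point: the same pairs give a different training signal depending on which column comes first.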

I'm interested in adding some warnings when we automatically detect this, but I'm not sure what the best approach is. Perhaps we can read the column names, and if we recognize them but they're "out of order", then we give a warning? E.g. if someone uses an "anchor" column but it isn't the first non-label column? The same goes for "positive", "negative", "sentence1", "sentence2", "sentence_1", "sentence_2", "sentence_A", "sentence_B", etc.
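A rough sketch of what such a check could look like, in plain Python. Everything here is an assumption for discussion rather than existing library code: `check_column_order`, `LABEL_COLUMNS`, and `KNOWN_ORDERS` are hypothetical names, and the list of recognized names only covers the examples mentioned above.

```python
import warnings

# Assumption: these are the columns treated as labels rather than inputs.
LABEL_COLUMNS = {"score", "label"}

# Assumption: recognized name groups, each listed in its expected order.
KNOWN_ORDERS = [
    ["anchor", "positive", "negative"],
    ["sentence1", "sentence2"],
    ["sentence_1", "sentence_2"],
    ["sentence_A", "sentence_B"],
]


def check_column_order(column_names):
    """Warn about recognized columns that look out of order.

    Returns the list of warning messages (empty if nothing looks wrong).
    """
    inputs = [name for name in column_names if name not in LABEL_COLUMNS]
    problems = []
    for expected in KNOWN_ORDERS:
        found = [name for name in inputs if name in expected]   # observed order
        wanted = [name for name in expected if name in inputs]  # expected order
        if found != wanted:
            problems.append(
                f"Columns {found} appear out of order; expected {wanted}."
            )
        elif wanted and wanted[0] == expected[0] and inputs[0] != expected[0]:
            # E.g. "anchor" is present but is not the first non-label column.
            problems.append(
                f"Column {expected[0]!r} is not the first non-label column."
            )
    for message in problems:
        warnings.warn(message, stacklevel=2)
    return problems


if __name__ == "__main__":
    check_column_order(["positive", "anchor"])         # warns: out of order
    check_column_order(["anchor", "positive", "label"])  # no warning
```

Unrecognized column names are left alone, so this would only fire when a dataset uses one of the conventional names in an unconventional position.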

Tom Aarsen

UKPLab / sentence-transformers

Prevent incorrect column order issues #2791