Open tomaarsen opened 4 months ago
Hi @tomaarsen,
I am interested in this issue, but I would need some more help. As I understand it, this dataset check would be implemented somewhere here? I suggest adding a class method that can be called to check the structure of the data that is passed in. Am I on the right path?
Thank you in advance,
Hello!
Sentence Transformers doesn't care much about the names of your columns (except "score" and "label"), and just takes the other columns in order as the inputs to your losses. If your dataset has the columns in the `["positive", "anchor"]` order, then you might get some rather "meh" results. See for example https://huggingface.co/MagnusSa/nb-sbert-base-utdanning-20/discussions/1
When combined with `MultipleNegativesRankingLoss`, this will optimize "given the first column, find which sample is most likely the matching value in the second column". In this case, it's now training to optimize "given the answer, what is the matching question?". Not ideal.
I'm interested in adding some warnings when we automatically detect that this happens, but I'm not sure what the best approach is. Perhaps we can read the column names, and if we recognize them but they're "out of order", we give a warning? E.g. if someone uses an "anchor" column but it isn't the first non-label column? The same goes for "positive", "negative", "sentence1", "sentence2", "sentence_1", "sentence_2", "sentence_A", "sentence_B", etc.
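A rough sketch of what such a check could look like, to make the idea concrete. This is not existing Sentence Transformers code: `KNOWN_ORDERS`, `LABEL_COLUMNS`, `find_column_order_issues` and `warn_on_column_order` are hypothetical names, and the list of recognized column names is just the one from this issue. It only compares the relative order of recognized names after dropping "label" and "score", and says nothing about unrecognized columns.

```python
import warnings

# Assumption: groups of well-known column names, each in its expected order.
KNOWN_ORDERS = [
    ["anchor", "positive", "negative"],
    ["sentence1", "sentence2"],
    ["sentence_1", "sentence_2"],
    ["sentence_A", "sentence_B"],
]
# Columns that are treated as labels rather than loss inputs.
LABEL_COLUMNS = {"label", "score"}


def find_column_order_issues(column_names):
    """Return one message per recognized column group that is out of order."""
    inputs = [name for name in column_names if name not in LABEL_COLUMNS]
    issues = []
    for expected in KNOWN_ORDERS:
        # Recognized names from this group, in the order we expect them.
        present = [name for name in expected if name in inputs]
        # The same names, in the order they actually appear in the dataset.
        actual = sorted(present, key=inputs.index)
        if actual != present:
            issues.append(
                f"Columns appear in the order {actual}, but {present} was expected. "
                "Input columns are passed to the loss in dataset order, not by name."
            )
    return issues


def warn_on_column_order(column_names):
    """Emit a UserWarning for each suspicious column ordering."""
    for issue in find_column_order_issues(column_names):
        warnings.warn(issue, UserWarning, stacklevel=2)
```

With a `datasets.Dataset`, this could be called as `warn_on_column_order(dataset.column_names)` before training starts. A warning rather than an error seems safer here, since someone might use these names in a different order on purpose.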