UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Should I use SoftmaxLoss for a binary classification task? #2754

Open GeraldFZ opened 1 month ago

GeraldFZ commented 1 month ago

Hi, thanks for sharing your work, it's great!

I have a binary classification task with labels 0 and 1, predicted from the embeddings of two sentences. May I ask whether I should use SoftmaxLoss for this binary task by simply setting the parameter _num_labels_ to 2? I am hesitating because it seems to use CrossEntropyLoss rather than binary cross-entropy for the loss function. If not, what loss function would you recommend? @tomaarsen @nreimers Thanks a lot! :)

tomaarsen commented 1 month ago

Hello!

It actually sounds a bit like you have 1 label, which either has a value of 0 or 1? Is that right? If so, SoftmaxLoss with num_labels=1 is an option, I believe. I'm not sure whether BinaryCrossEntropy would be a better loss function than SoftmaxLoss here, but I assume it's possible.
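For illustration, here is a toy sketch (not the library code) of what SoftmaxLoss roughly computes internally with its default settings, which also shows why it uses CrossEntropyLoss: the two sentence embeddings u and v are concatenated with |u - v| and fed to a linear classifier with num_labels outputs.

```python
import torch
import torch.nn as nn

# Toy sketch of the SoftmaxLoss idea: concatenate u, v and |u - v|,
# classify with a linear layer, score with CrossEntropyLoss.
dim = 8
u = torch.randn(4, dim)               # embeddings of the first sentences (batch of 4)
v = torch.randn(4, dim)               # embeddings of the second sentences
labels = torch.tensor([0, 1, 1, 0])   # binary class labels

classifier = nn.Linear(3 * dim, 2)    # num_labels = 2 -> one logit per class
features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
logits = classifier(features)         # shape: (4, 2)
loss = nn.CrossEntropyLoss()(logits, labels)
```

With num_labels=2 this is a standard two-class classifier, so CrossEntropyLoss is the natural fit even for a binary task.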

That said, SoftmaxLoss might not be the strongest option. You can consider converting your dataset to one that's compatible with e.g. CosineSimilarityLoss or MultipleNegativesRankingLoss as shown here: https://sbert.net/docs/sentence_transformer/loss_overview.html
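As a sketch of that conversion (column names follow the conventions in the linked loss overview; the example pairs are made up): binary labels can be cast to float scores for CosineSimilarityLoss, or filtered down to positive pairs for MultipleNegativesRankingLoss.

```python
# Hypothetical binary-labeled pairs: (sentence1, sentence2, label)
raw = [
    ("the cat sat", "a cat was sitting", 1),
    ("the cat sat", "stock prices fell", 0),
    ("it is raining", "rain is falling", 1),
]

# For CosineSimilarityLoss: cast the binary label to a float score in [0, 1].
cosine_data = [
    {"sentence1": s1, "sentence2": s2, "score": float(label)}
    for s1, s2, label in raw
]

# For MultipleNegativesRankingLoss: keep only the positive pairs; the other
# sentences in each batch act as negatives automatically.
mnrl_data = [
    {"anchor": s1, "positive": s2} for s1, s2, label in raw if label == 1
]
```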

Additionally, you might be interested in the SetFit project: https://github.com/huggingface/setfit/

GeraldFZ commented 1 month ago

Thanks a lot for your answer!! I was trying to use CosineSimilarityLoss with BCE for this binary classification task, also considering your answer to my other question: https://github.com/UKPLab/sentence-transformers/issues/2753

Thanks and all the best Zhe

tomaarsen commented 1 month ago

May I ask your opinion: should I use sigmoid, (cos_sim + 1) / 2, or some other mapping?

I think (cos_sim + 1) / 2 makes more sense than the (default) sigmoid, as sigmoid will only use roughly 0.3 to 0.7 when fed values between -1 and 1. That said, the cosine similarity is rarely negative, and perhaps you should push "unrelated" to a cosine similarity of 0 rather than -1, e.g. like in #2753. In other words, perhaps the best solution is relu(cos_sim), to simply replace all negatives with 0?
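The three candidate mappings can be compared numerically (a small stdlib-only sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Map cosine similarity in [-1, 1] to a BCE-friendly [0, 1] three ways:
for cos_sim in (-1.0, 0.0, 0.5, 1.0):
    affine = (cos_sim + 1) / 2    # spans the full [0, 1] range
    relu = max(cos_sim, 0.0)      # clamps negatives to 0, keeps positives as-is
    sig = sigmoid(cos_sim)        # only covers about [0.27, 0.73] on [-1, 1]
    print(f"{cos_sim:+.1f} -> affine {affine:.2f}, relu {relu:.2f}, sigmoid {sig:.2f}")
```

Since sigmoid(-1) ≈ 0.27 and sigmoid(1) ≈ 0.73, a BCE target of exactly 0 or 1 is unreachable with sigmoid, while the affine map and relu can both hit the endpoints.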

GeraldFZ commented 1 month ago

Thanks for your reply, it's very inspiring! So, as I understand our discussion:

  1. for this binary classification task using BCE, since I need to map the cosine similarity to (0, 1), transforming it with (cos_sim + 1) / 2 or relu(cos_sim) should be fine, or at least sounds more feasible than sigmoid.

  2. for my non-binary prediction task with label values from 0 to 1, according to your experiments, using the raw cosine similarity in (-1, 1) should be fine, and as you said, in practice "using ranges of 0 to 1 simply seems to result in higher scores". Even so, putting the labels and the cosine values into the same range (a. mapping cos_sim to (0, 1), or b. mapping my labels to (-1, 1)) deserves a try, and I will see how the experiments go.
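A minimal numeric sketch of point 1 (toy numbers, stdlib only): binary cross-entropy on cosine similarities mapped into [0, 1] with (cos_sim + 1) / 2.

```python
import math

cos_sims = [0.9, -0.2, 0.4]   # toy cosine similarities in [-1, 1]
labels = [1.0, 0.0, 1.0]      # binary gold labels

# Map to [0, 1], then take the mean binary cross-entropy.
probs = [(c + 1) / 2 for c in cos_sims]
bce = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for y, p in zip(labels, probs)
) / len(labels)
print(round(bce, 4))  # 0.3063
```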

May I ask whether I understood this correctly?

Thanks for your patience!

Best Zhe