UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

The BatchHardTripletLoss loss function is underperforming. #279

Status: Open · opened by yangtianyu92 4 years ago

yangtianyu92 commented 4 years ago

why?

cpcdoy commented 4 years ago

The performance of this loss greatly depends on your dataset. You need to provide it with good, labeled data so that it can pick relevant triplets.

You also need a good distribution of your data inside each batch during training: a batch that happens to contain only bad triplets could lead your model to learn the wrong data distribution. Also try bigger batch sizes if you can fit them, so you have a better chance of getting a good variety of triplets in a given batch.

You should also not forget that this loss has a hyperparameter, the margin, whose value can greatly influence the resulting performance.
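To make the mechanics concrete, here is a minimal PyTorch sketch of the batch-hard strategy with a margin (a paraphrase of the idea, not the library's exact implementation):

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=5.0):
    """Batch-hard mining sketch: for each anchor, take its farthest
    positive and closest negative within the batch."""
    dist = torch.cdist(embeddings, embeddings, p=2)    # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-label mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    hardest_pos = (dist * (same & ~eye)).max(dim=1).values  # farthest positive
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values  # closest negative
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```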

On my side, I've had very good results with this loss for my use case: fine-tuning on a large multilingual domain-specific dataset.

Finally, I've actually been getting even better results with a modification of this loss that uses a soft margin, which means you don't need to set a margin manually. I've also implemented a semi-hard triplet loss, but it gave less interesting results on my dataset.
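The soft-margin idea replaces the fixed-margin hinge with a softplus, roughly like this (same sketch style as above, not the exact code):

```python
import torch
import torch.nn.functional as F

def batch_hard_soft_margin_loss(embeddings, labels):
    # Same batch-hard mining as above; softplus(x) = log(1 + exp(x)) replaces
    # the fixed-margin hinge, so no margin hyperparameter is needed.
    dist = torch.cdist(embeddings, embeddings, p=2)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    hardest_pos = (dist * (same & ~eye)).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.softplus(hardest_pos - hardest_neg).mean()
```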

I could contribute my losses if there is interest.

yangtianyu92 commented 4 years ago

> The performance of this loss greatly depends on your dataset. […] I could contribute my losses if there is interest.

Thank you! I think it's probably because my dataset is too simple.

lrizzello commented 4 years ago

> The performance of this loss greatly depends on your dataset. […] I could contribute my losses if there is interest.

Could you contribute your loss functions, please? I'm also attempting to use BatchHardTripletLoss and have created a custom sampler to make sure the batches make sense, but the results I get are much worse than with a "classic" triplet loss.
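For context, here is a hypothetical sketch of what such a label-aware sampler can look like (all names here are made up for illustration):

```python
import random
from collections import defaultdict

def label_balanced_batches(labels, labels_per_batch=8, samples_per_label=4):
    """Hypothetical sampler: yield batches drawing several samples from
    several labels, so batch-hard mining always finds both positives
    and negatives for every anchor."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    # Only keep labels with enough examples to fill their quota.
    usable = [l for l, idxs in by_label.items() if len(idxs) >= samples_per_label]
    while True:
        batch = []
        for lab in random.sample(usable, min(labels_per_batch, len(usable))):
            batch.extend(random.sample(by_label[lab], samples_per_label))
        yield batch
```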

Probably not directly related to this issue, but I also want to point out that I had to change one line in BatchHardTripletLoss.py for my script to work: I had to change a .byte() to a .bool() (the spot was underlined in a screenshot, omitted here). But I doubt this is what is causing my bad results, since the two should be equivalent.
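The two are indeed equivalent as masks; newer PyTorch versions simply require the bool dtype. A quick check:

```python
import torch

mask = torch.eye(3)
old = mask.byte()  # uint8 masks are deprecated in newer PyTorch
new = mask.bool()  # drop-in replacement with the same truth values
assert torch.equal(old.bool(), new)
```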

cpcdoy commented 4 years ago

I'll try to contribute my losses when I get enough time to clean the code and make a PR.

Also, the problem you're describing is fixed in PR #254, which I made some time ago; it's already merged into master but probably isn't packaged in a release yet.

lrizzello commented 4 years ago

Since you mentioned changes that weren't packaged in the release yet, I went over the other changes to see if anything else could be causing problems, and I noticed that the SentenceLabelDataset class had recently been rewritten but that those changes were not yet in the release. So I tried the updated class, rewrote my custom sampler (since it no longer worked with the update), and now everything works way better than before.
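For reference, a minimal sketch of how the rewritten SentenceLabelDataset is typically constructed (the constructor arguments may differ between versions, so check your installed release):

```python
from sentence_transformers import InputExample
from sentence_transformers.datasets import SentenceLabelDataset

# Toy examples: one sentence and an integer class label each.
examples = [
    InputExample(texts=["first sentence"], label=0),
    InputExample(texts=["second sentence"], label=0),
    InputExample(texts=["third sentence"], label=1),
    InputExample(texts=["fourth sentence"], label=1),
]
# Regroups the data so each draw yields several samples per label,
# which batch-hard mining needs to find positives in every batch.
dataset = SentenceLabelDataset(examples, samples_per_label=2)
```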

Thanks for your insight!

cpcdoy commented 4 years ago

No problem!

So, I haven't tried the SentenceLabelDataset class myself, tbh, because I've been using custom datasets and custom samplers every time. So if you're saying that the class was broken, then that might be an issue too.

souravsaha commented 4 years ago

> The performance of this loss greatly depends on your dataset. […] I could contribute my losses if there is interest.

> Could you contribute your loss functions, please? […] I had to change a .byte() to a .bool() […]

Yes, thanks. I was getting the same error too.

cpcdoy commented 4 years ago

Finally got time to make a PR (#299) as requested in this thread.

The losses I provide are drop-in replacements for BatchHardTripletLoss and can help get even better results.

Hope this helps
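A minimal usage sketch of these drop-in losses (the model name here is only an example):

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("distilbert-base-nli-mean-tokens")

# Soft-margin variant: no margin hyperparameter to tune.
train_loss = losses.BatchHardSoftMarginTripletLoss(model=model)

# The semi-hard variant is also available as a drop-in replacement:
# train_loss = losses.BatchSemiHardTripletLoss(model=model)
```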

yangtianyu92 commented 4 years ago

Update your sentence-transformers and check the label readers (readers.py); you will find that the old version hadn't finished it yet.

On 07/17/2020 02:18, Mark Preston wrote:

> I've been having issues getting better results than the "out-of-the-box" embeddings as well when using the triplet loss functions. I have a domain-specific set with around 8K sentences, each with 1 of 4 labels (mostly balanced, though one is slightly larger than the other 3), and I'm setting up the problem with the SentenceLabelDataset class as well. When reviewing the embeddings coloured by label with TSNE, they seem to collapse after training, but they are initially reasonably separate when just using the .encode() method with the multilingual model. As such, the predictions then suffer. Is the set too small? I've tried many configurations but can't make any improvements to the results.
>
> You also mentioned a custom sampler; what approach did you take? Also, BatchHardSoftMarginTripletLoss raises an error when using a CPU, instructing me to install a GPU and NVIDIA driver.
