UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
14.84k stars, 2.44k forks

cross-encoder/quora-distilroberta-base has false positive for short word #2919

Closed mrenlivex closed 1 week ago

mrenlivex commented 2 weeks ago

The similarity between

"What are the differences between iHealth Nexus Wireless Body Composition Scale (HS2S) (Nexus/Fit) and iHealth Nexus Pro Wireless Body Composition Scale (HS2S Pro) (NEXUS PRO)"

and

"FAQ"

is 0.9063025712966919, which looks like a false positive.
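For anyone trying to reproduce this, here is a minimal sketch of how the pair could be scored. It assumes the standard `CrossEncoder` class and its `predict` method from sentence-transformers. The helper names `load_model`, `score_pair`, and `main` are made up for illustration and are not part of the library. The exact score may vary with library and model versions.

```python
QUESTION = (
    "What are the differences between iHealth Nexus Wireless Body Composition "
    "Scale (HS2S) (Nexus/Fit) and iHealth Nexus Pro Wireless Body Composition "
    "Scale (HS2S Pro) (NEXUS PRO)"
)
CANDIDATE = "FAQ"


def load_model(name: str = "cross-encoder/quora-distilroberta-base"):
    """Hypothetical helper: load the cross-encoder named in this issue."""
    # Imported lazily so score_pair can be used with any object exposing predict().
    from sentence_transformers import CrossEncoder

    return CrossEncoder(name)


def score_pair(model, first: str, second: str) -> float:
    """Score a single (first, second) text pair and return a plain float."""
    # predict() takes a list of text pairs and returns one score per pair.
    scores = model.predict([(first, second)])
    return float(scores[0])


def main() -> None:
    # Downloads the model from the Hugging Face Hub on first use.
    model = load_model()
    print(score_pair(model, QUESTION, CANDIDATE))
```

Calling `main()` should print a single score for the pair above.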

tomaarsen commented 1 week ago

Hello!

Dense NLP models remain black boxes that can make mistakes, especially when the inputs differ a lot from those encountered during training. The model in question (https://huggingface.co/cross-encoder/quora-distilroberta-base) was trained on Quora questions, so it's not too surprising that it doesn't do well with an input like:

"FAQ"

Perhaps you'll have better luck with another cross-encoder, but there will likely always be outliers where a model gives odd results. I'll close this for now, as I don't think we can fix this issue outright.
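If you want to try other cross-encoders on the same pair, a rough sketch could look like the following. The model names in `CANDIDATE_MODELS` are assumptions about what is available on the Hugging Face Hub and are not recommendations. `compare_models` and `load_cross_encoder` are hypothetical helpers, not library functions. Scores from models trained on different tasks (for example duplicate detection versus semantic similarity) are not directly comparable.

```python
from typing import Callable, Dict, Iterable, Tuple

# Assumed model names; swap in whichever cross-encoders you want to test.
CANDIDATE_MODELS = [
    "cross-encoder/quora-distilroberta-base",
    "cross-encoder/quora-roberta-base",
    "cross-encoder/stsb-distilroberta-base",
]


def load_cross_encoder(name: str):
    """Hypothetical helper: load a cross-encoder by its Hub name."""
    from sentence_transformers import CrossEncoder

    return CrossEncoder(name)


def compare_models(
    pair: Tuple[str, str],
    model_names: Iterable[str],
    load: Callable[[str], object] = load_cross_encoder,
) -> Dict[str, float]:
    """Score the same text pair with each model and collect the results."""
    results: Dict[str, float] = {}
    for name in model_names:
        model = load(name)
        # predict() returns one score per input pair; we pass a single pair.
        results[name] = float(model.predict([pair])[0])
    return results
```

For example, `compare_models((long_question, "FAQ"), CANDIDATE_MODELS)` would return one score per model, where `long_question` stands for the question text from the report. Spot-checking a few short inputs this way may help you pick a model, though it won't rule out outliers.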