Closed andy1xx8 closed 11 months ago
If the query is identical to one of the passages, in ANY language and not just yours, that passage is not ranked first; it is ranked last with an oddly low score. This includes English WHEN using the optimised multilingual models.
But you won't observe this problem with the optimised English-only models on English text: there, an identical passage gets ranked appropriately.
We observed this early on with the multilingual model.
[Solution] Semantically speaking, this may sound like a challenge, but practically speaking a query identical to a passage isn't a realistic scenario. But if users of your system can type queries identical to passages, and/or this use case is important to you, you have 3 options:
1. Check `if query == passage` yourself; manually pushing the identical passage to the top isn't hard.
2. Fine-tune bert-base-multilingual-cased on custom Vietnamese data and use it for ranking. Fair warning: the model is 700MB. You might need a GPU for faster inference.
3. Fine-tune BAAI/bge-reranker-base on custom Vietnamese data, using this script, and use it for ranking. Fair warning: the model is 1.1GB. You might need a GPU for faster inference.

Wish you the best.
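For the `if query == passage` option, here is a minimal sketch of what the manual push could look like. It is not part of the library: `promote_exact_matches` and `_normalize` are hypothetical helper names, and it assumes the reranker returns a list of dicts with a "text" key, already sorted by score. The Unicode normalization step is an added assumption too, since Vietnamese text can arrive in composed or decomposed form and a plain `==` would miss those matches.

```python
import unicodedata


def _normalize(text: str) -> str:
    """NFC-normalize, casefold, and collapse whitespace before comparing."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def promote_exact_matches(query, results, text_key="text"):
    """Move passages identical to the query to the front of the results.

    `results` is assumed to be the reranker output: a list of dicts,
    already sorted by score. Relative order is preserved within the
    promoted group and within the remaining passages.
    """
    target = _normalize(query)
    exact = [r for r in results if _normalize(r[text_key]) == target]
    rest = [r for r in results if _normalize(r[text_key]) != target]
    return exact + rest


# Illustrative data only (made-up scores), mimicking the reported behaviour:
# the passage identical to the query comes back last.
reranked = [
    {"id": 1, "text": "Hà Nội là thủ đô của Việt Nam", "score": 0.91},
    {"id": 2, "text": "Phở là món ăn nổi tiếng", "score": 0.40},
    {"id": 3, "text": "Xin chào Việt Nam", "score": 0.02},
]
fixed = promote_exact_matches("Xin chào Việt Nam", reranked)
print([r["id"] for r in fixed])
```

Scores are left untouched here; if downstream code filters on the score, you may also want to overwrite it for the promoted passages.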
Hi, I'm trying to test your ranking with Vietnamese data and got a weird result like this:
The last result is exactly the same as the query but got the lowest score. This is so weird to me.
Here is the code snippet to produce the above result