Open markharwood opened 8 years ago
Ah interesting, I wasn't aware of this subtlety. I agree we should fix up how the Guide describes it (and remove that boosting example). And add a warning section about the dangers of field-boosting when using cross-fields.
I'll work up a PR and ping you when it's done to see what you think.
Great, thanks
In this section there is this claim:
This is nearly right: the minimum DF is taken, but the "wrong" fields (e.g. surname:peter) get minDF+1 so that the "right" field (forename:peter) is scored more highly (we add one to the DF because a higher DF lowers the IDF score). This subtle scoring tweak is intended to prefer the right field, and so `cross_fields` could perhaps be called `most_likely_field`.

All of this subtle scoring tweaking is undone if the user is encouraged to provide field-level boosts as in this example. When users provide boosts, they and the algorithm are in a fight as to who knows best. Cross_fields is examining the detail of each of the words provided to come up with the correct context for each of them, and meanwhile the user is stepping in with some broad-brush preference that `title` fields are generally better than `description` fields, which overrides the subtle +1 tweaks that ensure the correct interpretation of each word wins.

In implementing cross_fields I actually wanted to throw an error if the user attempted to provide field-level boosts. It's like having two people fighting over a steering wheel. However, it was felt that in rare situations cross_fields could be (ab)used to flatten the inequalities in IDF before then applying a user's desired field-level boosts. This is not the primary use case for cross_fields and in my view we shouldn't promote it.
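To make the conflict concrete, here is a rough Python sketch of the tweak as described in this issue. It is not the actual Lucene/Elasticsearch code. The function names, the document counts, the classic-style IDF formula, and the rule that the "right" field is the one where the term has the highest raw DF are all assumptions made for illustration.

```python
import math

def blended_dfs(raw_dfs):
    """Per-field DFs after the tweak described in this issue (illustrative only).

    raw_dfs maps field name -> raw document frequency of the term in that field.
    Assumption: the "right" field is the one where the term is most common.
    """
    base = min(raw_dfs.values())                 # shared DF, as described above
    likely_field = max(raw_dfs, key=raw_dfs.get)
    # The likely field keeps the shared DF; every other field gets DF + 1.
    return {f: base if f == likely_field else base + 1 for f in raw_dfs}

def idf(df, num_docs):
    # Classic TF-IDF style formula, assumed here for illustration.
    return 1.0 + math.log(num_docs / (df + 1))

def field_scores(raw_dfs, num_docs, boosts=None):
    """IDF-only score per field, multiplied by an optional field-level boost."""
    boosts = boosts or {}
    return {
        f: boosts.get(f, 1.0) * idf(df, num_docs)
        for f, df in blended_dfs(raw_dfs).items()
    }

# Hypothetical numbers: "peter" is common as a forename, rare as a surname.
raw = {"forename": 100, "surname": 5}

unboosted = field_scores(raw, num_docs=1000)
boosted = field_scores(raw, num_docs=1000, boosts={"surname": 2.0})

print(max(unboosted, key=unboosted.get))  # forename
print(max(boosted, key=boosted.get))      # surname
```

Without boosts the +1 nudge is just enough to make forename:peter win. A modest user boost on the other field swamps that nudge completely, which is the steering-wheel fight described above.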
The problem is that, despite a lot of documentation, people aren't appreciating these subtleties, or that boosts should only be applied by those who really know what they are doing. Yesterday I was looking at a user's relevance ranking issue: they had followed the advice here, which mixed cross_fields with boosting, and were getting poor results.
Multi-field is a nightmare of ranking issues and perhaps a good reason to hang on to the `_all` field rather than promoting its demise.