elastic / elasticsearch-definitive-guide

The Definitive Guide to Elasticsearch
https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html

Cross_fields guidance bad? #587

Open markharwood opened 8 years ago

markharwood commented 8 years ago

In this section there is this claim:

In other words, it looks up the IDF of smith in both the first_name and the last_name fields and uses the minimum of the two as the IDF for both fields.

This is nearly right: the minimum DF is taken, but the "wrong" fields (e.g. surname:peter) get minDF+1 so that the "right" field (forename:peter) scores more highly (we add one to DF because a higher DF lowers the IDF score). This subtle scoring tweak is intended to prefer the right field, so cross_fields could perhaps be called most_likely_field.
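To make the +1 adjustment concrete, here is a minimal Python sketch that follows the description above literally. It is not the Elasticsearch implementation: the function names, the document counts, and the rule "the field with the highest raw DF is the right one" are all assumptions made for illustration, and the IDF formula is the classic Lucene-style `1 + ln(N / (df + 1))`.

```python
import math
from typing import Dict


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style IDF: a higher DF yields a lower IDF."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def blend_doc_freqs(field_dfs: Dict[str, int]) -> Dict[str, int]:
    """Hypothetical sketch of the adjustment described in the comment.

    Every field is given the minimum DF, as the comment states.
    Assumption: the field where the term is most frequent is the
    "right" one and keeps that base DF; all other fields get base + 1.
    """
    base = min(field_dfs.values())
    likely = max(field_dfs, key=field_dfs.get)
    return {f: base if f == likely else base + 1 for f in field_dfs}


# Made-up raw document frequencies for the term "peter".
raw = {"forename": 500, "surname": 20}
blended = blend_doc_freqs(raw)
idfs = {f: classic_idf(df, 100_000) for f, df in blended.items()}

# The "right" field ends up with the slightly higher IDF.
print(blended)
print(idfs)
```

The difference between the two IDF values is deliberately tiny: it only acts as a tie-breaker in favour of the more plausible field.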

All of this subtle scoring tweaking is undone if the user is encouraged to provide field-level boosts, as in this example. When users provide boosts, they and the algorithm are fighting over who knows best. Cross_fields examines each of the words provided to come up with the correct context for each of them. Meanwhile, the user steps in with a broad-brush preference that title fields are generally better than description fields, which overrides the subtle +1 tweaks that ensure the correct interpretation of each word wins.
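A rough numeric illustration of that conflict, under simplifying assumptions: the score of a field is taken to be just `boost * idf` (real scoring also involves term frequency and norms), the DF values are invented, and `best_field` is a hypothetical helper, not an Elasticsearch API.

```python
import math

NUM_DOCS = 100_000  # made-up index size


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style IDF: a higher DF yields a lower IDF."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Assumed blended DFs for one query word: "description" is the likely
# field (base DF), "title" is the unlikely one (base DF + 1).
blended_df = {"title": 21, "description": 20}


def best_field(boosts: dict) -> str:
    """Return the field with the highest simplified score (boost * IDF)."""
    scores = {
        field: boosts.get(field, 1.0) * classic_idf(df, NUM_DOCS)
        for field, df in blended_df.items()
    }
    return max(scores, key=scores.get)


no_boost = best_field({})                # the +1 tweak decides
with_boost = best_field({"title": 2.0})  # like "title^2" in the query

print(no_boost, with_boost)
```

Without boosts the likely field wins by a small margin; a boost of 2 on the other field is far larger than that margin, so the per-word preference is lost.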

In implementing cross_fields I actually wanted to throw an error if the user attempted to provide field-level boosts: it's like having two people fighting over a steering wheel. However, it was felt that in rare situations cross_fields could be (ab)used to flatten the inequalities in IDF before applying a user's desired field-level boosts. This is not the primary use case for cross_fields, and in my view we shouldn't promote it.
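Since no such error is raised, a client-side check is one way to catch the combination. The sketch below is a hypothetical helper, not part of Elasticsearch or any client library; it only assumes the standard `field^boost` syntax in the `fields` list of a `multi_match` query body.

```python
def boosted_cross_fields(multi_match: dict) -> list:
    """Return the boosted field specs if this is a cross_fields query.

    Hypothetical lint helper: an empty list means nothing to flag.
    """
    if multi_match.get("type") != "cross_fields":
        return []
    # "title^2" style entries carry an explicit field-level boost.
    return [field for field in multi_match.get("fields", []) if "^" in field]


# Body of a multi_match query that mixes cross_fields with a boost.
query = {
    "query": "peter smith",
    "type": "cross_fields",
    "fields": ["title^2", "description"],
}

flagged = boosted_cross_fields(query)
if flagged:
    print("cross_fields used with field-level boosts:", flagged)
```

Whether to warn or reject is a policy choice; warning leaves room for the rare IDF-flattening use mentioned above.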

The problem is that, despite a lot of documentation, people don't appreciate these subtleties, or that boosts should only be applied by those who really know what they are doing. Yesterday I was looking at a user's relevance ranking issue: they had followed the advice here, which mixes cross_fields with boosting, and were getting poor results.

Multi-field search is a nightmare of ranking issues, and perhaps a good reason to hang on to the _all field rather than promoting its demise.

polyfractal commented 8 years ago

Ah, interesting, I wasn't aware of this subtlety. I agree we should fix how the Guide describes it, remove that boosting example, and add a warning section about the dangers of field boosting when using cross_fields.

I'll work up a PR and ping you when it's done to see what you think.

markharwood commented 8 years ago

Great, thanks