apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.65k stars 1.03k forks source link

Use BoostAttribute in in TokenFilters to denote Terms that QueryParser should give lower boosts [LUCENE-3130] #4203

Open asfimport opened 13 years ago

asfimport commented 13 years ago

A recent thread asked if there was anyway to use QueryTime synonyms such that matches on the original term specified by the user would score higher then matches on the synonym. It occurred to me later that a float Attribute could be set by the SynonymFilter in such situations, and QueryParser could use that float as a boost in the resulting Query. IThis would be fairly straightforward for the simple "synonyms => BooleamQuery" case, but we'd have to decide how to handle the case of synonyms with multiple terms that produce MTPQ, possibly just punt for now)

Likewise, there may be other TokenFilters that "inject" artificial tokens at query time where it also might make sense to have a reduced "boost" factor...

In all of these cases, the amount of the "boost" could me configured, and for back compact could default to "1.0" (or null to not set a boost at all)

Furthermore: if we add a new BoostAttrToPayloadAttrFilter that just copied the boost attribute into the payload attribute, these same filters could give "penalizing" payloads to terms when used at index time) could give "penalizing" payloads to terms.


Migrated from LUCENE-3130 by Chris M. Hostetter (@hossman), 1 vote, updated Feb 27 2015 Linked issues:

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi Hoss Man,

I don't think I agree that a boost attribute is the best way to implement this.

A QP can already solve this issue today, simply by boosting down terms with positionIncrement = 0. This would solve all of the cases you listed, without making these tokenstreams more complicated.

If such a QP really needs to know more than positionIncrement=0, then a better approach would be to set token types (need not be TypeAttribute, could be something more strongly-typed), to indicate synonym, phonetic variation, etc etc.

But I really think the implementation details of QP should remain in QP, the analysis chain should instead be general and describe up the text.

Otherwise, things get really confusing, e.g. what should a ShingleFilter do when it combines two tokens that have different BoostAttributes? But with types, this is no problem at all, because the ShingleFilter can simply set the type to 'shingle' and its unambiguous... its up to the consumer to do whatever it wants with this.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Furthermore: if we add a new BoostAttrToPayloadAttrFilter that just copied the boost attribute into the payload attribute, these same filters could give "penalizing" payloads to terms when used at index time) could give "penalizing" payloads to terms.

Again, I think this is at the wrong level. If you do what you describe, what if you then want to tweak the ranking for synonyms? You must reindex.

Instead, its far better to use TypeAsPayloadFilter and put the type into the payload. Then you can tweak scorePayload() to your hearts content to adjust the ranking without reindexing all documents.

asfimport commented 13 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

A QP can already solve this issue today, simply by boosting down terms with positionIncrement = 0.

That assumes: a) that every TokenFilter which might inject terms like this will always put the most important one first b) that the amount of boost should be fixed

what i'm suggesting is that we make this more flexible so that people wiring together their apps and analyzers have an easy way to guide the queryParsers behavior. if we have allow a well defined attribute for this people can have custom analysis that specify arbitrary boosts in cases we may not be able to specificly anticipate. (synonyms, entity recognition, common word demoting, etc..)

But I really think the implementation details of QP should remain in QP, the analysis chain should instead be general and describe up the text.

why don't you consider an attribute that denotes "this term is worth less then a typical term" a general description of the text?

Otherwise, things get really confusing, e.g. what should a ShingleFilter do when it combines two tokens that have different BoostAttributes?

It does whatever it already does when it encounters two tokens that may have attributes it doesn't know about (ignore them when creating the new token, if i remember correctly). Unrecognized attributes isn't a new problem.

If you do what you describe, what if you then want to tweak the ranking for synonyms? You must reindex.

how is that any different from any other aspect of index time synonyms? if you use them you always have to reindex when you change your synonyms.

I'm not arguing that index time synonyms is a good idea in general, i'm not arguing that this "we look for BoostAttributes on tokens" feature of the QP would be useful (or even a good idea for everyone). I'm arguing that having such a feature would provide an easy way for people who are alreayd customizing their analysis to easily modify/influence the behavior of the query parser (w/o subclassing) that could still easily work in conjunction with other techniques.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

why don't you consider an attribute that denotes "this term is worth less then a typical term" a general description of the text?

I would rather it be more descriptive: for example an attribute that notes this is a Synonym is useful. If its a named entity, mark it as a named entity.

But i don't want to see float values cranked into the tokenfilters, I think this is messy. I think the analysis process should instead describe up the text for the consumer (e.g. queryparser) to do with as they please: i.e. in this case its the queryparser's job to then turn this into some concrete query that does something magic: if thats downboosting synonyms, but maybe it would do something different, like drop them alltogether.

asfimport commented 13 years ago

Jan Høydahl (@janhoy) (migrated from JIRA)

The feature is absolutely needed. Probably it's enough to be able to specify a global term boost factor per query for all synonyms, so Robert's method would work for me.

Another usecase is Phonetic variants. Currently I use a separate field for phonetic normalization and include it with a lower weight in DisMax. If phonetic variant instead was stored alongside the original with posIncr=0 and tokenType=phonetic, I could instead specify a deboost factor for phonetic terms and even highlighting would work ootb!

Yet another is lower/upper case search. If the LowerCaseFilter would keep the original token and add a lowercased token on same posIncr with tokenType=lowercase, we could support case insensitive match with preference for correct case.

If user needs different boost for different fields, perhaps the TokenType name could be configurable on each filter.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Currently I use a separate field for phonetic normalization and include it with a lower weight in DisMax. If phonetic variant instead was stored alongside the original with posIncr=0 and tokenType=phonetic, I could instead specify a deboost factor for phonetic terms and even highlighting would work ootb!

This doesn't make any sense to me: how is this "better" shoved into one field than two fields? I don't see any advantage at all. field A with original terms and field B with phonetic terms is no less efficient in the index than having field AB with both mixed up, but keeping them separate keeps code and configurations simple.

As for the highlighting, that sounds like a highlighting problem, not an analysis problem. If its often the case that users use things like copyField and do this boosting, then highlighting in Solr needs to be fixed to correlate the offsets back to the original stored field: but we need not make analysis more complicated because of this limitation.

If the LowerCaseFilter would keep the original token and add a lowercased token on same posIncr with tokenType=lowercase, we could support case insensitive match with preference for correct case.

I don't think we should complicate our tokenfilters with such things: in this case I think it would just make the code more complicated and make relevance worse: often case is totally meaningless and boosting terms for some arbitrary reason will skew scores.

This is for the same reason as above. If you want to do this, I think you should use two fields, one with no case, and one with case, and boost one of them.

asfimport commented 13 years ago

Jan Høydahl (@janhoy) (migrated from JIRA)

Let's get back to the original issue: we need some way to let the "original" form of a term have higher weight than the alternative forms generated by analysis (whether those are synonyms, stems, lowercase or what have you).

Is tagging the added tokens with a tokenType, and then enabling the QParsers to act on these tokenTypes a viable way forward?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Let's get back to the original issue: we need some way to let the "original" form of a term have higher weight than the alternative forms generated by analysis (whether those are synonyms, stems, lowercase or what have you).

I'm not sure we do! see my last response. I think 2 fields is just fine.

As for things like synonyms, these already set TypeAttribute. So if your consumer wants to do something on synonyms, look for type = "<SYNONYM>" or whatever it already sets.

asfimport commented 13 years ago

Jan Høydahl (@janhoy) (migrated from JIRA)

Robert, two fields work great for supporting stuff like phonetic and stem/non-stem search, and also lower/exact-case search although index size could be lower with a one-field approach. Let's those use cases rest for now.

But for the synonym case, what remains is to modify the QueryParser to act on the already-present TypeAttribute, is that so? If so, let's open another issue for that.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But for the synonym case, what remains is to modify the QueryParser to act on the already-present TypeAttribute, is that so? If so, let's open another issue for that.

I think so? Though it might be more useful not to modify the core queryparser for this? The reason is that such a feature is geared towards synonyms and multi-word synonyms don't work well with it... So maybe instead to a simpler queryparser that does work well with multi-word synonyms by default?

asfimport commented 13 years ago

Jan Høydahl (@janhoy) (migrated from JIRA)

The core of this issue is providing a mechanism for deboosting synonyms, and as long as it works with single-term synonyms that at least covers what most people use today. I propose we handle that first. Agree that it would be nice with a query-parser which can handle multi word synonyms. But that could be handled incrementally in a separate issue.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Jan, not sure that most people only use single-term synonyms... if this is the case maybe we should rethink our synonyms implementation because multi-word adds a ton of complexity!

Another reason I suggested avoiding adding this to the core queryparser is because its going to be challenging to allow this optional boosting in a flexible way (just look at the getFieldQuery... its very hairy). I think in the ideal case, we somehow restructure all this code so that subclasses have more control over how the query is created... however I think this might be challenging just given how the code is structured now.

The reason I think it would be best exposed as a 'hook' to subclasses (versus adding a "deboost synonyms" option directly to the core QP), is that I think people are going to want to customize how this works, e.g. control it per-field and things like that.

At the end of the day, a queryparser could always subclass getFieldQuery completely and do this now, but thats not great either because the code is so hairy :(

This kind of feature might be easier to implement with the new queryparser in contrib, but I'm not sure.

asfimport commented 13 years ago

Jan Høydahl (@janhoy) (migrated from JIRA)

I'm not saying that most people use or want single-term synonyms, but since that's the only thing that works query-time with Solr now, most people accept that limitation, but it's still useful. I tend to split synonyms in two dictionaries - a separate one with multi term synonyms for use on index side. Not a perfect situation, but that's what we got and how people use it.

I'm not into the code, and I have no reason do doubt your judgement that the cost/benefit prohibits adding this to current QParser :-(

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Well, I don't think it prohibits it?

This kinda refactoring/feature would be really good, if we could refactor our queryparser stuff to make it easier to customize how queries are built, you know thats exactly the kind of thing we should be doing!

I'm just saying I think right now it would be painful to do with lucene's core QP given its current design.

asfimport commented 12 years ago

Jan Høydahl (@janhoy) (migrated from JIRA)

Adding a dependency to the contrib Flexible QueryParser.