apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.64k stars 1.03k forks source link

always apply position increment gap between values [LUCENE-2529] #3603

Closed asfimport closed 14 years ago

asfimport commented 14 years ago

I'm doing some fancy stuff with span queries that is very sensitive to term positions. I discovered that the position increment gap on indexing is only applied between values when there are existing terms indexed for the document. I suspect this logic wasn't deliberate, it's just how its always been for no particular reason. I think it should always apply the gap between fields. Reference DocInverterPerField.java line 82:

if (fieldState.length > 0) fieldState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);

This is checking fieldState.length. I think the condition should simply be: if (i > 0). I don't think this change will affect anyone at all but it will certainly help me. Presently, I can either change this line in Lucene, or I can put in a hack so that the first value for the document is some dummy value which is wasteful.


Migrated from LUCENE-2529 by David Smiley (@dsmiley), resolved Sep 28 2010 Environment:

(I don't know which version to say this affects since it's some quasi trunk release and the new versioning scheme confuses me.)

Attachments: LUCENE-2529_always_apply_position_increment_gap_between_values.patch, LUCENE-2529_nonsenseIncrements.patch, LUCENE-2529_skip_posIncr_for_1st_token.patch (versions: 3), LUCENE-2529_test.patch Linked issues:

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I agree it's weird that the if checks fieldState.length instead of i; we should fix it, though, this is a subtle break in back-compat.

Once we do #3385 (fully decouple indexing & analysis) and #3524 (write-once attr bindings in the analyzer chain), this sort of fix would be nicely "external" to Lucene. Ie the semantics of how positions / offsets increment across boundaries of multiple fields would be fully determined by the app/analyzer, not baked into Lucene's core.

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

How does this change backwards compatibility? The index format is certainly the same. And there's no documented or implied contract I'm breaking with this change in increment gap behavior. I'd be really surprised to learn that anyone actually depends on the current behavior. When I went into the code I was figuring it might smartly only apply the increment gap if the offset changed from one value to another but it doesn't even do that, and so the current algorithm strikes me as a bit arbitrary. Seems like a safe one-liner commit to me. But then I'm biased as the reporter ;-)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Well, if an app has a multi-valued field today where the first N (> 1) values analyze to 0 tokens, then this change will alter the positions of subsequent tokens.

Still, I agree, it seems unlikely that an app is relying on this... so I think we can just break it (and advertise that we did so, in CHANGES under back compat breaks).

Note that offsets also have logic that avoids adding the offset gap if there were no tokens; but it's slightly different since it will not add the gap if the current value in the multi-valued field had no tokens (whereas the logic for the position gap is only if we've seen net 0 tokens so far). Seems like we should also fix offset to always add the gap?

Wanna cons up a patch...?

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I changed the conditional as described in this issue – a one-liner.

I have no comment on the "offset gap" as I'm not familiar with the logic you speak of, Michael.

asfimport commented 14 years ago

Koji Sekiguchi (@kojisekig) (migrated from JIRA)

I'll see this issue and #3742 all together.

asfimport commented 14 years ago

Koji Sekiguchi (@kojisekig) (migrated from JIRA)

trunk: Committed revision 1001796 and 1001957. branch_3x: Committed revision 1001991.

Thanks, David!

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I've been so busy; I intended to return to this issue later in the week. Unfortunately this issue isn't completely resolved in my testing. It's good that the position increment gap is always applied now, but there is some other increment logic in this class that foils my attempts to coordinate the positions across multiple fields, particularly when the first value is non-existent for a multi-valued field. I'll post more about it later.

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Always adding the position increment is good but insufficient to solve my problem.

A new patch rectifies the followup situation I reported inadvertently to #3742 that I should have said here. The jist is that DocInverterPerField conditionally decrements the position and then always increments it, and this is problematic for attempting to keep position increments across several multi-value fields aligned (using an analyzer setting posIncr to 0) when the first value generates no tokens (either blank or stop words). Mike McCandless pointed out that the unfortunate existing logic had to do with preventing the position from becoming 1 which doesn't work with payloads - #2616.

My new patch here doesn't even have a pre-decrement nor post-increment and thus I find the code easier to follow. It ignores the provided position increment of the first token (typically 1), voiding the need to shift them back and forth. There is one oddity included here and that is I always add 1 to the position increment gap (i.e. between values). With this oddity included, all the tests pass (except for the test for this very issue, which I correct in this patch) --yay! Without this oddity, a handful of tests failed that depended on the first token adding one to the position. My +1 up at the value loop can be seen as actually enforcing that the first token's position is 1, and also adding a +1 for when there is no token for a value (critical for aligning multiple fields). Perhaps this +1 should happen at a different line number to be less confusing but the end result should be the same.

I expect for many people this is very confusing, especially if you're not knee deep in this subject as I am presently. Mike, hopefully you're understanding what I'm up to here. The tests pass, remember.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I like the idea of disregarding the posIncrGap of the first token.

Maybe, instead of that +1 inside IW, we change the default posIncrGap to 1?

Can you spell out examples of how the indexed positions will change w/ this patch – I'm having trouble visualizing this. EG for a single valued field, multi-valued, etc.

For one, because we now ignore the token's posIncrGap if it's the first token (per value in the field), any tokenizer that alters the posIncr from the default (1) will now see different positions indexed, right?

Man I really want to get this logic out of indexer and into the analysis chain (LUCENE-2450 enables this). How multi-valued streams should handle the transition from one value to another shouldn't be inside the indexer... and maybe (someday) tokens should store their position (not the gap) so we don't have this cryptic logic inside the indexer...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Duh, sorry I meant "I like the idea of disregarding the posIncr of the first token" (not the posIncrGap).

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

(patch updated)

Maybe, instead of that +1 inside IW, we change the default posIncrGap to 1?

I had the +1 for the gap (i.e. between values) level because I was trying to get a blank value (or a value consisting of stop words) to bump the position counter as well. I've been tinkering with this a bit more and I realize now that I can still achieve my aims without doing that, but it's still necessary to ignore the very first position increment of the very first value – only. See the new patch. I think the result now should be even more amenable to others (i.e. is least disruptive) since anyone messing with the position increment of the first token of subsequent values will still be honored.

Can you spell out examples of how the indexed positions will change w/ this patch - I'm having trouble visualizing this. EG for a single valued field, multi-valued, etc.

A single valued field is unaffected. The first emitted token (if there are any at all) will remain at position 0 no matter what the analyzer does. This is also true for the first value of a multi-valued field if there is any.

For multi-valued fields, it is now always the case that the first token of subsequent values (e.g. not the first value) will be the previous position (0 if none) + the gap + the first position increment of this value (typically 1). This is consistent and sensible. Formerly, if the first value was a blank value (or a value consisting of stop words), then you'd get 1 less than what you get now. I hope the test I modified as part of this patch makes this more clear; I had to increment the tested positions by 1.

As I said before, I also think that the code is more clear since it no longer has that conditional pre-decrement and post increment of the position that was probably only understood by you. And I did away with the weird "+1" at the gap in my previous patch.

Man I really want to get this logic out of indexer and into the analysis chain (LUCENE-2450 enables this). How multi-valued streams should handle the transition from one value to another shouldn't be inside the indexer... and maybe (someday) tokens should store their position (not the gap) so we don't have this cryptic logic inside the indexer..

That sounds great. There are other strategies of messing with position increments that I simply can't do without hacking this code further. For example, it would be neat if the first token of a value could be devised to start at posIncGap*valueIndex (ex: 0, 1000, 2000, ...) so that Span queries could determine which value index a term matched against by looking at it's position (ex: 3092: divide by 1000, drop remainder, add 1: the 4th value ).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

but there is some other increment logic in this class that foils my attempts to coordinate the positions across multiple fields, particularly when the first value is non-existent for a multi-valued field.

I don't think this will ever completely work correctly though, for example the only way to "accumulate" positions across a field that ends with a stopword / all stopwords would be to add a positional equivalent to end() that we did for offsets, so that highlighting would work for multi-valued fields. otherwise reset() -> clearAttributes() and the PositionIncrementAttribute is "lost"

I think if you are trying to coordinate positions across multi-valued fields, it probably means you shouldn't be using a multivalued field. It seems to only make sense to put multiple independent values in a multi-valued field, not ones that are "continuations" of each other. By definition if they are continuations of each other, I think its better to index the whole thing as single-valued... and you have complete control of the positions within your tokenstream.

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Rob, I don't completely follow your first paragraph, but this patch I posted does work correctly for what I'm doing. It is one part of the puzzle in searching multi-valued fields at coordinated positions. The other part is span queries with field masking. For my problem space, I'm willing to sacrifice the ability to do phrase queries. I can't do them because my solution forces all position increments to 0, but using a gap.

My patch here (and the patch already applied by Koji recently) for this issue isn't really code specific to the problem I'm solving, but it is necessary for my approach, and I think makes the logic in this class easier to follow and more consistent with respect to the behavior of the position increment gap and handling of the very first position increment. All existing tests pass. On the basis of that alone, I'm hopeful that you, Michael, and other committers are amenable to applying this patch.

I acknowledge I could try to operate completely within a single value and thereby have more control over the position increments. That's an idea I didn't think of, thanks. But I believe the point I just made on my patch being an improvement stands.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Rob, I don't completely follow your first paragraph

What i was trying to say, is that there's no way for positions to be properly accumulated across multi-valued fields. for example (i will use the pipe as a field separator and assume english stopwords):

brown fox | went to | market

In this case the index will "lose" the 2 position increments caused by "went", and "to", and they won't be reflected in the "market" position.

My suggestion is that if you have values like this with position dependencies, they are really one single value, not independent values, and don't belong in a multivalued-field.

In this case, if you simply index the entire content as one field, and in your tokenstream handle the separator however you want, and the "market" token will properly reflect whatever you previously did with the tokens, either via that separator and/or stopwords or other things.

For my problem space, I'm willing to sacrifice the ability to do phrase queries.

Right, but my concern is that other users are not. I don't think we should discard the first token's position increment value completely, will the QueryParser do this too?

My patch here (and the patch already applied by Koji recently) for this issue isn't really code specific to the problem I'm solving, but it is necessary for my approach

The previous patch (the one described on the issue) I definitely agreed with. But what you speak of here (discarding the first token's position) is different, and I'm not convinced its necessary for your approach (you could use a single-valued field).

All existing tests pass. On the basis of that alone, I'm hopeful that you, Michael, and other committers are amenable to applying this patch.

Well, unfortunately (not your fault at all!) that isn't very comforting to me. For example, the queryparser has very minimal tests wrt this sorta stuff, yet as I mentioned above its important to think about how it consumes tokenstreams, because if its inconsistent with the indexer then queries start returning less results.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here's a patch, showing what i mean about why it is bad to discard the first position.

The test uses SpanFirstQuery (imagine a custom queryparser that supports a "starts-with" operator).

the test passes on trunk, but fails if i apply the patch to discard the position of the first term.

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

My suggestion is that if you have values like this with position dependencies, they are really one single value, not independent values, and don't belong in a multivalued-field.

My use-case is not like your example. The value at index x has no relation to values before or after it in the same field. It does have a relationship to a separate multi-valued field's value at the same value index. With this patch and an analyzer that sets all position increments to 0, all tokens for the same value index across both fields will have the same token position.

(me) For my problem space, I'm willing to sacrifice the ability to do phrase queries.

Right, but my concern is that other users are not. I don't think we should discard the first token's position increment value completely, will the QueryParser do this too?

The fact that my entire solution (for which this patch is a subset) can't do phrases is not evident in this patch. Perhaps I should have kept my overall use case a secret so as not to cloud the purpose of this patch. This patch is about generating predictable/intuitive/sensible (in my opinion) position values, particularly at the very start of a field. I hope you would share my opinion if you apply the patch and examine the results such as what the test case the patch modifies does.

In my opinion (and apparently Mike McCandless – "I like the idea of disregarding the posIncrGap of the first token.") I disagree. The point of a position increment gap is only meaningful between values. It's meaningless for the first token. At your suggestion I looked in QueryParser.java to look for problems that may have to do with the position increment. I don't see any problems that would be introduced by this patch. For convenient reference here, the position increment is grabbed at line 608:

int positionIncrement = (posIncrAtt != null) ? posIncrAtt.getPositionIncrement() : 1;
if (positionIncrement != 0) {
  positionCount += positionIncrement;
} else {
  severalTokensAtSamePosition = true;
}

The ensuing logic seems pretty sane to me (albeit complicated). The only thing in here that could arguably be improved is line 645:

if (positionCount == 1 || (!quoted && !autoGeneratePhraseQueries)) {

I think the "== 1" should be "<= 1" just in case there's some intentional oddities in the analyzer that 99.9% of people wouldn't ever do. I'm in the exception case. But I have a different query time analyzer so I don't care that much.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The point of a position increment gap is only meaningful between values.

Thats not true, see my test for an example. Its a position increment, in my example I show how you would implement a "starts-with" operator consistent with how PhraseQuery works by default: it respects the "holes" from stopwords for better precision.

For example in the English language, lots of sentences start with stopwords so I think its a pretty big deal to just arbitrarily throw out the position of the first token in the document.

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I think you're right Rob, I didn't think of that at all. I altered the patch further, and included your test as a part of it. I also added a little to the existing test I patched so that I consider a leading stop word, which is really the jist of what you are drawing attention to in your patch. The tests pass. This patch is retains the spirit of the earlier patch, but it honors the first position increment, expecting it to be >= 1. If it's 0, then the fieldState.position would end up being -1 but there's a quick check here that will correct it to be 0. That suits my needs, and the whole result I think is clearer and more consistent than the existing code & behavior.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

but it honors the first position increment, expecting it to be >= 1. If it's 0, then the fieldState.position would end up being -1 but there's a quick check here that will correct it to be 0.

Thank you... in my opinion here though (the zero case), we should just throw a hard Exception!

Because if the first token in a field has posIncrement = 0, this is meaningless, it means it has some negative position (it is floating around in the gap or before the field at all)

Personally I would prefer if any of the code inside lucene does something this stupid, that we get a hard test fail rather than a cover-up!

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

attached is a patch for BaseTokenStreamTestCase that tests if the first value has a posInc=0 (nonsensical position increment).

The synonymfilter from solr fails with the test, but I don't really care... there is no point in indexing your text if you are going to put useless values in the index... I still think we should throw hard exception here.

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Just to see how the tests fared with your suggestion, I commented out my "correction" of position from -1 to 0, and instead put in: assert fieldState.position >= 0 : "LUCENE-1542, pos < 0 doesn't work with payloads";

Some lucene tests failed: TestIndexWriter.testNegativePositions TestPositionIncrement.testSetPosition and testPayloadsPos0

I ran the Solr tests and didn't see failures that had to do with the position. No synonym failure like you reported.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I ran the Solr tests and didn't see failures that had to do with the position. No synonym failure like you reported.

Right, you need my assert to catch that...

So i really wish we could fix this, its a mess. An initial position increment of zero is meaningless, but lots of peoples analysis chains do it... the whole PositionIncrementAttribute is really broken.

e.g. just imagine a synonym-filter that injects "hte" for "the" followed by a stopfilter with setEnablePositionIncrements(false)

at the moment i don't see how we can remove the "turn zero into one" hack easily... i don't like it.

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I sympathize with your point of view Rob; arguably a position of < 0 is illogical and an error. If the current behavior should stay then we can at least put a comment in the code here acknowledging the illogical nature of it.

How do you feel about committing my latest patch, with the addition of a suitable comment just mentioned?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

How do you feel about committing my latest patch, with the addition of a suitable comment just mentioned?

well honestly now that LUCENE-2529_skip_posIncr_for_1st_token.patch doesnt skip the posIncr for the first token, I'm having a tough time seeing what it changes?

I do like that when we do the "hack" correction of "turn zero into one" that the numOverlaps is now consistent... but what else changes?

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Quoting myself earlier:

For multi-valued fields, it is now always the case that the first token of subsequent values (e.g. not the first value) will be the previous position (0 if none) + the gap + the first position increment of this value (typically 1). This is consistent and sensible. Formerly, if the first value was a blank value (or a value consisting of stop words), then you'd get 1 less than what you get now. I hope the test I modified as part of this patch makes this more clear; I had to increment the tested positions by 1.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

David, sorry to make you repeat yourself.

I'm not sure about this change though, for example it breaks contrib/highlighter tests (I didnt look at the test to see more details).

However, if we decide to do it in the future, I think we should remove these special checks from the main loop: For example, instead of:

if (firstToken && i == 0)//i.e. this is the very first token we emit for this field in this document
              fieldState.position--;//we want to start at 0, not 1

this could check hasMoreTokens && i == 0 before the loop, so its not checked for every token in the document. in a similar sense I think the "correct < 0 to 0" check should probably be outside of the loop, since it can only really happen for the first term.

asfimport commented 14 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Yes, the check could move outside the loop. I don't think so for the "correct < 0" but whatever. And the Highlight test fails because the test is based off of a multi-valued field in which the first value is blank, thus triggering the subsequent positions to be off by one from the former behavior. A small readability improvement might be to set fieldState.position to -1 instead of decrementing it because that is exactly the intended effect, which is more clear.

FYI the observable change in behavior in this patch principally results from the "firstToken" boolean part of the condition in which the position is decremented. Remove that part of the condition leaving simply i==0 and you get the former/existing behavior.

Ugh, I've thought about this issue way too much and I'm suddenly having a change of opinion. Arguably the current effect of positions is more consistent... it is as if the starting position (prior to looking at the position increment) is -1, no matter whether the first value has tokens or not. I'm going to achieve my objectives for my project via your suggestion of a single field with a value separator and a custom analyzer. It's a little hacky but I'll then have full control over the positions and the field isn't stored any way. I wish I had the ability to control positions in the indexer here a bit more but the support really isn't there and won't be until perhaps #3524, the issue Mike referenced.

As far as my patch goes, with the small modification to get the existing behavior, I think it's an improvement to readability over the current code. I can submit an update if desired.

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Bulk close for 3.1