apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Start offset going backwards has a legitimate purpose [LUCENE-8776] #9820

Open asfimport opened 5 years ago

asfimport commented 5 years ago

Here is the use case where startOffset can go backwards:

Say there is a line "Organic light-emitting-diode glows", and I want to run span queries and highlight them properly. 

During index time, light-emitting-diode is split into three words, which allows me to search for 'light', 'emitting' and 'diode' individually. The three words occupy adjacent positions in the index, as 'light' adjacent to 'emitting' and 'light' at a distance of two words from 'diode' need to match this word. So, the order of words after splitting are: Organic, light, emitting, diode, glows. 

But, I also want to search for 'organic' being adjacent to 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 

The way I solved this was to also generate 'light-emitting-diode' at two positions: (a) In the same position as 'light' and (b) in the same position as 'glows', like below:

pos  term(s)
0    organic
1    light, light-emitting-diode
2    emitting
3    diode, light-emitting-diode
4    glows

The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets are obviously the same. This works beautifully in Lucene 5.x in both searching and highlighting with span queries. 

But when I try this in Lucene 7.6, it hits the condition "Offsets must not go backwards" at DefaultIndexingChain:818. This IllegalArgumentException is being thrown without any comments on why this check is needed. As I explained above, startOffset going backwards is perfectly valid, to deal with word splitting and span operations on these specialized use cases. On the other hand, it is not clear what value is added by this check and which highlighter code is affected by offsets going backwards. This same check is done at BaseTokenStreamTestCase:245. 
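
To make the setup concrete, here is a minimal sketch (for illustration only, not our actual analysis chain) of a TokenStream that emits the layout above for "Organic light-emitting-diode glows"; the second light-emitting-diode token re-uses startOffset 8, which is exactly what the check at DefaultIndexingChain:818 rejects:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Emits: organic, light, light-emitting-diode (same position as light), emitting,
// diode, light-emitting-diode (same position as diode, startOffset jumps back to 8), glows.
final class BackwardOffsetsTokenStream extends TokenStream {
  // {term, positionIncrement, startOffset, endOffset}
  private static final Object[][] TOKENS = {
      {"organic", 1, 0, 7},
      {"light", 1, 8, 13},
      {"light-emitting-diode", 0, 8, 28},
      {"emitting", 1, 14, 22},
      {"diode", 1, 23, 28},
      {"light-emitting-diode", 0, 8, 28},  // startOffset 8 < previous startOffset 23
      {"glows", 1, 29, 34},
  };

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private int upto;

  @Override
  public boolean incrementToken() {
    if (upto == TOKENS.length) {
      return false;
    }
    clearAttributes();
    Object[] t = TOKENS[upto++];
    termAtt.setEmpty().append((String) t[0]);
    posIncAtt.setPositionIncrement((Integer) t[1]);
    offsetAtt.setOffset((Integer) t[2], (Integer) t[3]);
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    upto = 0;
  }
}
```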

I see others talk about how this check found bugs in WordDelimiter etc. but it also prevents legitimate use cases. Can this check be removed?


Migrated from LUCENE-8776 by Ram Venkat, updated Oct 26 2021 Attachments: LUCENE-8776-proof-of-concept.patch

asfimport commented 5 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The check is important for several reasons:

asfimport commented 5 years ago

Ram Venkat (migrated from JIRA)

Robert - Thanks for the quick, detailed response. 

If negative deltas of offsets break postings, shouldn't the check be included only if postings are used?

If performance gets worse for large documents, isn't it better to just log a warning, rather than completely remove that feature? Net performance depends on other factors like hardware, right?

We use the default highlighter with term vectors. Functionally, offsets going backwards works well as I explained in my previous post. We do extensive performance tests and we do not have an issue there either. 

At this point, we are forced to remove this check and recompile the source. Instead, can we move this check to where postings are used?

 

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I'd rather keep this check for the reasons Robert mentioned. Only enabling the check when offsets are indexed doesn't sound like a great trade-off to me as we'd be accepting a broken token stream that would only not cause trouble because offsets are not indexed. Can you either give light, emitting and diode the same offsets as "light-emitting-diode" or give the two additional "light-emitting-diode" tokens you are introducing the same offsets as "light" for the first one and "diode" for the last one?

asfimport commented 5 years ago

Robert Muir (@rmuir) (migrated from JIRA)

If performance gets worse for large documents, isn't it better to just log a warning, rather than completely remove that feature? Net performance depends on other factors like hardware, right?

No, as computer scientists we don't write bad algorithms and tell people to buy more hardware. And as a library, logging anything, especially logging a warning rather than enforcing the contract, is the wrong thing to do.

At this point, we are forced to remove this check and recompile the source. Instead, can we move this check to where postings are used?

no.

asfimport commented 5 years ago

Ram Venkat (migrated from JIRA)

Adrien - That will not work as searching for "organic adjacent to light" would highlight the entire word "light-emitting-diode" instead of just "light". And only light or diode gets highlighted when light-emitting-diode is given the same offset as light or diode (when you search for light-emitting-diode). 

Robert,

We are not writing any new 'bad' algorithm. We have been using this feature for a while. Allowing offsets to go backwards has been a feature of Lucene for a long time. This check and exception broke that feature.

And, no, I am not asking anyone to buy more hardware. It's just a figure of speech to say that the net performance depends on many factors, and a certain part of the code being order n-squared may or may not affect the net performance, due to many other factors. In our case, it does not. That is all the point I want to make.

Removing a long-standing feature in Lucene because (a) it affects a newer feature (postings) which is used by some people or (b) it might cause a noticeable performance degradation in some cases is not a great argument. We are dependent on this feature. We have no alternatives at this point. And I have proof that it does not affect performance in a noticeable way, with extensive testing in our environment/data etc. Plus, I am guessing that we are not the only one in the world using this feature.

For these reasons, we should either move this check and exception to other parts of Lucene (without affecting indexing and standard highlighter) or remove it. 

 

asfimport commented 5 years ago

Michael Gibney (@magibney) (migrated from JIRA)

Ram, it's good that this solution worked for you, but taking a step back, your solution seems like a workaround for #8451 and #5380. Workarounds aren't inherently bad of course, but when they depend on ambiguity of (or lack of enforcement of) contracts, backward compatibility can't be guaranteed (to paraphrase what I take Robert and Adrien to be saying).

Of course, one person's "patch" is another person's "workaround", but I'd be curious to know whether any of the "LUCENE-7398/*" branches might help for your use case. (There's a high-level description in this comment on the #8451 issue). Particularly relevant to this discussion: the patch supports recording token positionLength in the index, and enforces index ordering by startPosition and endPosition (compatible with ordering specified for the Spans API).

asfimport commented 5 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think your use case can be properly handled as a token graph, without offsets going backwards, if you set proper PositionLengthAttribute for each token; indeed it's for exactly cases like this that we added PositionLengthAttribute.

Give your light-emitting-diode token PositionLengthAttribute=3 so that the consumer of the tokens knows it spans over the three separate tokens (light, emitting and diode).

To get correct behavior you must do this analysis at query time, and Lucene's query parsers will properly interpret the resulting graph and query the index correctly.  Unfortunately, you cannot properly index a token graph: Lucene discards the PositionLengthAttribute which is why if you really want to index a token graph you should insert a FlattenGraphFilter at the end of your chain.  This still discards information (loses the graph-ness) but tries to do so minimizing how queries are broken.
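
Concretely, here is a minimal sketch (untested, for illustration only) of such a graph stream for the example above: a single light-emitting-diode token at the same position as light, carrying PositionLengthAttribute=3 so it spans light, emitting and diode. Offsets only move forward, so IndexWriter accepts it; the position length is visible to TokenStream consumers such as query parsers, but is not stored in the index unless the graph is flattened.

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

final class GraphTokenStream extends TokenStream {
  // {term, positionIncrement, positionLength, startOffset, endOffset}
  private static final Object[][] TOKENS = {
      {"organic", 1, 1, 0, 7},
      {"light", 1, 1, 8, 13},
      {"light-emitting-diode", 0, 3, 8, 28},  // covers the positions of light, emitting, diode
      {"emitting", 1, 1, 14, 22},
      {"diode", 1, 1, 23, 28},
      {"glows", 1, 1, 29, 34},
  };

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private int upto;

  @Override
  public boolean incrementToken() {
    if (upto == TOKENS.length) {
      return false;
    }
    clearAttributes();
    Object[] t = TOKENS[upto++];
    termAtt.setEmpty().append((String) t[0]);
    posIncAtt.setPositionIncrement((Integer) t[1]);
    posLenAtt.setPositionLength((Integer) t[2]);
    offsetAtt.setOffset((Integer) t[3], (Integer) t[4]);
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    upto = 0;
  }
}
```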

asfimport commented 5 years ago

Ram Venkat (migrated from JIRA)

Michael G and Michael M:

We do not use Lucene's standard query parser. We have made significant enhancements to Surround Parser (several thousand lines of code), and that is our primary parser. I can see how I will pass the PositionLengthAttribute and change the adjacency distance in the query. But I have to implement that in SurroundParser and I can do that sometime in the future and contribute it. 

So, back to my original point: This check broke existing applications with valid use cases, without providing a workaround. A simple way to bypass the check would be sufficient for our purposes. I believe that we should make that change. 

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I am sorry that this change broke your use-case, but rejecting backward offsets still sounds like a better trade-off to me.

asfimport commented 5 years ago

Michael Gibney (@magibney) (migrated from JIRA)

Unfortunately, you cannot properly index a token graph: Lucene discards the PositionLengthAttribute which is why if you really want to index a token graph you should insert a FlattenGraphFilter at the end of your chain. This still discards information (loses the graph-ness) but tries to do so minimizing how queries are broken.

Would there be any interest revisiting #5380: adding support for indexed position length (PositionLengthAttribute)?

Since Ram's is an index-time use case, I see only two options:

  1. @mikemccand's suggestion, which would compromise phrase query accuracy (e.g., missing "light-emitting-diode glows"), and
  2. @jpountz's initial suggestion, which would compromise highlighting precision.

For graph TokenStreams, indexed position length could be used to fully address (as opposed to minimize) "how queries are broken", in addition to avoiding the tradeoff/compromise in the case described here. It would also enable index-time multi-term synonyms, etc ... 

asfimport commented 5 years ago

Ram Venkat (migrated from JIRA)

Adrien,

There was no trade-off required here. Postings and negative offsets could have peacefully coexisted, with different implementations using one or the other. A choice was made knowingly or unknowingly to break some implementations, to the benefit of a new feature. 

Michael G,

I like the PositionLengthAttribute. But, like I said earlier, that still means that all the relevant query parsers have to make use of the positionLength before it can replace the negative offsets feature. Until that happens, this check cannot be forced on the users.

 

asfimport commented 5 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ram Venkat I'm sorry this change broke your use case.

I think allowing backwards offsets was an accidental but longstanding bug in prior versions of Lucene.  It is unfortunate your code came to rely on that bug, but we need to be able to fix our bugs and move forwards.

@magibney a 3rd option in your list would be for Ram Venkat to fix his query parser to properly consume the graph, and generate fully accurate queries, the way Lucene's query parsers now do.  Then you can have precisely matching queries, no bugs.

asfimport commented 5 years ago

Ram Venkat (migrated from JIRA)

@mikemccand - You are mischaracterizing a long-standing Lucene feature as a "bug". Offsets going backwards worked exactly as we wanted. But it's not worth getting into that cliche.

I am not saying that this feature should not be retired, if it adds great value to do so. But users should be given the time to migrate their implementations to alternate methods. That is just standard practice in maintaining any product or library, especially a mature library like Lucene. Hence, there should be a significant period of time during which users can bypass the check that prevents indexing such documents (with negative offsets).

About us enhancing our query parser: it is not trivial. I am not sure whether the Lucene standard query parser (or whatever you are referring to) can deal with the combination of wildcards and term distance. For example, "light* adjacent_to glows" should match "light-emitting-diode glows". This can be done in our parser, but it is just not a small enough task for us to do as part of a version upgrade. This is why we need time to do this.

asfimport commented 5 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Ram Venkat I do understand your frustration. Believe me, we don't take changes like this easily. One person's bug is another person's feature, and as we grow and mature, strong guarantees are essential for a vast majority of users, for future developments, for faster iterations and for more performant code. There might not be a tradeoff from your perspective; from the maintainers' perspective there is. Now we can debate whether a major version bump is enough time to migrate or not; our policy is that we can make BWC and behavioral changes like this in a major release. In fact we don't do it in minors, precisely to provide you the time you need and to ease upgrades to minors. We will build, and have built, features on top of this guarantee, and in order to manage expectations I am pretty sure we won't go back and allow negative offsets. I think your best option, whether you like it or not, is to work towards a fix for your issue with either the tools you have now or by improving Lucene, for instance with the suggestion from @magibney regarding indexing more information.

Please don't get mad at me, I am just trying to manage expectations. 

asfimport commented 5 years ago

Ram Venkat (migrated from JIRA)

Simon Willnauer - I am frustrated with this change to Lucene, but not at any of you. I am actually quite grateful to each of you for taking the time to respond to my comments. Fact is that I am just a user complaining about a free library, without having made any contribution. I hope to contribute the improvements I have made in the future, if/when my clients approve.

To summarize this discussion:

  1. Disallowing negative offsets is a permanent, irrevocable change to Lucene. Users will not be allowed to work around this new restriction, although it affects only those who use postings (Unified Highlighter) at this time. 
  2. This change breaks the ability to split words at index time into multiple tokens and correctly search and highlight them with span queries, especially in conjunction with wildcards. There is no easy alternative, and the solution requires a significant enhancement to query parsers. 

Can we at least either add these comments to the code or refer this jira issue from the exception thrown? If you agree, I can submit a patch. 

asfimport commented 5 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think we should change the exception message at all. Please remember: the code is open source, you can modify the source code for whatever purpose.

So I think such a message comes out wrong, I don't even buy the argument that it "broke" anything: we're not putting a gun to your head and forcing you to upgrade or anything. You can stay on an older version of the code, you can fork the code and modify it, all kinds of solutions.

asfimport commented 5 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Ram Venkat I am curious if you've tried the latest version of the UnifiedHighlighter that can be configured to use the "WeightMatches" API. This is toggled via the org.apache.lucene.search.uhighlight.UnifiedHighlighter.HighlightFlag#WEIGHT_MATCHES flag. This isn't the default at the Lucene level but ought to be changed to be. You may notice some highlighting differences, like for phrases and spans in which phrases as a whole get one pair of open/close tags instead of the constituent words. There's a chance that this mode ameliorates your highlighting woes with offsets.
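
For reference, a sketch (with an assumed field name "body") of one way to turn that flag on, by overriding getFlags:

```java
import java.io.IOException;
import java.util.EnumSet;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter.HighlightFlag;

final class WeightMatchesHighlighting {
  static String[] highlight(IndexSearcher searcher, Analyzer analyzer,
                            Query query, TopDocs topDocs) throws IOException {
    UnifiedHighlighter highlighter = new UnifiedHighlighter(searcher, analyzer) {
      @Override
      protected Set<HighlightFlag> getFlags(String field) {
        // Keep the default flags and add WEIGHT_MATCHES on top of them.
        Set<HighlightFlag> flags = EnumSet.noneOf(HighlightFlag.class);
        flags.addAll(super.getFlags(field));
        flags.add(HighlightFlag.WEIGHT_MATCHES);
        return flags;
      }
    };
    // Phrases/spans are then highlighted as whole regions rather than word by word.
    return highlighter.highlight("body", query, topDocs);
  }
}
```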

asfimport commented 5 years ago

Ram Venkat (migrated from JIRA)

@dsmiley I have not tried the UnifiedHighlighter yet. Phrases getting a highlight tag as a whole is a very useful and cool feature.  

My problem is that I cannot tell the highlighter that I want the highlight to begin at an earlier offset than the offset of the previous token. Is there a way to communicate that information to the UnifiedHighlighter?  

asfimport commented 5 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I think your response/question is about tokenstream contracts with offsets and not about highlighting, and that's well established territory already discussed in this thread.

asfimport commented 5 years ago

Ram Venkat (migrated from JIRA)

@dsmiley It would be challenging to discuss highlighting, in the context of this thread, without also discussing offsets. If you are asking me whether that particular feature in UnifiedHighlighter will solve my problem, then based on my current understanding the answer would be 'no'. I can explain more if you want me to.

To solve my problem, either offsets have to go backwards or I have to rewrite my search logic to treat adjacency on either side differently.

asfimport commented 4 years ago

Roman Chyla (migrated from JIRA)

I too suffer from the same issue; we have multi-token synonyms that can even overlap. I recognize the arguments against backward offsets, but I find them surprisingly backwards themselves: they say that the implementation dictates function, when the function is (for many people) the goal. The arguments also seem to say that because the most efficient implementation (non-negative integer deltas) does not allow backward offsets, backward offsets are therefore a bug.

Please recognize that the most elegant implementation sometimes means "as complex as needed" – it is not the same as "the simplest". If negative vints consume 5 bytes instead of 4, some people need to and are willing to pay that price. Their use cases cannot simply be 'boxed' into a world where one only looks ahead and never back (NLP is one such world).

Lucene is however inviting one particular solution:

The implementation of vint seems not to mind a negative offset (https://issues.apache.org/jira/browse/LUCENE-3738), and DefaultIndexingChain extends DocConsumer – the name 'Default' suggests that at some point in the past, Lucene developers wanted to provide other implementations. As it is right now, it is not easy to plug in a different DocConsumer – that surely seems like an important omission! (One size fits all?)

So if we just add a simple mechanism to instruct Lucene which DocConsumer to use, then all could be happy and not have to resort to dirty hacks or forks. The most efficient impl will be the default, yet it will allow us - dirty bastards - to shoot ourselves in the foot if we so desire. SOLR as well as ElasticSearch devs might not mind having the option in the future - it can come in handy. Wouldn't that be wonderful? Well, wonderful certainly not, just useful... could I do it? @rmuir @mikemccand @s1monw


asfimport commented 4 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I think preventing negative offsets is more than just allowing people to pay the price of 5 bytes per negative offset. There are many more things than compression based on this behavior. Just a few that come to my mind right away:

So if we just add a simple mechanism to instruct Lucene which DocConsumer to use, then all could be happy and not have to resort to dirty hacks or forks. The most efficient impl will be the default, yet it will allow us - dirty bastards - to shoot ourselves in the foot if we so desire. SOLR as well as ElasticSearch devs might not mind having the option in the future - it can come in handy. Wouldn't that be wonderful? Well, wonderful certainly not, just useful... could I do it?

if you feel like you wanna implement this, go ahead. I am sure there will be more issues, like CheckIndex not working anymore, etc. There might be future features you will break by doing this. But again, it's all open source; nobody forces you to upgrade or use the code we package directly. You can download the source and modify it; that is all up to you.

We as a community need to make decisions that sometimes don't work for everybody. We have a great responsibility with a project like Lucene being used in an unbelievably wide range of applications, and that sometimes means adding restrictions. We don't take this easily; it's almost always a hard decision. Having offsets go forward only will be a win for a large majority, and that's why we keep this restriction. I am truly sorry for your struggle.

Talking about extending DocConsumer, I am torn on whether it should be an extension point. I know that there is an implementation out there that uses it, but if I could pick I'd remove this extension, since it's, as you say, way too hard to get right.

asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I just wanted to chip in a lateral idea that crossed my mind - if these offsets are required only for highlighting then one could get around the problem stated in the description of this issue by indexing positions only and retrieving hit position offsets using matches API. These positions would have to be retrieved using a different attribute than offsets attribute but that attribute could go backwards... Yes, it would require re-parsing the content for highlighting but it would also be a clean and elegant solution to the original problem without the need to hack Lucene internals.

There is a fresh set of components that could be used for building such a highlighter on top of matches API in #10479. All that would be needed is a specific implementation of position-to-offsets strategy that retrieves those arbitrary offsets for a token's position. I think this could work. Just a thought.

asfimport commented 4 years ago

Roman Chyla (migrated from JIRA)

@s1monw no doubt that the decisions are not always easy, I appreciate the attention you are giving the matter. The arguments presented here are meant to provide more information and if you decide that the matter at hand has no merit, there is no point arguing (and no bad blood). It is actually easier for us to fork Lucene – but it seems (and the stress is 'seems') wrong for Lucene to intentionally limit itself in what it was doing so well, so I'm trying to nudge the case.

 

Consider this: I have disabled the checks in DefaultIndexingChain and reran the full suite of tests; these tests started failing:

7.7: org.apache.lucene.index.TestPostingsOffsets

8.6: org.apache.lucene.index.TestPostingsOffsets

You'll notice that the only new tests failing are those that enforce the check. Maybe there are no tests written for index integrity?

 

As to the issue at hand: whether the implementation/extension can be made 'pluggable'. My view is the following: if you constrain what is already in Lucene, you are forcing people to make forks. We are somewhere in between - it would be easy to provide an option to plug in a custom chain. It would cost little to give people the option (with BIG NEON WARNINGS pasted all over if necessary).

 

Ok, that's an argument by practicality – not a strong one. But how about the "nothing is broken" part? (Yes, the tests that enforce the condition are failing – but nothing else is broken.) Use cases, however, are broken: there are already two examples of projects that got this complex scenario right (our project is one such example) - I asked in the forum, and @dsmiley and @gh_at struggled in their work with the same issue. I'm not meaning to drag them in for them to weigh in (but I wouldn't mind obviously ;)), I'm just trying to illustrate that the limitations break real-case scenarios.

And the benefits still seem to be in the realm of "future possibilities". Sure, that is not to be dismissed lightly; they are important concerns. But if we as engineers always choose the most efficient over the most optimal, we would always be building "houses" without windows (these things lose energy, make people fall from heights, require cleaning - they are incredibly wasteful!)

The next thing I could test is to run a performance test with a tokenizer chain which allows backward postings and one which employs the flatten graph filter, and report the results. But I'm going to do that only if the case is really open for consideration; otherwise it would be a waste of time.

asfimport commented 4 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I really like the relatively new offset ordering constraint as a default for some of the reasons Simon gave. There's this sensibility to it; it just makes sense intuitively in a way that needs no defense.  I upgraded a bunch of old code to follow this rule and I'm happier with the upgraded version of those components than the prior behavior.  But then there's some interesting/advanced cases where the rule is simply impossible to follow.  What I'd like to see is a way for expert users to toggle this off.  Perhaps a setting on the Analyzer passed to IndexWriterConfig or just some other setting on IndexWriterConfig.  Or maybe a read-only setting on the OffsetAttribute (requiring a custom impl)?  That'd be my preference.  I think a pluggable IndexingChain thing seems too invasive / too difficult to get right.  I'm willing to roll up my sleeves and make this setting happen if I know in advance I'm not going to be vetoed on principle.

asfimport commented 4 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

But then there's some interesting/advanced cases where the rule is simply impossible to follow.

I am still struggling to understand what use cases are actually made impossible by IndexWriter's enforcing of valid offsets.

I sent a reply to Roman Chyla on the dev list to try to understand those examples better.

While the cost of writing a negative vInt was one of the motivations for stricter offset checking, another (stronger?) motivation was catching buggy analysis components that were producing invalid offsets, causing problems only detected much later (at search time).  IndexWriter's offset enforcement allows detecting such buggy TokenFilters earlier.

asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I don't think it can be rearranged for both positions and offsets to be non-decreasing given the positions are fixed (so that they match on spans)?

organic light emitting diode glows
|    |    |    |    |    |    |    |
0    5    10   15   20   25   30   35

pos  term(s)               offset (inclusive)
0    organic               0-6
1    light                 8-12
1    light-emitting-diode  8-27
2    emitting              14-21
3    diode                 23-27
3    light-emitting-diode  8-27
4    glows                 29-33
asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

LUCENE-8776-proof-of-concept.patch is a proof-of-concept of what I had in mind when I mentioned the new match highlighter could source arbitrary offsets for positions. Not everything is perfect though since the default position-to-offsets strategy doesn't really know which term at a given position was the actual match, so it takes the longest possible range. This means a query for "light" would highlight "light-emitting-diode" because they are indexed at the same position. Maybe it could be improved with term matching internally... but it's not the point you're trying to make so I'm not going to look any further.

I agree with you that with the current indexing chain I don't know how to achieve the token stream with the positions and offsets you showed.

LUCENE-8776-proof-of-concept.patch adds two test classes under lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight: CustomOffsetsFromPositions, an OffsetsRetrievalStrategy that re-analyzes the stored field value and converts match positions into offsets read from a custom, non-indexed CustomOffsetAttribute, and TestCustomTokenOffsets, which defines that attribute, a CustomTokenStream and CustomAnalyzer built on it, and asserts the expected highlights for term and phrase queries over the example text. The analyzer emits this canned token stream (term, position increment, start offset, end offset):

```java
new Token("organic", 1, 0, 7),
new Token("light", 1, 8, 13),
new Token("light-emitting-diode", 0, 8, 28),
new Token("emitting", 1, 14, 22),
new Token("diode", 1, 23, 28),
new Token("light-emitting-diode", 0, 8, 28),
new Token("glows", 1, 29, 34)
```

The test's assertions show that a query for "light" highlights the widest range indexed at that position ('organic >light emitting diode< glows'), that a query for "light-emitting-diode" produces a doubled highlight because it matches at two positions with the same offsets ('organic >>light emitting diode<< glows'), and that the phrase query "light-emitting-diode glows" matches through the 'backward' offsets token at the 'diode' position ('organic >light emitting diode glows<').
asfimport commented 4 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I think I understand the original example in the opening post here.

The light-emitting-diode token is repeated a 2nd time (at position 3) so that the span/phrase query "light-emitting-diode glows" works correctly?  But then what about the span/phrase query "organic light-emitting-diode glows", which ought to match but I think even for your workaround (double-indexing light-emitting-diode) will not then work?

I see this solution as working around the fact that positionLength is not indexed in Lucene.  Yet, Lucene already offers an accurate way to solve all of this, at query time, by properly consuming the token graph output after tokenizing a query (including positionLength of the tokens) and creating a correct query such that all of the above examples would work correctly, without producing two or more light-emitting-diode tokens.
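
As a sketch (for illustration; it assumes the indexed field "title" was analyzed so that light-emitting-diode is split into the three words), the query-time side can look like this, using SynonymGraphFilter to produce the graph and QueryBuilder to consume it:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;
import org.apache.lucene.util.QueryBuilder;

final class GraphQueryTimeSketch {
  static Query buildQuery() throws IOException {
    // Map the single hyphenated token onto the three-word path, keeping the original token.
    SynonymMap.Builder smb = new SynonymMap.Builder(true);
    CharsRefBuilder multiWord = new CharsRefBuilder();
    SynonymMap.Builder.join(new String[] {"light", "emitting", "diode"}, multiWord);
    smb.add(new CharsRef("light-emitting-diode"), multiWord.get(), true);
    SynonymMap synonyms = smb.build();

    Analyzer queryAnalyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer source = new WhitespaceTokenizer();
        TokenStream graph = new SynonymGraphFilter(source, synonyms, true);
        return new TokenStreamComponents(source, graph);
      }
    };

    // QueryBuilder walks the token graph (including position lengths) and builds a query
    // that matches either the single-token path or the three-word path in the index.
    return new QueryBuilder(queryAnalyzer).createPhraseQuery(
        "title", "organic light-emitting-diode glows");
  }
}
```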

Are you sure you cannot use the query-time solution, so you get correct positional queries?  You are asking us to remove the IndexWriter offset checks so your workaround by double-tokenizing (which is not always correct?) can work again?

asfimport commented 4 years ago

Roman Chyla (migrated from JIRA)

Sorry for crossposting (into the forum and here); I will try to study @dweiss's example, but here is some useful writeup – please jump to the last example, where the PositionLength attribute would fail us.

assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
         "title", "THE HUBBLE constant: a summary of the hubble space telescope program"));

 

term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
 term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10
 term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
 term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
 term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
 term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
 term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
 term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
 * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
 term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
 term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68

* - fails because of offsetEnd < lastToken.offsetEnd; If reordered (the multi-token synonym emitted as a last token) it would fail as well, because of the check for lastToken.beginOffset < currentToken.beginOffset. Basically, any reordering would result in a failure (unless offsets are trimmed).

The following example has an additional twist because of space-time; the tokenizer first splits the word and generates two new tokens – those alternative tokens are then used to find synonyms (space == universe).

assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
         "title", "MIT and anti de sitter space-time"));
term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13
 term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
 term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
 term=syn::massachusetts institute of technology posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
 term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
 term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
 term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
 term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
 term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
 term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
 term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
 term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 offsetEnd=28
 term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
 term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33

So far, all of these cases could be handled with the new position length attribute. But let us look at a case where that would fail too.

assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",
         "title", "Massachusetts Institute of Technology and antidesitter space-time"));
term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
 term=syn::massachusetts institute of technology posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
 term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
 term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
 term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
 term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
 term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
 term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
 term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
 term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
 term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
 term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64

Notice the posLen=4 of MIT; it would cover the tokens massachusetts institute technology antidesitter while the offsets are still correct.

asfimport commented 4 years ago

Michael Gibney (@magibney) (migrated from JIRA)

I see this solution as working around the fact that positionLength is not indexed in Lucene

I think that indeed describes the motivation behind Ram Venkat's example. For some such cases, restricting graph tokenStreams to query-time would help; but I don't think there's any escaping the fact that there are cases where graph tokenStreams would more appropriately be generated at index time (e.g., using extra context/information that can only be available at index time). There's always #5380 :-) ...

I'm also still struggling to understand Roman Chyla's examples, which re-focused attention on this issue; like @mikemccand, I don't see why any of these examples would fail based on the offset constraints in DefaultIndexingChain. AFAICT, there is no "offsetEnd < lastToken.offsetEnd" restriction; each startOffset in the examples above is >= the preceding startOffset, and each endOffset is >= its corresponding startOffset. I feel like I'm missing something. Roman Chyla, can you share your actual analysis chain config?

No matter what startOffset/endOffsets your tokens have, even if they get really crazy, it should be possible (at least in principle) to order tokens such that the constraint of "no backward startOffsets" holds, no? I think, as @dsmiley hints at, where this could (perhaps?) become impossible is if such ordering would conflict with ordering based on positions.

To my mind (echoing @s1monw's "have clear bounds of what an offset can be and in what direction it can grow"): the main benefit of formalizing/enforcing the order of tokens/offsets in a tokenStream is so that consumers (including indirect consumers, e.g. via postings at query-time) know what to expect. Even an index that's not strictly-speaking "corrupt" can be made more useful/efficient if types of order that can be enforced are enforced.

Two main questions occur to me:

  1. Are there use cases that truly cannot be supported (even in principle, never mind with the current state of analysis components) with strict ordering constraints based on token offsets, positions, and maybe positionLengths?
  2. Is there enough existing functionality that people have built around the historic lack of constraints that the horse has left the barn, and the ability to toggle this behavior off should be provided, absent some practical compulsion otherwise (e.g., actual index incompatibility/corruption, as opposed to simply sanity-checking input)?

Perhaps a bit off-topic, but ideally I could see:

  1. indexed positionLength
  2. strict ordering of tokens by increasing position (this is the case by definition, I think?), and for a given position, order by increasing positionLength
  3. offsets compatible with positions such that the above position-based ordering would also result in ordering of tokens by increasing startOffset (perhaps even adding the constraint that for a given startOffset, endOffset would never decrease?)
asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

The light-emitting-diode token is repeated a 2nd time (at position 3) so that the span/phrase query "light-emitting-diode glows" works correctly?

This is the way I understood the original example at least?

But then what about the span/phrase query "organic light-emitting-diode glows", which ought to match but I think even for your workaround (double-indexing light-emitting-diode) will not then work?

I think you're correct.

Yet, Lucene already offers an accurate way to solve all of this, at query time, by properly consuming the token graph output after tokenizing a query (including positionLength of the tokens) and creating a correct query such that all of the above examples would work correctly, without producing two or more light-emitting-diode tokens

I never played with such complex graphs (redface). How would this work in indexing/ at query time? Can you write up a test case for the above, Mike?

asfimport commented 4 years ago

Ram Venkat (migrated from JIRA)

@mikemccand The solution we had worked in our case and satisfied several thousand power users for their daily needs, because they did not care about the three-word adjacency issue.

I wonder whether we are over-complicating this issue. At index time, we and others generate hidden tokens. Our purpose is to match certain span queries. Others (like Roman Chyla) may do it for multi-word synonyms. There are so many use cases for hidden tokens that we all know. When we generate such hidden tokens, we have to make sure that highlighting still works correctly. To this effect, we need to have full control over setting the offsets. The "offsets must not decrease" rule took away our power/control over highlighting.

Are you sure you cannot use the query-time solution, so you get correct positional queries? You are asking us to remove the IndexWriter offset checks so your workaround by double-tokenizing (which is not always correct?) can work again?

 

If you scroll above, you can see that we actually discussed this a year ago. It gets complicated because of wildcards and adjacency. The "impossibility" criterion is always difficult to prove.

asfimport commented 4 years ago

Roman Chyla (migrated from JIRA)

@magibney I'll dig for more concrete examples, but when you pointed out #5380 - that position length is not indexed – it becomes a little bit of an academic exercise. Because if position length is not indexed, then there is no hope of reconstructing the token stream; the highlighter has no information about which tokens were part of the multi-token synonym. Position length could substitute for offsets, but I'm also showing in the last example where it fails (it doesn't have access to the stop words that got removed).

 

So, relying on position length, one cannot know the span of the tokens that made up the resulting multi-token synonym (in other words: position length cannot substitute for offsets, even though it is proposed to solve the same thing).

 

Check this place in the indexing chain – the previous (invertState) start and end offsets play a role in the checks - multi-token synonyms always span more words, so in their case (to pass here) their offsets must be trimmed (or the offsets of the words that made them enlarged – but that would also be weird).

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L917
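
For context, the invariant enforced there boils down to something like this (a paraphrase, not the actual Lucene source):

```java
// A self-contained paraphrase of the check: each token's startOffset must not be smaller
// than the previous token's startOffset, and its endOffset must not be smaller than its
// own startOffset.
final class OffsetOrderCheck {
  private int lastStartOffset = 0;

  void accept(String fieldName, int startOffset, int endOffset) {
    if (startOffset < lastStartOffset || endOffset < startOffset) {
      throw new IllegalArgumentException(
          "startOffset must be non-negative, and endOffset must be >= startOffset, "
              + "and offsets must not go backwards: startOffset=" + startOffset
              + ", endOffset=" + endOffset + ", lastStartOffset=" + lastStartOffset
              + " for field '" + fieldName + "'");
    }
    lastStartOffset = startOffset;
  }
}
```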

 

Finally, in my experience query-time synonym expansion is not enough - we have to do both. Yes, it is very subtle, but I have tried hard not to (but I can be proven wrong, if somebody tries harder :))

 

here is the analyzer chain: https://github.com/romanchyla/montysolr/blob/master/contrib/examples/adsabs/server/solr/collection1/conf/schema.xml#L406

 

and the accompanying unit test: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L62

 

On that line you'll see the reasoning for why we do both query-time and index-time expansion.

 

We also do query optimization – which some might find crazy – to avoid score inflation (maybe you have noticed that when a query expansion emits multiple synonyms, the score by default is a product of the many synonyms; you could call that score inflation). Because we can't rely fully on index-time expansion, we also expand at query time and pick from the synonyms the one that is more frequent or less frequent (based on the user query); this optimization is only possible because we have indexed them all – I don't think you could do that with query-time expansion only:

https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L293

https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L377
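
To give a flavour of that optimization, here is a toy sketch of the "pick one synonym variant by document frequency" idea; it is not montysolr's actual implementation, and the class and method names are made up:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

/** Toy sketch: choose one indexed synonym variant by docFreq instead of OR-ing them all. */
final class SynonymChooser {

  /** Assumes a non-empty list of synonym variants that were all indexed in {@code field}. */
  static Query choose(IndexReader reader, String field, List<String> variants, boolean preferFrequent)
      throws IOException {
    Term best = null;
    int bestDf = -1;
    for (String text : variants) {
      Term candidate = new Term(field, text);
      int df = reader.docFreq(candidate);  // only meaningful because every variant was indexed
      if (best == null
          || (preferFrequent && df > bestDf)
          || (!preferFrequent && df < bestDf)) {
        best = candidate;
        bestDf = df;
      }
    }
    return new TermQuery(best);
  }
}
```

Picking a single variant this way avoids scoring several synonym terms for the same surface text, and it is only possible because every variant exists in the index.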

 

But I seem to be digressing; these things are only important if we were inclined to believe that query-time synonym expansion can solve it for all cases.

asfimport commented 4 years ago

David Smiley (@dsmiley) (migrated from JIRA)

(from my message to the dev list last week): Where I work, I've seen an interesting approach to mixed language text analysis in which a sophisticated Tokenizer effectively re-tokenizes an input multiple ways by producing a token stream that is a concatenation of different interpretations of the input. On a Lucene upgrade, we had to "coarsen" the offsets to the point of having highlights that point to a whole sentence instead of the words in that sentence :-(

I'll make up an example in a generic way. Our actual tests are unintelligible to me as they use a bunch of CJK characters that are foreign to me :-)

Imagine input chars "abcdef" and two tokenizers, each emitting simple linear chains (no graphs). Tokenizer1 emits: "abc", "def". Tokenizer2 emits: "ab", "cd", "ef". I believe it's impossible to combine both tokenizers in such a way that preserves both position and offset constraints. One of them has to break. If Lucene let offsets go backwards (as it used to), we would simply concatenate the streams – tokenizer1's tokens, then tokenizer2's tokens. It's important that someone writing in the "language" of tokenizer1 can do position-sensitive queries for, say, "abcdef", and likewise that someone writing in the language of tokenizer2 can do queries of "abcd" or "cdef" and have them work.

asfimport commented 4 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

@dsmiley I am pretty sure your example would actually work today.

You would just need your tokenizer to interleave the tokens.

Remember that there is no requirement that all tokens in one sub-path be enumerated before all tokens in another sub-path.  So your tokenizer would emit something like this: abc, ab, cd, def, ef.  This permutation should also work: ab, abc, cd, def, ef.

Of course, since Lucene does not record position length in the index, in order to get something close to reasonable for positional queries, you would need to use FlattenGraphFilter or something equivalent to squash the tokens from the two side paths on top of each other.
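
To make the interleaving concrete, here is a minimal sketch of a stream that merges the two tokenizations of "abcdef" from David's example, ordered so that startOffset never decreases. The posInc/posLen values are one possible "flattened" assignment chosen for illustration; since the two paths don't share node boundaries, any squash is a compromise:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

/** Emits both tokenizations of "abcdef", merged by start offset. */
final class InterleavedTokens extends TokenStream {
  // term, posInc, posLen, startOffset, endOffset -- one row per token.
  // Emitting in start-offset order (ab, abc, cd, def, ef) satisfies the indexing chain;
  // concatenating the two streams (abc, def, ab, cd, ef) would make offsets go backwards.
  private final Object[][] tokens = {
      {"ab",  1, 1, 0, 2},
      {"abc", 0, 2, 0, 3},
      {"cd",  1, 1, 2, 4},
      {"def", 1, 1, 3, 6},
      {"ef",  0, 1, 4, 6},
  };

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private int index = 0;

  @Override
  public boolean incrementToken() {
    if (index == tokens.length) {
      return false;
    }
    clearAttributes();
    Object[] t = tokens[index++];
    termAtt.setEmpty().append((String) t[0]);
    posIncAtt.setPositionIncrement((Integer) t[1]);
    posLenAtt.setPositionLength((Integer) t[2]);
    offsetAtt.setOffset((Integer) t[3], (Integer) t[4]);
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    index = 0;
  }
}
```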

I think there is an issue where someone made a prototype that indexed position length into payloads and then did the right thing at search time.  There was also a talk about it at Buzzwords last summer, I think.  Does anyone know this issue?
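
For reference, the index-time half of such a prototype essentially amounts to copying the (otherwise unindexed) positionLength into a payload. A minimal sketch of the idea, not the actual patch from that issue; a real implementation would need to coexist with other payload users and encode lengths larger than 127:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.util.BytesRef;

/** Copies positionLength into a payload so search-time code can rebuild the token graph. */
final class PositionLengthPayloadFilter extends TokenFilter {
  private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  PositionLengthPayloadFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken() == false) {
      return false;
    }
    int posLen = posLenAtt.getPositionLength();
    if (posLen > 1) {  // only tokens spanning multiple positions carry the extra byte
      payloadAtt.setPayload(new BytesRef(new byte[] { (byte) posLen }));
    }
    return true;
  }
}
```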

If you ask git to enumerate all commits, sorted by commit timestamp, how does it serialize its DAG? I think in general it is always possible for git to do this (correctly sorted), regardless of how complex the commit DAG turns out to be, and the same should then also work for Lucene token streams, as long as the token stream is free to assign positions, similarly to how git is free to assign commit hashes. And Lucene's offsets are like git's timestamps. I think?

How would this work at indexing/query time? Can you write up a test case for the above, Mike?

Actually, @jimczi is the expert on how our queryparsers respect the full token graph and produce correct queries!

A test case should not be hard to make – create a query parser with an analyzer that returns a CannedTokenStream producing the full token graph, then parse any text and see what query it produces?
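
Something along these lines, sketched with a real graph-producing filter rather than a CannedTokenStream (class and field names are made up, and the exact shape of the resulting query depends on the Lucene version; a real test would assert on that query):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.QueryBuilder;

public class GraphPhraseQueryDemo {
  public static void main(String[] args) {
    // Query-time analyzer: keep "light-emitting-diode" as one token and also emit
    // its parts, so the query-time token stream is a graph (no double-indexing needed).
    Analyzer queryAnalyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream graph = new WordDelimiterGraphFilter(
            source,
            WordDelimiterGraphFilter.GENERATE_WORD_PARTS
                | WordDelimiterGraphFilter.PRESERVE_ORIGINAL,
            CharArraySet.EMPTY_SET);
        return new TokenStreamComponents(source, graph);
      }
    };

    // QueryBuilder consumes the graph (including positionLength) and builds a query
    // in which both the single-token and the split interpretations can match.
    Query q = new QueryBuilder(queryAnalyzer)
        .createPhraseQuery("title", "organic light-emitting-diode glows");
    System.out.println(q);
  }
}
```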

asfimport commented 4 years ago

Michael Gibney (@magibney) (migrated from JIRA)

I think there is an issue where someone made a prototype that indexed position length into payloads and then did the right thing at search time. There was also a talk about it at Buzzwords last summer, I think.

@mikemccand I think you're talking about the work I put in on #8451 (and the related talk at Buzzwords). Quick update there: that work has been ported forward through 8.3 (due to be ported forward again soon I hope), and it's been solidly running in production for several years. That branch has a bunch of test cases added to TestSpanCollection.java that are relevant to the discussion here (particularly wrt the problem that the double-emitted "light-emitting-diode" token workaround seeks to address).

I'm curious about the multi-tokenizer use case; off the top of my head I can see that it would save some terms dictionary space (for tokenizers that generate a lot of tokens in common), and could help reduce overall field count, if that's an issue. Single-language positional queries (e.g. spans, intervals) should work, but mixed-language positional queries would not (not a problem, but could be useful in some circumstances?). Interleaved tokens (as opposed to concatenated streams) at query-time would probably work, and would probably "work" at index-time in the sense that they could be indexed within the constraints in DefaultIndexingChain – but index-time interleaved tokens would I think hit unexpected behavior pretty quickly for positional queries (spans, intervals) unless the graph structure (i.e., at least positionLength) were indexed ...

position length cannot substitute offsets; even though it is proposed to solve the same thing

Indeed, positionLength is not a general-purpose substitute for offsets; I never intended to suggest otherwise. But #5380 and the #8451 work is relevant to this issue because it presents a more robust alternative to Ram's double-token-emission workaround for positional queries.

Roman, I gather that your case is different, but unfortunately I still can't really pinpoint the problem. I see the checks in the DefaultIndexingChain code, I see that the previous invertState startPosition is part of validation, I acknowledge the need for both index-time and query-time expansion (this was the main motivation for my work on #8451!)... but the startOffsets in the examples you sent don't decrease, and the endOffset of each token is greater than its startOffset, so I don't see anything that would violate the constraints in DefaultIndexingChain. I was hoping that seeing your actual analysis chain would help reproduce your issue locally, but it's not as straightforward as I'd hoped, given the number of custom analysis components (I also note, without thinking through possible implications, the lack of FlattenGraphFilter, and the fact that your word-delimiter and synonym filters seem (?) to be monkey-patched variants of "pre-graph" WordDelimiterFilter and SynonymFilter).

asfimport commented 4 years ago

Roman Chyla (migrated from JIRA)

@magibney the original issue in this ticket is about the ability to run span queries and highlight them properly. As for positions, the posLen attribute solves the search part (well, if it were indexed – which it is not); but it cannot solve the highlighting part – at least I don't see how it could. And I have shown that in this example:

 

term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
 term=syn::massachusetts institute of technology posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
 term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
 term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
 term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
 term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
 term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
 term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
 term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
 term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
 term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
 term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64

 

Notice the posLen=4 of MIT; the position length attribute is **wrong** because of stopwords. It would cover the tokens "massachusetts institute technology antidesitter", while the offsets are still correct.

 

That our chain is complex is true, but I don't think it matters at all, because the tokenizer is producing the stream of tokens listed above – and it is "broken" only in the sense that backward offsets are disallowed (none of our filters are broken in the sense of being 'buggy' – they are broken in the sense of producing what is now deemed "illegal"). And yet the "illegal" in this case, I and some other people would posit, is "correct" for expert use cases. The discussion will hopefully help in revisiting the problem.

 

If I were to add the flatten filter, I'd be able to index the tokens, because the filter trims offsets (makes them only grow, never go back). But if you think about it: that is an impossible proposition with multi-token synonyms if, at the same time, you want to index the individual tokens that make them up.

 

1: Massachusetts | Massachusetts Institute of Technology

2: institute

3: technology

 

'MIT' spans three indexed-token positions – but it also spans *four original-token positions* ("Massachusetts Institute of Technology").

And if there were two parallel filters inside one tokenizer chain, each of them picking different tokens and paying attention to different stopwords – as @dsmiley described – I don't see how their output could be interlaced to progress in a linear fashion (so as not to trip the offset asserts).

 

BTW: The output example is produced by Lucene 6.x; when ported to Lucene 7.x it produces exactly the same stream – but it trips https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L917

 

 

 

asfimport commented 4 years ago

Michael Gibney (@magibney) (migrated from JIRA)

Roman, I think we're on the same page re: positionLength. I read the original case on this issue (re: double-emission of tokens) as a sympathetic type of "XY problem", and I'm suggesting that indexing positionLength (LUCENE-4312 – as opposed to simply using unindexed positionLength) would be a better fundamental way to address the "Y" case (working positional queries) than accommodating the "X" case (double-emission of tokens, which may be "less" broken, but afaict is still broken in its own way).

I also need to apologize, I was indeed overlooking something: although the asterisked terms in the examples you shared above still don't seem problematic to me (and I still see no problem with the "THE HUBBLE constant: a summary of the hubble space telescope program" example), I see that the latter two examples ("MIT and anti de sitter space-time" and "Massachusetts Institute of Technology and antidesitter space-time") each have one (and only one?) problem: a backward startOffset on the last token. A couple of random thoughts on that:

  1. because the positionIncrement of each of these is "0", it would be possible in principle to swap with the preceding token to satisfy the constraints enforced by DefaultIndexingChain (see the sketch after this list). This isn't an argument that the issue is irrelevant; rather, it's a wish for another example that can't in principle be "solved" in such a way.
  2. In the "gut feeling" department: I'm a little wary of this being on the last token (tokenStream components can exhibit unusual behavior at the beginning/end). If I were troubleshooting, I'd probably first add an extra term at the end of each input field value and see how this affects things, just as a sanity check before digging deeper.
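
Here is the sketch referenced in item 1: an illustrative filter (not something shipped with Lucene) that buffers all tokens sharing a position and re-emits them ordered by startOffset. As discussed below, this only helps when the backwards offset occurs between tokens stacked at the same position, which is why a counter-example that cannot be fixed this way would be interesting:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

/** Buffers the stacked tokens of one position and re-emits them ordered by startOffset. */
final class SortStackedTokensFilter extends TokenFilter {

  private static final class Entry {
    final AttributeSource.State state;
    final int startOffset;
    Entry(AttributeSource.State state, int startOffset) {
      this.state = state;
      this.startOffset = startOffset;
    }
  }

  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private final List<Entry> group = new ArrayList<>();  // tokens sharing one position
  private int emitIndex;                                // next buffered token to emit
  private int groupIncrement;                           // posInc carried by the group's first token
  private AttributeSource.State pending;                // read-ahead token of the next position
  private boolean exhausted;

  SortStackedTokensFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Emit the remainder of an already-buffered group first.
    if (emitIndex < group.size()) {
      restoreState(group.get(emitIndex).state);
      posIncAtt.setPositionIncrement(emitIndex == 0 ? groupIncrement : 0);
      emitIndex++;
      return true;
    }
    group.clear();
    emitIndex = 0;

    // Start the next group, either from the read-ahead token or from the input.
    if (pending != null) {
      restoreState(pending);
      pending = null;
    } else if (exhausted || input.incrementToken() == false) {
      exhausted = true;
      return false;
    }
    groupIncrement = posIncAtt.getPositionIncrement();
    group.add(new Entry(captureState(), offsetAtt.startOffset()));

    // Pull in every stacked token (posInc == 0) at this position.
    while (input.incrementToken()) {
      if (posIncAtt.getPositionIncrement() == 0) {
        group.add(new Entry(captureState(), offsetAtt.startOffset()));
      } else {
        pending = captureState();  // belongs to the next position; keep it for later
        break;
      }
    }
    if (pending == null) {
      exhausted = true;
    }

    // Re-order the stacked tokens so startOffset never decreases within the position.
    group.sort(Comparator.comparingInt((Entry e) -> e.startOffset));
    return incrementToken();  // emit the first token of the freshly sorted group
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    group.clear();
    emitIndex = 0;
    pending = null;
    exhausted = false;
  }
}
```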

FWIW, I was asking about the analysis chain for context; it's not so much the complexity of the analysis chain that prevents me from trying to reproduce locally as the fact that it uses several custom components (and some of those based on deprecated implementations) ...

asfimport commented 4 years ago

Roman Chyla (migrated from JIRA)

Michael, I appreciate your thinking about the issue – it's always a pleasure to meet somebody who changes his/her mind based on evidence. As far as I can tell, the backward-going offsets are indeed the 'only' problem – but unfortunately one of a 'fundamental' category: it doesn't square nicely with the desire to efficiently store offsets using deltas.

I can reassure you that the problem doesn't appear only on the first or last positions. The unit tests linked from my post have specific examples of that (even searching for phrases embedded in/surrounded by other tokens).

The swapping – unless I'm missing something important – will not work (for the multi-token situation), because the next token is going to be checked against the offsets of the last emitted token. So wherever the multi-token gets emitted (first, last, or even in the middle), it will trip the wire for the surrounding tokens (or the already-emitted tokens will trip the wire for the multi-word token).

 

Funnily enough, swapping the 0th position could trigger another issue (but only for the very first token in the stream): https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L897

 


asfimport commented 4 years ago

Roman Chyla (migrated from JIRA)

@magibney now it's my turn to apologize; all this time I have missed that a WordDelimiter factory was involved. Well, the problems shown in my two examples above are real (the positionLength spans 4 tokens that don't belong together), but I now have to question my understanding of everything else. See my email to @mikemccand from a few moments ago (the word-delimiter factory is involved; to my slight relief, it's a stock Solr version).

 

from the email:

 

Hi Mike,

I'm sorry, the problem all along is related to a word-delimiter filter factory. This is embarrassing, but I have to admit it publicly and self-flagellate.

A word-delimiter filter is used to split tokens; these are then used to find multi-token synonyms (hence the connection). In my desire to simplify, I omitted that detail while writing my first email.

I went to generate the stack trace:


```java
assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
        "title", "THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program"));
```

 

stage:indexer term=xxxxxxxxxx603 pos=1 type=word offsetStart=0 offsetEnd=13
stage:indexer term=acr::the pos=1 type=ACRONYM offsetStart=0 offsetEnd=3
stage:indexer term=hubble pos=1 type=word offsetStart=4 offsetEnd=10
stage:indexer term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
stage:indexer term=constant pos=1 type=word offsetStart=11 offsetEnd=20
stage:indexer term=summary pos=1 type=word offsetStart=23 offsetEnd=30
stage:indexer term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
stage:indexer term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
stage:indexer term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
stage:indexer term=space pos=1 type=word offsetStart=45 offsetEnd=50
stage:indexer term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
stage:indexer term=program pos=1 type=word offsetStart=61 offsetEnd=68

That worked; only the next one failed:

```java
assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
        "title", "MIT and anti de sitter space-time"));
```

stage:indexer term=xxxxxxxxxx604 pos=1 type=word offsetStart=0 offsetEnd=13
stage:indexer term=mit pos=1 type=word offsetStart=0 offsetEnd=3
stage:indexer term=acr::mit pos=0 type=ACRONYM offsetStart=0 offsetEnd=3
stage:indexer term=syn::massachusetts institute of technology pos=0 type=SYNONYM offsetStart=0 offsetEnd=3
stage:indexer term=syn::mit pos=0 type=SYNONYM offsetStart=0 offsetEnd=3
stage:indexer term=anti pos=1 type=word offsetStart=8 offsetEnd=12
stage:indexer term=syn::ads pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
stage:indexer term=syn::anti de sitter space pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
stage:indexer term=syn::antidesitter spacetime pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
stage:indexer term=de pos=1 type=word offsetStart=13 offsetEnd=15
stage:indexer term=sitter pos=1 type=word offsetStart=16 offsetEnd=22
stage:indexer term=space pos=1 type=word offsetStart=23 offsetEnd=28
stage:indexer term=time pos=1 type=word offsetStart=29 offsetEnd=33
stage:indexer term=spacetime pos=0 type=word offsetStart=23 offsetEnd=33

 

stacktrace:

 

325677 ERROR (TEST-TestAdsabsTypeFulltextParsing.testNoSynChain-seed#[ADFAB495DA8F6F40]) [    ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id 605 to the index; possible analysis error: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for field 'title'
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:242)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:1002)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:1233)
at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$2(DistributedUpdateProcessor.java:1082)
at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1082)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:694)
at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
at org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection.java:125)
at org.apache.solr.util.TestHarness.update(TestHarness.java:285)
at org.apache.solr.util.BaseTestHarness.checkUpdateStatus(BaseTestHarness.java:274)
at org.apache.solr.util.BaseTestHarness.validateUpdate(BaseTestHarness.java:244)
at org.apache.solr.SolrTestCaseJ4.checkUpdateU(SolrTestCaseJ4.java:874)
at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:853)
at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:847)
at org.apache.solr.analysis.TestAdsabsTypeFulltextParsing.setUp(TestAdsabsTypeFulltextParsing.java:223)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:972)
at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for field 'title'
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:823)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1616)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1608)
at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:969)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:341)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:288)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:235)
... 61 more