
Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream [LUCENE-2939] #4013

Closed: asfimport closed this 13 years ago

asfimport commented 13 years ago

Huge documents can be drastically slower than they need to be, because the entire field is added to the MemoryIndex. This cost can be greatly reduced in many cases if we try to respect maxDocCharsToAnalyze.

Things can be improved even further by respecting this setting in CachingTokenFilter as well.
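
For context, the fix discussed here caps the TokenStream itself before it is consumed. A minimal usage sketch, assuming the OffsetLimitTokenFilter this patch eventually introduced (sketched in full later in the thread; the surrounding calls are illustrative, not the exact committed code):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.highlight.OffsetLimitTokenFilter;

    // Build a MemoryIndex over at most ~maxDocCharsToAnalyze chars of the
    // field, instead of analyzing (and indexing) the entire, possibly huge,
    // text.
    MemoryIndex buildLimitedIndex(Analyzer analyzer, String field, String text,
        int maxDocCharsToAnalyze) throws Exception {
      TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
      // Stop producing tokens once the accumulated offsets reach the limit.
      stream = new OffsetLimitTokenFilter(stream, maxDocCharsToAnalyze);
      MemoryIndex index = new MemoryIndex();
      index.addField(field, stream);
      return index;
    }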


Migrated from LUCENE-2939 by Mark Miller (@markrmiller), resolved Apr 20 2011.
Attachments: LUCENE-2939.patch (versions: 4)

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I'm a little rusty on the new TokenStream API, but here is a little test patch I popped out real quick.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

hmmm...didn't quite get it right yet I think...

q=java man news th*

java.lang.NullPointerException
    at org.apache.lucene.util.CharacterUtils$Java5CharacterUtils.fill(CharacterUtils.java:181)
    at org.apache.lucene.analysis.CharTokenizer.incrementToken(CharTokenizer.java:150)
    at org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter.incrementToken(WordDelimiterFilter.java:224)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter.incrementToken(ASCIIFoldingFilter.java:71)
    at com.ACME.analysis.ACMEPluralStemFilter.incrementToken(ACMEPluralStemFilter.java:56)
    at org.apache.solr.highlight.TokenOrderingFilter.incrementToken(DefaultSolrHighlighter.java:575)
    at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:78)
    at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:50)
    at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:220)
asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

It only seems to happen when maxDocCharsToAnalyze is absurdly low - like 5.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't know why you get this NullPointerException (maybe you triggered a bug), but...

just a quick glance:

  1. why use offsets for this calculation? This seems a bit dangerous versus other approaches.
  2. either way, the reset() method should clear any state such as counters in the tokenstream.

As far as what I meant above... the whole maxDocCharsToAnalyze seems like the wrong measure. Why not specify this just as a max token count, and use LimitTokenCountAnalyzer, which is already implemented?

Using arbitrary chars and offsets is going to create fake tokens (e.g. truncated words) and other problems. Besides, it's not Unicode-safe, since a codepoint might span multiple chars.
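
For reference, the existing token-count approach Robert mentions looks roughly like this (the wrapped analyzer and the limit of 10000 are illustrative; in 3.1 the class lives in org.apache.lucene.analysis):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LimitTokenCountAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    // A whole-token limit rather than a character-offset cutoff: the wrapped
    // analyzer emits at most 10000 tokens per field, then stops cleanly at a
    // token boundary (no truncated words, no split surrogate pairs).
    Analyzer limitedAnalyzer() {
      Analyzer base = new StandardAnalyzer(Version.LUCENE_31);
      return new LimitTokenCountAnalyzer(base, 10000);
    }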

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Because this is already a setting on the Highlighter that appears to work by offset?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Maybe the setting is already there, but I think we should remove it: I don't think it's the best measure.

We can instead replace it with a max # tokens setting, which is more intuitive, easier to implement, and consistent with how other things are limited (e.g. the old IW setting and the new LimitTokenCountFilter).

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Fair enough (this setting is well before my time AFAIK) - but that's not my intent with this issue, which is just to fix this little perf bug.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

The other problem was that CachingTokenFilter was exhausting the entire stream eagerly - which could be a spin through a very large TokenStream - uselessly, if a user has set the maxDocCharsToAnalyze setting.

This, along with adding the whole stream to the MemoryIndex, has been a very large performance bug in the span highlighter for some time now.

In my test case, using Solr's DEFAULT_MAX_CHARS_TO_ANALYZE = 50*1024, highlighting 10 very large PDF docs dropped from some 20 seconds to 300ms.

New patch with some fixes and cleanup. I don't see the above error with a more correct TokenFilter impl.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

My last patch is missing a couple of required test compile changes - I excluded that class because I had some test code in it.

I'll put up a new patch as soon as I get a chance, with the test class changes (the Scorer init method gets a new param, and there are a couple of anonymous impls in the tests).

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Honestly, if I was not so busy, I'd say we should really get this in for 3.1.

If you are doing something like desktop search, this can be a really cruel highlighter perf problem.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

P.S. One that is really a bad bug, in my mind - we switched this to be the default, and the old Highlighter did not suffer like this in these situations.

Looking back over the email archives, it has bitten more than a few people. I'm pretty sure this bug was the impetus for the FastVectorHighlighter (which is still valuable if you really do want to highlight over every token in your 3-billion-word PDF file ;) ).

You pay this huge perf penalty for no gain and no reason. If you are talking Wikipedia-size docs, it won't affect you - but for long documents, doing 10 snippets can be prohibitive, with no workaround. That is not a friendly neighborhood highlighter.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think the offsetLength calculation needs to be inside the incrementToken?

Honestly, if I was not so busy, I'd say we should really get this in for 3.1.

Yeah, performance bugs are bugs too.

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I can backport if you want.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think the offsetLength calculation needs to be inside the incrementToken?

I do not follow ... incrementToken is:

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Exactly, so what are the attribute's values before calling input.incrementToken()?

I don't think it is good practice to work with uninitialized values.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

This includes the change to the test to make it compile.

Still no CHANGES.txt entry.

The compile change to the test is a back compat break. The Scorer needs to know the maxCharsToAnalyze setting.

Have not had time to consider further yet.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I can backport if you want.

+1

I don't think it is good practice to work with uninitialized values.

I see what you mean now - though I still don't understand your previous comment. I assume that it's just defaulting to 0 - 0 now?

Yeah, that could be changed.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I see what you mean now - though I still don't understand your previous comment. I assume that it's just defaulting to 0 - 0 now?

Only the first time.

But imagine you try to reuse this tokenstream (maybe it's not being reused now, but it might be in the future)... the values for the last token of the previous doc are, say, 10 - 5... the consumer calls reset(Reader) with the new document and then reset(), which clears your accumulator, but this attribute is still 10 - 5 until input.incrementToken()... only then does the tokenizer update the values.
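
A sketch of the reuse-safe shape Robert is describing, along the lines of the OffsetLimitTokenFilter the patch ended up with (simplified, not the verbatim committed code):

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public final class OffsetLimitTokenFilter extends TokenFilter {
      private final OffsetAttribute offsetAttrib = addAttribute(OffsetAttribute.class);
      private final int offsetLimit;
      private int offsetCount;

      public OffsetLimitTokenFilter(TokenStream input, int offsetLimit) {
        super(input);
        this.offsetLimit = offsetLimit;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (offsetCount < offsetLimit && input.incrementToken()) {
          // Read the attribute only AFTER input.incrementToken() has filled
          // it in; before that call it still holds stale values from the
          // previous token (or previous document, if the stream is reused).
          offsetCount += offsetAttrib.getEndOffset() - offsetAttrib.getStartOffset();
          return true;
        }
        return false;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        offsetCount = 0; // clear accumulated state so reuse starts fresh
      }
    }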

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Given the back compat breaks in the API, are we sure we should try to shove this into 3.1?

I am sympathetic to performance bugs, BUT it seems that one could use TermVectors and the FastVectorHighlighter for these large documents; the user is hardly left without options.

As a safer alternative, we can document the issue in CHANGES.txt and recommend that users take that approach for large documents, and take our time and fix this for 3.2.
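
For that workaround, a hedged sketch against the 3.x-era APIs (the field name "body", the field options, and the fragment size are illustrative):

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
    import org.apache.lucene.search.vectorhighlight.FieldQuery;

    // Index with term vectors + positions + offsets so highlighting never
    // has to re-analyze the (huge) stored text at query time.
    Document makeDoc(String text) {
      Document doc = new Document();
      doc.add(new Field("body", text, Field.Store.YES, Field.Index.ANALYZED,
          Field.TermVector.WITH_POSITIONS_OFFSETS));
      return doc;
    }

    // At query time the FastVectorHighlighter reads the vectors directly.
    String highlight(Query query, IndexReader reader, int docId) throws IOException {
      FastVectorHighlighter fvh = new FastVectorHighlighter();
      FieldQuery fieldQuery = fvh.getFieldQuery(query);
      return fvh.getBestFragment(fieldQuery, reader, docId, "body", 100);
    }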

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Given the back compat breaks in the API, are we sure we should try to shove this into 3.1?

I won't do the work, so whatever from my perspective...

the user is hardly left without options.

Depends on what you mean by options - the FastVectorHighlighter cannot highlight half our queries (multi-term, last I knew, or Span) - trade one bug for another.

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I'm OK either way, but it does seem like a pretty big performance bug.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Depends on what you mean by options - the FastVectorHighlighter cannot highlight half our queries (multi-term, last I knew, or Span) - trade one bug for another.

Right, but you can use term vectors with this highlighter too, right? This issue only seems to refer to the case where you have no term vectors and analyze the text at runtime...

I don't think it's too much to say 'index your content according to what you are going to need'.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Grant: in terms of the back compat issue - I'm not really worried about it myself, since this is contrib and we have changed these interfaces before with no complaint -

but another temporary option is to special-case with an instanceof check on the Scorer - and if it's our QueryScorer, cast it and set the max chars to analyze.

It's not as pretty, but it avoids the method sig change.
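
Roughly like this (a sketch only; the setter name follows what the patch added to QueryScorer, and fragmentScorer/maxDocCharsToAnalyze stand in for the Highlighter's own state):

    // Inside Highlighter: avoid changing the Scorer interface by
    // special-casing the concrete QueryScorer and handing it the limit.
    if (fragmentScorer instanceof QueryScorer) {
      ((QueryScorer) fragmentScorer).setMaxDocCharsToAnalyze(maxDocCharsToAnalyze);
    }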

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Right, but you can use term vectors with this highlighter too, right? This issue only seems to refer to the case where you have no term vectors and analyze the text at runtime...

Nope, that's not true. If you turn on term vectors, that does NOT solve this bug.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Anyhow, all I have time for on this today.

I'll leave it up to you guys...err, I mean robert, to decide what to do here.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'll leave it up to you guys...err, I mean robert, to decide what to do here.

Sorry you feel this way... everyone says they want faster releases, but no one wants to take the appropriate steps to move towards a model that supports that.

In order to release more often we have to stop this cycle of shoving things in at the last minute.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

In order to release more often we have to stop this cycle of shoving things in at the last minute.

As always in Lucene land, these things should be taken case by case depending on the facts - the severity of the bug and its effect on the release. These things can often be discussed by more than a single person.

Not ramrodded by someone being a bit of an asshole.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Not ramrodded by someone being a bit of an asshole.

These things can often be discussed by more than a single person.

Yes, anyone can produce a release candidate of Lucene. But if it's going to be me doing it, I've already set aside time (and coordinated with others) to make RC builds. So I'm going to push back on shoving in last-minute changes.

Well, you can call it that, or you can call it someone trying to be a release manager who will actually get out a release in the next year.

Bottom line: if you feel this change is really important, I respect your decision on that. But you should set the issue to blocker and be aware that the tradeoff likely means delaying the RC for a few weeks (unless someone else steps up to volunteer to produce an RC, which is fine!)

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

What you don't seem to get is that I don't mind if you push back. I don't mind your position.

I mind your attitude. Changing the issue target 2 seconds after Grant, with no discussion. Declaring on your own that it won't get in. Not trying to get to a real conversation about the issue (which you clearly don't fully understand if you think storing term vectors will help). These things are my issue, not any so-called push-back.

Well man, you need us on your team too. A performance bug is a technically valid reason for a -1 on a release. I'm not threatening that - but I'm pointing out that everyone needs to be on board - not just the RM. Taking the time for fair discussion is not a waste of time.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I mind your attitude. Changing the issue target 2 seconds after Grant, with no discussion. Declaring on your own that it won't get in. Not trying to get to a real conversation about the issue (which you clearly don't fully understand if you think storing term vectors will help). These things are my issue, not any so-called push-back.

It's not an attitude, and it's not personal. It's trying to stop last-minute stuff from being shoved into the release right before the RC, especially when it's not a fully-formed patch ready to be committed.

Well man, you need us on your team too. A performance bug is a technically valid reason for a -1 on a release. I'm not threatening that - but I'm pointing out that everyone needs to be on board - not just the RM. Taking the time for fair discussion is not a waste of time.

I totally agree with you here. But some people might say, if the bug has been around since, say, 2.4 or 2.9, that it's not critical that it be fixed in 3.1 at the last minute, and still +1 the release.

As I stated earlier on this issue, I'm sympathetic to performance bugs: performance bugs are bugs too. But we need to evaluate risk-reward here.

Just don't forget that there are other performance problems with large documents in Lucene (some have been around a while), and we aren't trying to shove any last-minute fixes for those in.

So, here are my questions:

  1. What version of Lucene was this performance bug introduced in? Is it something we introduced in version 3.1? If this is the case, it's more serious than if it's something that's been around since 2.9.
  2. Why is fast-vector highlighter with TVs "ok", but highlighter with TVs slow?
asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I think Robert's right, we should not have shoved this in at the last minute, even though it is a pretty big issue for those doing highlighting of larger documents. I'd say we just mark it as 3.1.1 or 3.2.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think 3.2 is a good tradeoff, unless we introduced this slowdown in 3.1 (my earlier question).

If we are introducing this slowdown in the 3.1 release, then I think it's much more serious, and I would instead suggest we set the issue to blocker.

Regardless, I think there are some technical steps that can be taken to ease my mind about the patch; for example, the TokenFilter here can be tested independently with BaseTokenStreamTestCase (this is good at catching reuse bugs like the one I hinted at).
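
Something like the following would exercise the new filter under BaseTokenStreamTestCase (the input string and limit are illustrative; MockTokenizer ships with the test framework):

    import java.io.StringReader;

    import org.apache.lucene.analysis.BaseTokenStreamTestCase;
    import org.apache.lucene.analysis.MockTokenizer;
    import org.apache.lucene.analysis.TokenStream;

    public class OffsetLimitTokenFilterTest extends BaseTokenStreamTestCase {
      public void testLimit() throws Exception {
        TokenStream stream = new MockTokenizer(
            new StringReader("short toolong evenmuchlongertext a ab toolong foo"),
            MockTokenizer.WHITESPACE, false);
        // With a 10-char offset budget, "short" (5 chars) plus "toolong"
        // (7 chars) crosses the limit, so only those two tokens survive.
        stream = new OffsetLimitTokenFilter(stream, 10);
        assertTokenStreamContents(stream, new String[] {"short", "toolong"});
      }
    }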

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

we should not have shoved this in at the last minute,

We didn't? Marking something as 3.1 is the best way to get it considered for last-minute inclusion, blocker or not. It certainly doesn't mean it's not going to be pushed back out after discussion.

In any case, if you are not for it, that decides it - I'm not willing to do the work right now.

So, here are my questions:

  1. I don't remember - 2.9, probably.
  2. Because it's a completely different approach.

It's been around for a while. I saw one guy who stayed on Solr 1.3 over 1.4 because of it. Most people will try the fast vector highlighter and say, oh nice, it's fast - but it doesn't highlight wildcard queries or these other queries, etc. They either accept one bug over the other, or stick with an older version. Honestly, if that continues for another release, it's no skin off my nose. But neither are most bugs.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

In order to release more often we have to stop this cycle of shoving things in at the last minute

+1

Dev is a constant thing around here, and if we keep holding back a release for the one-more-issue to get in, we will never release.

Our lack of releases reflects badly on Lucene/Solr - the outside world uses this as a proxy for our health, and we know we get bad marks.

Worse, this whole situation (people getting angry at the RM for doing precisely what the RM is supposed to do) is a disincentive for future RMs to volunteer doing releases, thus causing even less frequent releases. It's already hard enough for us to get a release out as it is.

The RM is supposed to be an asshole (not that Robert has acted like one, here, imho). S/he has full authority to draw the line, crack the whip, do whatever it takes to get the release out. We all cannot question that, unless we want to step up and be the RM because it is NOT an easy job.

I think this issue should wait for 3.2.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Worse, this whole situation (people getting angry at the RM for doing precisely what the RM is supposed to do) is a disincentive for future RMs to volunteer doing releases, thus causing even less frequent releases. It's already hard enough for us to get a release out as it is.

I'm not sure I agree. Declaring that this will not get in is beyond the scope of an RM, in my opinion. Putting on pressure is fine, but just because being an RM is hard does not make it a King-for-a-day pass IMO. It's up to the RM to build the release candidate from whatever issues he wants - does that mean he needs to manhandle JIRA?

The RM is supposed to be an asshole (not that Robert has acted like one, here, imho).

I don't think he is? Too strong a word. I didn't even use it fully to refer to the two actions I was commenting on. First, I said acting like a bit of an asshole, and I separated it a distance as theoretical. It was weak-sauce commentary on the two actions I've pointed out:

  1. Just reverting a change another respected committer made, immediately and without discussion - without too much investigation - because that issue was subsumed by another issue that had already been moved for 3.1 consideration and that we were still discussing.

  2. Declaring that this will not make it in over ongoing discussion.

S/he has full authority to draw the line, crack the whip, do whatever it takes to get the release out.

Do whatever it takes? Come on...

We all cannot question that, unless we want to step up and be the RM because it is NOT an easy job.

I do question it. I have stepped up to be the RM, I know it's not an easy job, and I'll volunteer to do it again sometime.

Robert is great at it - in general, he has all my support in the world. I certainly understand the difficulties he's facing trying to get this release out, and he has my sympathies - and I wish he had more of my help. But that doesn't change my reaction to his actions. I feel the same way, and I have the same responses to it. Meanwhile, I still like and respect Robert.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Putting on pressure is fine, but just because being an RM is hard does not make it a King-for-a-day pass IMO

Putting pressure is precisely what Robert has done here (and on SOLR-2390)?

He's acting just like an RM should act, as far as I can tell. Moving an issue out, stating that an issue won't make the RC, is fair game. These are the normal "tools" of the RM...

My point is, we all must expect/allow/not-get-upset-about this "pressure" from the RM – it comes with the territory, and it's the RM's right to be very aggressive in order to get the release done.

Else releases will not get done, and we'll all keep one-more-thing'ing the release, and that's bad for all of us.

We gotta remove the barriers to doing releases around here, not add to them. In fact, we should scrutinize our scary ReleaseTodo and pare it back to the bare minimum... it's gotta become a push button process.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Moving an issue out, stating that an issue won't make the RC, is fair game. These are the normal "tools" of the RM...

Not in my experience.

Regardless, I'm not sure this is the same as changing a JIRA issue right after someone else changes it with no discussion and apparent lack of understanding of the issue. That's a statement if you ask me. If you look at how the culture of Lucene has worked, this is unusual - and I'll push to make it remain so.

Putting pressure is precisely what Robert has done here (and on SOLR-2390)?

SOLR-2390 was this issue - this patch spans Lucene and Solr and covers both of them. This issue was still marked 3.1, and we were discussing it when this happened with SOLR-2390 - this is how I know little thought went into shoving it out - SOLR-2390 should be in lockstep with this issue. It's not a new or separate issue. It's just where I am tracking this from a Solr user bug perspective - it's easier to have one patch.

My point is, we all must expect/allow/not-get-upset-about this "pressure" from the RM – it comes with the territory, and it's the RM's right to be very aggressive in order to get the release done.

Depends - I've seen a lot of RMs before - and I have been one. Personally, I've never seen things done this way. Nor do I think it was necessary. We would have come to the same conclusion in either case. The history and culture of Lucene has been to not be forceful in JIRA - that's something I'll argue to maintain.

We gotta remove the barriers to doing releases around here, not add to them.

Release at all costs is just not an excuse, IMO. We release as often as someone is willing to put in the somewhat massive effort, really.

In fact, we should scrutinize our scary ReleaseTodo and pare it back to the bare minimum... it's gotta become a push button process.

I think we all agree with that. I'm still not on board with the scary RM theory.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I will also note that there was never any strong argument to include this issue.

There was never any danger of this needing to be strong armed out of 3.1.

I've already said I wouldn't do it - and Grant had volunteered, but never argued for it either.

asfimport commented 13 years ago

Trey Hyde (@treyhyde) (migrated from JIRA)

Sorry if I missed it in this thread, which branch was this patch made against? It doesn't apply cleanly against branch_3x.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Sorry if I missed it in this thread, which branch was this patch made against? It doesn't apply cleanly against branch_3x.

This patch is against trunk - still needs a fairly simple back port to 3x.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The patch needs more than a simple back port.

The patch needs to be fixed. It has tokenstream reuse bugs that cause offsets from the last token of the previous document to be applied to the calculations for the next document, because it reads dirty attributes.

It's not just the release manager being an asshole here; there are technical problems that need to be fixed.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

The patch needs more than a simple back port. The patch needs to be fixed.

And that is simple too if you follow the above comments.

You should pop the offset calculation into the if statement.

I'm not convinced it's a problem in this situation (especially for someone wanting to try a patch), because this works one document at a time.

It's also simple not to break the API, as I mentioned above.

I have done all of these things in my own work earlier (and added a test for the new filter) - took about 2 minutes.

Eventually I will post another trunk patch.

Doing a solid review and back port of this patch would not take long - it's fairly simple. I likely won't get to it for 3.x for a while, though.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Here is a more up-to-date version of the patch for trunk - good for testing the performance difference of this issue.

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Mark,

Seems like we can move forward with this now that the release is out. Do you have time or do you want me to take it?

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay - I'm going to commit to trunk shortly.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Bulk closing for 3.2