Extend contrib Highlighter to properly support PhraseQuery, SpanQuery, ConstantScoreRangeQuery [LUCENE-794]

asfimport commented 17 years ago

This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter package that scores just like QueryScorer, but scores a 0 for Terms that did not cause the Query hit. This gives 'actual' hit highlighting for the range of SpanQuerys, PhraseQuery, and ConstantScoreRangeQuery. New Query types are easy to add. There is also a new Fragmenter that attempts to fragment without breaking up Spans.

See #1481 for some background.

There is a dependency on MemoryIndex.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

There are two highlighting modes: highlight entire spans or highlight first and last word of each span. For the highlight first and last word of span it would probably be better to change QuerySpansExtractor.getSpansFromPhraseQuery so that it creates a series of near spans instead of a single near span with multiple clauses.

asfimport commented 17 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Looks like a good start, Mark - thanks for contributing this!

I've had a quick play and have identified the following issues:

1) Fieldname "contents" shouldn't be hardcoded into the Highlighter - different analyzers can behave differently for different fields (see PerFieldAnalyzerWrapper). Either pass a fieldname parameter or do as the existing highlighter does and take a TokenStream. The latter approach has the advantage of being able to avoid re-analysis and make use of any stored TermVectors (see TokenSources.java) 2) Analyzers which produce overlapping tokens (see Synonym analyzer in existing highlighter Junit test) are problematic in the existing code. I remember the "TokenGroup" class in the existing highlighter was an approach to help cater for these "overlap" scenarios. 3) Without wishing to resurrect the whole 1.4 vs 1.5 debate I beleive Lucene still targets Java 1.4.

To rectify these points it's not clear to me if it would be quicker to use your code or adapt the existing highlighter code to use spans. Thoughts?

Thanks, again, Mark

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Sorry about all that Mark H. This was literally just some test code that I quickly shoved into an api similar to your existing highlighter. If you decided that it should be something considered on it's own I would certainly have quite a bit further to go. Mostly I just put it up for your evaluation on extending the current highlighter with this highlight method.

>1) Fieldname "contents" shouldn't be hardcoded into the Highlighter - different analyzers can behave differently for different fields (see >PerFieldAnalyzerWrapper). Either pass a fieldname parameter or do as the existing highlighter does and take a TokenStream. The latter approach >has the advantage of being able to avoid re-analysis and make use of any stored TermVectors (see TokenSources.java)

I don't have a great solution for this right now. I need to read the TokenStream at least twice due to the MemoryIndex extracting the spans. Unfortunately, it seems I can copy the tokens to a list or pass them to the MemoryIndex – I cannot do both. The MemoryIndex is also looking for a field name...so while I changed the api to take a TokenStream, I have not resolved also needing the field name. I am hoping you have some good comments. To get around reading the TokenStream twice I used the horribly hackey but quick-for-me method of adding a method to MemoryIndex that accepts a List of Tokens. Any ideas?

2) Analyzers which produce overlapping tokens (see Synonym analyzer in existing highlighter Junit test) are problematic in the existing code. I remember the "TokenGroup" class in the existing highlighter was an approach to help cater for these "overlap" scenarios.

I always attack this last <G>. Seems a simple fix: if position increment equals 0 skip printing out the token. It passes your test which I have added to my test code, but I am not totally confident it is perfect yet.

3) Without wishing to resurrect the whole 1.4 vs 1.5 debate I beleive Lucene still targets Java 1.4.

Just me being lazy. I swear I have seen Contrib stuff that says 1.5. I have gone through and stripped out all of the 1.4 except for StringBuilder for the moment.

>To rectify these points it's not clear to me if it would be quicker to use your code or adapt the existing highlighter code to use spans. >Thoughts?

Depends entirely on what you think. I am sure I can fix all of the issues you mention (with a little advice <G>), but I am pretty new to this type of thing and perhaps you just want to start from scratch in order to achieve span highlighting with the existing highlighter. It may just be that the way I am doing this is not very compatible with the way you currently fragment and score.

I have added an updated Highlighter.java and HighlighterTest.java. The MemoryIndex problem remains...so it either has to be fixed or the modified MemoryIndex must be used.

Mark m

asfimport commented 17 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

There is indeed some Java 1.5 code in contrib/ I believe the gdata-server uses 1.5 classes. I think that's okay for contrib.

asfimport commented 17 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

>>Sorry about all that Mark H No need for any apologies - all help is gratefully received! I don't mean to criticise your efforts or seem picky - I just wanted to record my findings somewhere useful if we were to consider working a solution up from this "test code" rather than tweaking the current highlighter - I'm still uncertain about the best approach. I also thought it might be useful to point the potential issues out to you if you were already reliant on using this code somewhere.

>>I need to read the TokenStream at least twice >>I used the horribly hackey but quick-for-me method of adding a method to MemoryIndex that accepts a List of Tokens. Any ideas?

I'm not sure about modifying MemoryIndex. It should be easy enough to create a subclass of TokenStream - ("CachedTokenStream" perhaps?) which takes a real TokenStream in it's constructor and delegates all "next" calls to it (and also records them in a List) for the the first use. This can then be "rewound" and re-used to run through the same set of tokens held in the list from the first run.

>>if position increment equals 0 skip printing out the token...but I am not totally confident it is perfect yet.

I think it's possible some of the more Byzantine analyzers may have a position increment >0 but overlap in terms of their byte offsets. I'd need to check the old Junit tests to be sure on this. Welcome to my hell!

Thanks again for your help. Mark H

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Removed 1.5 dependencies, fixed api

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I switched to accepting an analyzer and a field name. I need the field name anyway for the MemoryIndex.

I agree that modifying MemoryIndex was horrible and I have removed that dependency (just did it as a 'quickfix').

I used the CachedTokenStream anyway to avoid analyzing twice (once for MemoryIndex and again for Highlighter use. Thanks for the idea...shows how bright I am having missed it <g>).

I removed all of the 1.5 code.

The code is probably fairly usable right now then. I think synonyms work fine unless a case does exist like you suggested.

So I suppose we have 4 options:

I extend and polish the code (needs more test cases, most of mine where written using my query parser) and it is used independently for full document highlighting based on spans. (I would like to add google cache like coloring)
The code is either merged with the existing highlighter or gutted to create a single highlighter that can fragment based on spans or based on the original term based approach.
The code is ignored and someone else starts fresh adding span support to the existing highlighter.
The code languishes in purgatory and we await the unknown.

Mark M

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Updated code to address deficiency in highlighting BooleanQueries.

Use the following latest classes:

CachedTokenStream DefaultEncoder Encoder Formatter Highlighter QuerySpansExtractor SimpleFormatter HighlighterTest

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Using an Analyzer that produces multiple tokens at the same position does not yet operate correctly if used at query time. Using such a 'synonym' analyzer for indexing and a non 'synonym' analyzer for searching will work fine.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I had some free time today and came back this issue. I was so set on my own needs to start on this that I completely ignored looking closely at the contrib highlighter code. I went back and read over it this morning and am in the middle of a new solution. The new solution is in the form of new SpanQueryScorer that extends Scorer and plugs into the original contrib highlighter code. I have adapted almost all of the original tests (still a few to go) and so far they all still pass using the SpanQueryScorer. There is no guarantee yet that Spans will not be chopped up, but I am sure there is a way to share Span info with a Fragmenter if you wanted to rectify this (I may get to it). I also have not implemented a scoring properly yet...at the moment any term that is found returns a score of 1, and each unique term in a fragment contributes 1 to the fragment score. I will look at going further here, but I will be posting the code first after I convert the rest of the relevant tests and add a few Span Query tests.

I am pretty confident this will be a great solution for 'actual hit' highlighting with the already tried and true contrib Highlighter, fragments and all.

-Mark

asfimport commented 17 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

>>I am pretty confident this will be a great solution

Great stuff, Mark. Sorry I've been out of the loop on this recently and not participating as much as I'd like - just too tied up with other work. I look forward to seeing your work!

Cheers, Mark H

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Howdy Mark H, I have not got into making new SpanQuery tests yet, but at this point I could use some help/guidance. All of the original highlighter tests are passing with the new SpanScorer except for two:

testFieldSpecificHighlighting

This will not pass the second assertion (ignore fields) because when i add the TokenStream to a MemoryIndex I have to add it to a field. I am stumped on getting around this one.

testOverlapAnalyzer2

Passes the first bunch but then fails on one. This is because I am looking up terms based on position since the Spans do not return the term text. The first assertion failing is looking for 'hi-speed' but finds 'hi-speed' because both 'speed' and 'hi-speed' are at position 0...consequently both score a 1. Any thoughts? I was thinking about gathering all possible terms in the SpanQueryExtractor and someone using them...

Beyond that, I am sure you can find plenty of other things to point out . Have at me <g>

Any ideas on scoring would be appreciated as well.

Feel free to run with this on your own if you have time as well...or run with it a bit and pass it back, or just provide some guidance as I go...whatever works out best for you.

Mark M

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

By the way...I apologize the file list is so messy now.

You just need:

SpanScorer SpanQueryExtractor CachedTokenStream SpanHighlighterTest

and there is the dependency on MemoryIndex

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Almost at the holy grail here. Everything works except the optional ignoring fields in the Query object. Scores work, all other tests pass, and even better there is no more limitation of only highlighting the first and last term in a Span – instead all correct terms in each Span will be highlighted. The only change to the existing code I had to make was to add a parameter to scoreToken(Token token) – I had to add int position.

I still think it is very feasible to pass info from this SpanScorer to a Fragmenter so that the Freagmenter can attempt to avoid splitting up Spans.

The current code will correctly highlight pretty much any standard or span query ( I think <g>) based on 'actual' hits using the exisiting contrib highlighter code...I have yet to write out the new extensive Span tests and I would appreciate it if some others would go over the code for some obvious improvements, but this is almost ready.

Get the latests: SpanScorer SpanQueryExtractor CachedTokenStream SpanHighlighterTest WeightedSpanTerm Highlighter

Mark

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

Mark, I'll take a look at this any year now. I think the code can be used or tweaked to act as "term order suggestion" and "untokenized cosmetic suggestion from stored values" in my didyoumean-patch.

Is there some documentation that describes this patch in a chronologically ordered text rather than "just" the java docs? Some simple package level html would probably help me to get started.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I have a patch coming tonight. It fixes a few odd mistakes and has a little more documentation. I had wanted to subpackage it into spanscorer for now, but it appears I can't make a patch with a new folder so that is out. Should I merge my package.html documentation with the one currently in highlighter? Also, I am not sure how a contrib that depends on another contrib should work build file wise (SpanScorer depends on MemoryIndex). I just made up something that works for now.

This new patch will be off the trunk so now the RangeQuery test fails as it does with the original QueryScorer...you cannot highlight a constantrangequery to my knowledge.

You also cannot ignore the fields in the query as you can with QueryScorer so that test fails. The only way that I can see doing this is to have the option in your query parser of ignoring all fields and just using one field name during parsing. Send the field-normal Query to search, and then make a field-neutered query for highlighting. That is the approach I will be taking with my query parser. I sure wish there was something better though.

Ill post the patch when I get out of work.

Mark

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Forget all the .java files...just get spanhighlighter.patch and apply to the trunk.

Still looking for pointers on how to handle to build.xml

asfimport commented 17 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Hi Mark, I found a little time to look at the span Highlighter the other night and was struggling with some missing bits and pieces (updated Scorer, missing SynonymAnalyzer etc) so only got as far as getting it all to compile before I ran out of time. Hopefully the patch will make life easier - will investigate when I have another chance.

As for the build.xml - have a look at XMLQueryParser's build.xml in contrib. This has a dependency on the "queries" contrib module added to the build.xml.

Cheers Mark H

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Yeah the patch should take care of all of that...I would have started with a patch, but this was literally my first and it took me a bit to figure it all out, especially with eclipses subclipse plugin using absolute paths instead of relative in the patch..then I was trying forever to add a new package before finding out I can't add a folder to a patch...but now that I got it all worked out it should make life much easier for anyone trying this <g> I will use patches from now on.

Thanks for the build.xml info and for taking a look.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Patch version 2

Changed to correct build.xml, removed some unneeded code

Has been working well for me personally, could still use some additional span highlighting tests

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

This patch tries another approach instead of changing the existing Highlighter api. The result is that if you call getBestFragments more than once, you must call reset() on the SpanScorer between each call. Whether this is better than modifying the existing api, I am not sure.

This patch also adds a new SimpleSpanFragmenter that fragments based on size, but ensures that Spans are not broken up. This class might not be perfect yet.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I have finally come up with a way to ignore fields and so the final test (testFieldSpecificHighlighting) passes for this. Now all original Highlighter tests pass with this patch. Pass null as the field to SpanScorer and fields will be ignored during highlighting.

SpanScorer now has the same behavior as the QueryScorer except that actual hits are highlighted.

I have also made a small fix to the SimpleSpanFragmenter.

I am still not sure if it is better to change the Highlighter API or require the kind of nasty call to reset the SpanScorer between calls to getBestFragments.

I have used a zip file this time. It contains the patch plus an index folder that holds a new class called TermModifier. This was necessary because I cannot add folders to the patch, but TermModifier needs to be in the org.apache.lucene.index package. First apply then patch, then add the index folder to the correct place in the Highlighter contrib section.

Not a lot left to do here. What do you think Mark H?

Mark

asfimport commented 17 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Hi Mark, Got the code patched and running here. Junit seems to work fine but I feel a little uncomfortable about use of the TermModifier class. Using this has the potentially undesirable side-effect of changing the client's query field. If they plan on re-running the same query this could be a problem.

I'll need to have a think if there is a better solution to this.

Cheers, Mark

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Hey Mark,

I wasn't too happy about TermModifier either since I am basically violating encapsulation...TermModifier basically makes field public. I really don't see how it is possible to ignore fields in another way though. If you can think of a way, that would be awesome . At a minimum, the Term fields could be set back to their original value after doing the Span search...I wouldn't think that would be much of a performance hit.

Mark

asfimport commented 17 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

>>At a minimum, the Term fields could be set back to their original value after doing the Span search..

Hmm. If the query is being reused in a multi-threaded server environment this wouldn't fly.

>>I really don't see how it is possible to ignore fields in another way though

I can think of one. Your current approach is based on modifying the query to suit the MemoryIndex content. Another approach may be to modify the MemoryIndex content to suit the query. Your code creates a MemoryIndex when presented with the text of a field. If it recognised it was being used in "field-insensitive mode" it could extract the query terms and create a MemoryIndex field for each unique fieldname in the set of query terms - using the same source text (a CachedTokenStreamAnalyzer could be used to avoid excessive tokenization of this text) This approach would of course use some more memory but avoids the unpleasantness of changing Query objects' contents. I haven't fully considered the implications of this idea yet - initial thoughts?

Cheers Mark

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

"Another approach may be to modify the MemoryIndex content to suit the query. Your code creates a MemoryIndex when presented with the text of a field. If it recognised it was being used in "field-insensitive mode" it could extract the query terms and create a MemoryIndex field for each unique fieldname in the set of query terms"

This should work fine. I had dismissed it ( and again butted heads with it for a while now that you mentioned it) because I couldn't see the forest through the trees. I kept thinking, this is just not going to work with a Span query that has terms from different fields. Over and over I thought that. How can I ignore fields in a SpanQuery. Now it hits me, rather embarrassingly, such a SpanQuery doesn't make sense at all.

I will try your approach and submit a new patch.

Mark

asfimport commented 17 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

>>How can I ignore fields in a SpanQuery. Now it hits me, rather embarrassingly, such a SpanQuery doesn't make sense at all.

Just to make sure we're talking about the same thing. Yes, I too came to the obvious realisation that a single SpanQuery cannot test content from more than one field - but I don't think that is something we were trying to support here. The requirement (as I understand it) is to support a scenario where a SpanQuery was testing only one field, say the "body" field and yet the user wanted to see any matches that just so happened to occur in another field, say the "title" field. Nowhere in the query was there a suggestion of any criteria mandatory or otherwise testing the "title" field - the user just wanted to highlight the title field for additional decoration. In this scenario we have the challenge of taking the "body" query terms and using them to highlight "title" field content. A "match" would have to disregard the original choice of field name but would still require that the positions of term text adhered to the SpanQuery logic.

Hope this makes sense

Mark

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Yup, we are on the same page. I was just buried in the code at the time, and having stared at your code that ignores the field for each Term I was not thinking from a high level but was instead stuck on the process of ignoring fields in a similar manner. For whatever reason it never dawned on me that we don't have to worry about a Span that has Terms with different field values. After staring at your suggestion long enough, my brain de-fogged.

I will submit an updated patch tomorrow.

Mark

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Just for thought, what about a SpanOr query with two sub Span queries that target different fields? Too obscure to care about?

I will post the new patch later tonight.

Mark

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Bah, that last comment is rubbish again. Of course that will work alright. Everything is looking sharp.

On another note though, what do you think about the restriction of having to reset the SpanScorer between calls to getBestFragments? Is this preferable to an api change?

Mark

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Apologize for the delay on this – I was pulled into a busy product launch.

This adds the final piece, replacing TermModifer with multiple Memory Indexes.

I also did a little refactoring, especially in the SpansExtractor.

All tests now pass and I have been using this succesfully for some time now.

For anyone new following this issue, ignore all of the files except for this one: spanhighlighter5.patch

Mark

asfimport commented 17 years ago

Sean O'Connor (@seanoc5) (migrated from JIRA)

I was able to apply the spanhighlighter5.patch. I'm inexperienced with ant and svn, so I assume the slight troubles I had were self-inflicted; I mention them in case they are of any help.

I might have missed something, but my MemoryIndex.java seemed to be missing the implementation of the abstract isPayloadAvailable() method from TermPositions. That was causing my build to fail, so I added the method, simply returning false.

After that change, the tests run, and life was good again. I do get a failed test at org.apache.lucene.search.highlight.HighlighterTest.testGetRangeFragments(HighlighterTest.java:137), but it looks like that might be expected. The search is "[kannedy TO kznnedy]".

I am now looking into getting the total number of hits for a given query (for un-normalized scoring), and the hit positions (saved for larger scale analysis and browsing). I have code that does this, but hope I can improve on my existing approach by using this highlighting patch. Thanks,

Sean

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Sorry Sean, I forgot to mention that the patch is off of the latest Lucene trunk code.

The range query test should fail because they switched the query parser to return a constant score query instead of a range query. Cannot highlight a constant score query.

Mark

asfimport commented 17 years ago

Sean O'Connor (@seanoc5) (migrated from JIRA)

Thanks Mark. I had the trunk from a few days ago (perhaps a week), so that was just me being lazy : -).

Is there anything I should be aware of the: parser.setUseOldRangeQuery(true); in doSearching(String queryString)? [about line 890 in SpanHighlighterTest.java]

I've read the javadocs which explain it a bit, but I don't think a understand enough to infer why you use it in the SpanHighterTest.java. If I can (relatively) safely ignore that, I will.

Sean

Mark Miller (JIRA) wrote: [ [1]https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487860 ]

Mark Miller commented on LUCENE-794:

Sorry Sean, I forgot to mention that the patch is off of the latest Lucene trunk code.

The range query test should fail because they switched the query parser to return a constant score query instead of a range query. Cannot highlight a constant score query.

Mark

SpanScorer and SimpleSpanFragmenter for Contrib Highlighter

            Key: LUCENE-794
            URL: [2]https://issues.apache.org/jira/browse/LUCENE-794
        Project: Lucene - Java
     Issue Type: Improvement
     Components: Other
       Reporter: Mark Miller
       Priority: Minor
    Attachments: CachedTokenStream.java, CachedTokenStream.java, CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, HighlighterTest.java, HighlighterTest.java, HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter_patch_4.zip, SpanHighlighterTest.java, SpanHighlighterTest.java, SpanScorer.java, SpanScorer.java, WeightedSpanTerm.java

This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter package that scores just like QueryScorer, but scores a 0 for Terms that did not cause the Query hit. This gives 'actual' hit highlighting for the range of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts to fragment without breaking up Spans. See [3]http://issues.apache.org/jira/browse/LUCENE-403 for some background. There is a dependency on MemoryIndex.

[1] https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487860 [2] https://issues.apache.org/jira/browse/LUCENE-794 [3] #1481

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I use that to make the Range Query test pass. The old style Range Query is highlightable.

asfimport commented 17 years ago

Sean O'Connor (@seanoc5) (migrated from JIRA)

Mark, Can you point me in the right direction? I want to find ALL hits (not just the top xx), and their location in the text.

I think the functionality exists in your patch, or could be easily extended. I just can't seem to get my head around where to start.

Thanks,

Sean

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Updated the patch to version 6. Apply against Lucene trunk.

Updated CachedTokenStream to implement reset() instead of rewind()
Removed rewind checks in CachedTokenStream
Reordered QuerySpansExtractor constructors and added one
QuerySpansExtractor now interns field name for faster comparisons against Token fields

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

Hi Mark,

I don't know the details of your patch. I just saw your class CachedTokenStream and was wondering if you're aware of the new class CachingTokenFilter in the analysis package? Maybe you could use that?

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Thanks Michael – I was not aware and will certainly make the change in the next patch I put up.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Minor update to straighten a few things out.

Replaced custom CachingTokenStream with Lucene's CachingTokenFilter
Some refactoring in the SpanExtractor (now WeightedSpanTermExtractor)
Updated some stale JavaDoc

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I plan on one more release and than I am finished.

I need to optimize the scoring (stop looking at positions for terms that are not position sensitive)

Make a couple unit tests to check for a bug I suspect

Turn the javadoc's into something I am actually proud of.

I would wait for this final patch before taking a look at this Mark H.

I apologize for being so incremental on this issue...lesson learned.

Mark

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

patch version 8 : Apply to root dir of trunk

Fixed a bug that was caused when a query had the same term multiple times
Added a unit test for the bug just mentioned
Improved performance by not converting Querys that are not position sensitive to SpanQuerys. Non position sensitive Query clauses are treated the same way the standard Scorer would treat them.
Some refactoring based on the previous change.
Improved some of the JavaDoc comments

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

patch version 9 : Apply to root dir of trunk

Various small improvements.

Be sure to use the recently updated CachingTokenFilter for optimal performance.

asfimport commented 17 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Mark, wow, long list of files up there. I can't tell which ones are still relevant. Ah, only spanhighlighter9.patch, right?

It looks like all files in that patch are new files, that is, this is a parallel highlighter implementation - we can leave the old one in there and commit yours without worrying about breaking the old one. Could you add Apache license headers to all files, switch to 2 spaces for indentation, and then I think this can get committed?

Oh, and since contrib can be java 1.5+, I think you can use StringBuilder instead of StringBuffer, etc.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Requested changes have been made. Only relevant file now is spanhighlighter10.patch.

This is a parallel implementation...it uses all of the current Highlighter classes. Really, it is just a new Scorer implementation that scores position sensitive queries based on correct positions for a hit.

The whole approach was radically changed from the StringBuilder version, so all code is still Java 1.4 compatible.

I have been using this extensively with great success for a few months now.

asfimport commented 17 years ago

Andy Liu (migrated from JIRA)

I gave this patch a whirl, and it looks great.

I do see one problem. Say a document contains:

x y z a b y z

and the query is:

"x y z"

the highlighter will return (with terms in brackets denoting highlighted terms):

[x] [y] [z] a b [y] [z]

Since the last y and z are not part of the full phrase, they should not be highlighted.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I believe the issue is that turning a PhraseQuery into a representative Span query is only an approximate conversion.

I will look into whether or not I can improve this.

Thanks for the feedback.

Mark

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I made up a quick test to identify the behavior but did not duplicate your results:

The results of your example:

doc in index: x y z a b y z

Searching for: "x y z"

Result: x y z a b y z

Could you post some code demonstrating the problem?

asfimport commented 17 years ago

Andy Liu (migrated from JIRA)

Hmm, I tried it again and now it's working correctly. Maybe I had interpreted the output incorrectly. Sorry for the false alarm.

asfimport commented 17 years ago

Andy Liu (migrated from JIRA)

Ah, I wasn't crazy. I had the test data wrong. Here's the code I'm using to produce the failing result:

    String text = "y z x y z a b";

    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("body", analyzer);
    Query query = parser.parse("\"x y z\"");

    CachingTokenFilter tokenStream = new CachingTokenFilter(analyzer.tokenStream("body", new StringReader(text)));
    Highlighter highlighter = new Highlighter(new SpanScorer(query, "body", tokenStream));
    highlighter.setTextFragmenter(new NullFragmenter());
    tokenStream.reset();

    String result = highlighter.getBestFragments(tokenStream, text, 1, "...");
    System.out.println(result);

This produces:

y z x y z a b

The beginning y and z shouldn't be highlighted.

If I change the the beginning y and z to x and y, I get the correct result:

"x y x y z a b" => x y x y z a b

Here's a couple other failing results:

"z x y z a b" => z x y z a b "z a x y z a b" => z a x y z a b

FYI, I'm using the latest version of Lucene.

apache / lucene

Extend contrib Highlighter to properly support PhraseQuery, SpanQuery, ConstantScoreRangeQuery [LUCENE-794] #1869