apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Concordance/Key Word In Context (KWIC) capability [LUCENE-5317] #6381

Open asfimport opened 10 years ago

asfimport commented 10 years ago

This patch enables a Lucene-powered concordance search capability.

Concordances are extremely useful for linguists, lawyers and other analysts who perform analytic search rather than traditional snippeting/document retrieval tasks. By "analytic search," I mean that the user wants to browse every occurrence of a term (or at least the top n) in a subset of documents and see the words before and after it.

Concordance technology is far simpler and less interesting than IR relevance models/methods, but it can be extremely useful for some use cases.

Traditional concordance sort orders are available (sort on words before the target, words after, target then words before and target then words after).
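
To make the sort orders above concrete, here is a minimal plain-Java sketch. It is not the code from the patch: the class, field and comparator names are hypothetical, it matches a single whitespace-delimited word instead of a SpanQuery, and it assumes that "sort on words before" starts from the word nearest the target.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Illustrative only: all names here are hypothetical, not taken from the patch.
final class KwicSketch {

    /** One concordance row: words before the target, the target, words after. */
    static final class Window {
        final String pre, target, post;
        Window(String pre, String target, String post) {
            this.pre = pre; this.target = target; this.post = post;
        }
    }

    /** Collects up to `width` words on each side of every occurrence of `target`. */
    static List<Window> concordance(String text, String target, int width) {
        String[] words = text.trim().split("\\s+");
        List<Window> rows = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (!words[i].equalsIgnoreCase(target)) continue;
            int from = Math.max(0, i - width);
            int to = Math.min(words.length, i + 1 + width);
            String pre = String.join(" ", Arrays.copyOfRange(words, from, i));
            String post = String.join(" ", Arrays.copyOfRange(words, i + 1, to));
            rows.add(new Window(pre, words[i], post));
        }
        return rows;
    }

    /** Reverses word order so that sorting starts from the word nearest the target. */
    static String nearestFirst(String words) {
        List<String> parts = Arrays.asList(words.split(" "));
        Collections.reverse(parts);
        return String.join(" ", parts);
    }

    // The four traditional sort orders, compared case-insensitively.
    static final Comparator<Window> BY_PRE =
        Comparator.comparing(w -> nearestFirst(w.pre).toLowerCase(Locale.ROOT));
    static final Comparator<Window> BY_POST =
        Comparator.comparing(w -> w.post.toLowerCase(Locale.ROOT));
    static final Comparator<Window> BY_TARGET =
        Comparator.comparing(w -> w.target.toLowerCase(Locale.ROOT));
    static final Comparator<Window> BY_TARGET_THEN_PRE = BY_TARGET.thenComparing(BY_PRE);
    static final Comparator<Window> BY_TARGET_THEN_POST = BY_TARGET.thenComparing(BY_POST);

    public static void main(String[] args) {
        List<Window> rows = concordance(
            "the quick fox saw the lazy fox and the red fox", "fox", 2);
        rows.sort(BY_PRE);
        for (Window w : rows) {
            System.out.printf("%12s | %s | %s%n", w.pre, w.target, w.post);
        }
    }
}
```

Swapping `BY_PRE` for one of the other comparators changes the view without recomputing the windows, which is why the sort key is kept separate from window extraction in this sketch.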

Under the hood, this is running SpanQuery's getSpans() and reanalyzing to obtain character offsets. There is plenty of room for optimizations and refactoring.
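
As a rough illustration of the re-analysis step, the sketch below maps a span given in token positions back to character offsets by tokenizing the text again. It uses no Lucene classes: the regex tokenizer is a stand-in for the real analyzer, and the class and method names are hypothetical. The mapping only lines up if the stand-in tokenizer yields the same token positions as the analyzer used at index time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a regex tokenizer stands in for the index-time analyzer.
final class OffsetSketch {

    private static final Pattern TOKEN = Pattern.compile("\\w+");

    /** Re-tokenizes `text`; entry i holds {startOffset, endOffset} of token i. */
    static List<int[]> tokenOffsets(String text) {
        List<int[]> offsets = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            offsets.add(new int[] {m.start(), m.end()});
        }
        return offsets;
    }

    /** Returns the text covered by the token-position span [startPos, endPos). */
    static String spanText(String text, int startPos, int endPos) {
        List<int[]> offsets = tokenOffsets(text);
        int startChar = offsets.get(startPos)[0];   // first character of the first token
        int endChar = offsets.get(endPos - 1)[1];   // one past the last character of the last token
        return text.substring(startChar, endChar);
    }

    public static void main(String[] args) {
        String text = "Concordances help linguists, lawyers and analysts.";
        // Token positions 2 and 3 are "linguists" and "lawyers".
        System.out.println(spanText(text, 2, 4));
    }
}
```

Re-tokenizing each matching document is the obvious cost of this approach, which fits the note above that there is room for optimization.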

Many thanks to my colleague, Jason Robinson, for input on the design of this patch.


Migrated from LUCENE-5317 by Tim Allison (@tballison), 4 votes, updated Nov 24 2019
Attachments: concordance_v1.patch.gz, LUCENE-5317.patch (versions: 2), lucene5317v1.patch, lucene5317v2.patch

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

v1 of patch attached

asfimport commented 9 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Sync'd Tim's patch up to current trunk:

I left a few nocommits.

I plan on reviewing more.

asfimport commented 9 years ago

Tim Allison (@tballison) (migrated from JIRA)

Steve, thank you! I had abandoned hope and haven't been updating this patch on jira. The current version in my local repo looks a bit different now.

Let me apply your patch and see what the diff is between your cleanup/fixes and my current version.

asfimport commented 9 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Tim, FYI, I've used the ASF's ReviewBoard instance a few times recently. It's very nice for comparing two patches against each other, and it can be useful for detailed review too: https://reviews.apache.org/. After creating an account there, the workflow is: manually upload a patch, assign a reviewer (this could be the "lucene" group, in which case review requests go to the dev list, or any RB account-holder, including yourself), then publish. Thereafter anybody can review by clicking on one or more adjacent lines in a patch and attaching a comment, repeating until done, then publishing. The original review request creator can update the patch, and anybody can view the differences between any two versions of the patch and attach reviews to those differences.

asfimport commented 9 years ago

Tim Allison (@tballison) (migrated from JIRA)

Thank you, Steve. I created a lucene5317 branch on my github fork. I applied your patch and will start adding my local updates...there have been quite a few since I posted the initial patch.

When I'm happy enough with that, I'll put the patch on rb.

Thank you, again.

asfimport commented 9 years ago

Tim Allison (@tballison) (migrated from JIRA)

I added my latest source code and standalone jars to work with 4.10.2 to my lucene-addons repo in case anyone wants to try the code as is. There may be surprises.

The next step is to turn back to the lucene5317 branch in my fork and update the trunk code.

The biggest functional difference between the original patch from last October and the current working code in my repo is that I added multivalued field handling.

asfimport commented 9 years ago

Tim Allison (@tballison) (migrated from JIRA)

I merged in my local updates and pushed them to my fork on GitHub.

I didn't have luck posting this to the review board. When I tried to post it, I entered the base directory and was returned to the starting page without any error message. For the record, I'm sure that this is user error.

asfimport commented 9 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

"I didn't have luck posting this to the review board. When I tried to post it, I entered the base directory and was returned to the starting page without any error message. For the record, I'm sure that this is user error."

I've successfully used trunk (literally just that) for the base directory in the past - what did you use?

asfimport commented 9 years ago

Tim Allison (@tballison) (migrated from JIRA)

I made the mistake of following instructions and tried /trunk and /trunk/ yesterday. I tried with a git diff file yesterday, and I also just tried with a git diff --no-prefix file today, which seems to work with a traditional svn (patch attached). Today, I tried three variations of trunk. Still confident this is user error. Is there a size limit on diffs, or is there something screwy with the attached diff file?

asfimport commented 9 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

When I tried to make a new review request with your latest patch, I get this error:

The specified diff file could not be parsed. Line 2: No valid separator after the filename was found in the diff header

I've successfully applied your patch to my svn checkout (using svn patch), and I'm posting it here unchanged.

asfimport commented 9 years ago

Tim Allison (@tballison) (migrated from JIRA)

Great. Thank you. I just tried svn diff from the svn checkout that I had patched with the correct git diff...with no luck. I hadn't even svn-added the concordance directory, so the diff file was quite short.

Are you using rbtools or have you had luck with the web interface?

And success with installing rbtools:

Searching for RBTools
Reading https://pypi.python.org/simple/RBTools/
Download error on https://pypi.python.org/simple/RBTools/: [Errno 10061] No connection could be made because the target machine actively refused it -- Some packages may not be found!
Couldn't find index page for 'RBTools' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
Download error on https://pypi.python.org/simple/: [Errno 10061] No connection could be made because the target machine actively refused it -- Some packages may not be found!
No local packages or download links found for RBTools
error: Could not find suitable distribution for Requirement.parse('RBTools')

asfimport commented 9 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

One of the nice things about svn patch is that it automatically does the svn add stuff for you.

I've never tried rbtools, always used the web interface, no problems so far.

asfimport commented 9 years ago

Tim Allison (@tballison) (migrated from JIRA)

Switching to Chrome was the answer, apparently: link

Thank you for the tip on "svn patch"...I just upgraded my Linux svn to something more appropriate for this decade, and now patch is available. :)

asfimport commented 7 years ago

Tim Allison (@tballison) (migrated from JIRA)

Now available via Maven central.

<!-- https://mvnrepository.com/artifact/org.tallison.lucene/lucene-5317 -->
<dependency>
  <groupId>org.tallison.lucene</groupId>
  <artifactId>lucene-5317</artifactId>
  <version>6.1-0.2</version>
</dependency>

If anyone has an interest in helping me change the namespace back to org.apache.lucene, let me know. ;)

asfimport commented 7 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

Hi @tballison, I'd be interested in getting this into trunk, I agree it can be a useful analysis tool.

asfimport commented 7 years ago

Tim Allison (@tballison) (migrated from JIRA)

Great. I'll start a fresh branch in my fork, pick up where @sarowe left off and submit a new PR. It will probably take a few days, possibly until early next week. Thank you!

asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

GitHub user tballison opened a pull request:

https://github.com/apache/lucene-solr/pull/82

First draft of LUCENE-5317

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tballison/lucene-solr LUCENE-5317

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/82.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes `#82`

commit ea9fd7fdd4d94fd498f0188b9aab0c8cf48c7295
Author: tballison <tallison@mitre.org>
Date: 2016-09-23T19:19:22Z

Rough draft of LUCENE-5317.

commit 632c00980d1f7257b15b5dfde445168940dd423c
Author: tballison <tallison@mitre.org>
Date: 2016-09-23T19:20:36Z

Merge remote-tracking branch 'upstream/master' into LUCENE-5317

asfimport commented 7 years ago

Tim Allison (@tballison) (migrated from JIRA)

Rough, rough draft. Any and all comments are welcomed!

Many thanks, again, to @sarowe for doing the first draft of an actual integration w/ Lucene.

asfimport commented 7 years ago

Tim Allison (@tballison) (migrated from JIRA)

I received a personal email asking for some more background on this capability. Here goes (apologies for some repetition with the issue description)...

For an example of concordance output, see these slides: slides 23 and 24 cover LUCENE-5317, and slides 25-28 cover #6382.

The notion is that you present every time the term appears in the central column with x number of words to the left and right. The user can sort on words before the target term to see what modifies it, or the user can sort on words after the target term to see what it modifies, or the user can sort on order of appearance within the documents to effectively read everything in their docs that matters to them.

By target term, of course, I mean any term/phrase that can be represented by a SpanQuery.

This kind of view of the data is extremely helpful to linguists and philologists to understand how words are being used. It also has practical applications for anyone doing "analytic" search, that is, they want to see every time a term/phrase appears – lawyers, patent examiners, etc.

This view of the data is fundamentally different from snippets, which typically show the three or so best chunks where the search terms appear, and they're typically ordered per document. Snippets allow the user to determine if a document is relevant, then the user has to open the document. Snippets are great if users are seeking the best document to answer their information need.

For "analytic searchers", however, with concordance results, the user can be saved the step of having to open the document; they can see every time their term/phrase appears. Also, for "analytic searchers", if their documents are lengthy, the concordance allows them to see the potentially hundreds of times that their term/phrase appears in each document instead of the three or so snippets they might see with traditional search engines.

"But you can increase the number of snippets to whatever you want..." Yes, you can, but the layout of the concordance allows you to see patterns across documents very easily. Again, the results are sorted by words to the left or right, not by which document the target appeared in.

This link shows some output from a concordancer (AntConc). Wikipedia's best description is under key word in context (KWIC). If you're into tree-ware, Oakes has a great introduction to concordances among many other useful topics!

asfimport commented 7 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

thanks @tballison, I'll have a look at your patch tomorrow.

asfimport commented 7 years ago

Tim Allison (@tballison) (migrated from JIRA)

Thank you. It was initially developed in Notepad with Groovy and Lucene 2.4...some of that feel is still evident.

asfimport commented 6 years ago

Tim Allison (@tballison) (migrated from JIRA)

A prototype ASL 2.0 application that demonstrates the utility of the concordance is available: https://github.com/mitre/rhapsode