apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Positional joins [LUCENE-5627] #6689

Closed asfimport closed 8 years ago

asfimport commented 10 years ago

Prototype of analysis and search for labeled fragments


Migrated from LUCENE-5627 by Paul Elschot, resolved Apr 20 2016 Attachments: LUCENE-5627-20141126.patch, LUCENE-5627-20150525.patch

asfimport commented 10 years ago

ASF GitHub Bot (migrated from JIRA)

GitHub user PaulElschot opened a pull request:

https://github.com/apache/lucene-solr/pull/46

Labeledfragments 201404a

LUCENE-5627

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/PaulElschot/lucene-solr labeledfragments-201404a

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/46.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes `#46`

commit a214913ac2143277d9539a3e9e3d1cd1662b754a Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-03-12T21:21:13Z

Squashed commit of efbytesref, 20140312

commit 4c3db731b634365fb50df35f3eea562c9b51015a Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-04-22T23:06:06Z

Squashed commit of labeled fragments code.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

This adds a module called "label" as a prototype for index-time positional joins by labeled text fragments.

This provides a 1 : 0..n positional join. It is a generalization of FieldMaskingSpanQuery that provides a 1 : 1 positional join.

At indexing time labeled text fragments for a document are analysed from a TokenStream.

In package org.apache.lucene.analysis.label such a labeled fragments stream is split into a label stream, and into pairs of streams for fragments and fragment positions. A fragment is a series of tokens, possibly empty. The fragments in each fragment stream will be contiguous; the labels and the other fragment streams have no influence on their positions.
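The split described above can be sketched in plain Java, without Lucene's TokenStream API. This is only an illustration under assumptions: the class `LabeledFragmentSplitter` and its fields are hypothetical names, not classes from the patch, and a single fragment field is used where the patch allows several.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the patch's API: splits labeled fragments into three outputs. */
final class LabeledFragmentSplitter {
    /** Label stream: label i sits at position i. */
    final List<String> labels = new ArrayList<>();
    /** Fragment stream: tokens of all fragments, laid out contiguously. */
    final List<String> fragmentTokens = new ArrayList<>();
    /** Fragment positions: exclusive end position of the fragment of label i. */
    final List<Integer> fragmentEnds = new ArrayList<>();

    /** Adds one labeled fragment; the fragment may have zero tokens. */
    void add(String label, String... tokens) {
        labels.add(label);
        for (String token : tokens) {
            fragmentTokens.add(token);
        }
        // An empty fragment repeats the previous end position.
        fragmentEnds.add(fragmentTokens.size());
    }

    public static void main(String[] args) {
        LabeledFragmentSplitter splitter = new LabeledFragmentSplitter();
        splitter.add("title", "positional", "joins");
        splitter.add("note"); // empty fragment
        splitter.add("body", "labeled", "text", "fragments");
        System.out.println(splitter.labels);
        System.out.println(splitter.fragmentTokens);
        System.out.println(splitter.fragmentEnds);
    }
}
```

Each of the three lists could then be fed to its own field; how that association is made is left to the user, as noted below.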

The output streams can be used to provide documents with different fields per stream. It is up to the user to associate the output streams with fields in documents to be indexed for search.

Labels and fragments are represented at query time by Spans. Querying labeled fragments with positional joins is supported in package org.apache.lucene.search.spans.label.
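To make the 1 : 0..n association concrete, here is a minimal sketch of the position arithmetic behind such a join. It does not use Spans or any class from the patch; `PositionalJoinSketch` is a hypothetical name, and it assumes that fragments are contiguous so that only the exclusive end position per label needs to be stored.

```java
/** Hypothetical sketch of a 1 : 0..n positional join between labels and fragment positions. */
final class PositionalJoinSketch {
    /** Exclusive end position of the fragment of each label; non-decreasing. */
    private final int[] ends;

    PositionalJoinSketch(int[] fragmentLengths) {
        ends = new int[fragmentLengths.length];
        int position = 0;
        for (int i = 0; i < fragmentLengths.length; i++) {
            position += fragmentLengths[i];
            ends[i] = position;
        }
    }

    /** Label to fragment: returns {start, end} with end exclusive; start == end for an empty fragment. */
    int[] fragmentOf(int labelPosition) {
        int start = labelPosition == 0 ? 0 : ends[labelPosition - 1];
        return new int[] {start, ends[labelPosition]};
    }

    /** Fragment to label: the first label whose end lies beyond the position, or -1 if none. */
    int labelOf(int fragmentPosition) {
        if (fragmentPosition < 0) {
            return -1;
        }
        int low = 0;
        int high = ends.length;
        while (low < high) {
            int mid = (low + high) >>> 1;
            if (ends[mid] > fragmentPosition) {
                high = mid;
            } else {
                low = mid + 1;
            }
        }
        return low < ends.length ? low : -1;
    }

    public static void main(String[] args) {
        // Three labels with fragments of 2, 0 and 3 tokens.
        PositionalJoinSketch join = new PositionalJoinSketch(new int[] {2, 0, 3});
        int[] range = join.fragmentOf(2);
        System.out.println("label 2 covers [" + range[0] + ", " + range[1] + ")");
        System.out.println("position 4 belongs to label " + join.labelOf(4));
    }
}
```

When every fragment has exactly one token, both directions reduce to the identity on positions, which is the 1 : 1 case.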

This implementation uses EliasFanoBytes (#6587) to compress a payload with start/end positions. The encoded sequences have a value index, which allows for fast fragment to label associations. Currently they have no position index, so label to fragment associations will be somewhat slower. Since payloads need to be loaded completely during searches, performance will suffer for larger payloads.
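For readers unfamiliar with the encoding, the following is a simplified sketch of Elias-Fano coding for a non-decreasing sequence such as fragment end positions. It is not EliasFanoBytes from #6587: the class name is hypothetical, the low bits are kept unpacked in a `long[]` for readability, and both lookups scan linearly where an indexed implementation would skip ahead.

```java
import java.util.BitSet;

/** Hypothetical, simplified Elias-Fano sketch for a non-decreasing sequence of non-negative values. */
final class EliasFanoSketch {
    private final int size;
    private final int numLowBits;
    private final long[] lowBits;              // low bits per value, unpacked for readability
    private final BitSet highBits = new BitSet(); // high parts, unary coded

    EliasFanoSketch(long[] sortedValues) {
        size = sortedValues.length;
        long upperBound = size == 0 ? 0 : sortedValues[size - 1];
        long ratio = size == 0 ? 0 : upperBound / size;
        // Number of low bits: floor(log2(upperBound / size)), or 0 when the ratio is 0.
        numLowBits = ratio == 0 ? 0 : 63 - Long.numberOfLeadingZeros(ratio);
        lowBits = new long[size];
        long lowMask = (1L << numLowBits) - 1;
        for (int i = 0; i < size; i++) {
            lowBits[i] = sortedValues[i] & lowMask;
            long high = sortedValues[i] >>> numLowBits;
            // Value i is marked by a set bit at position high + i.
            highBits.set((int) (high + i));
        }
    }

    /** Decodes the value at the given index by locating its set bit in the high bits. */
    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int bit = -1;
        for (int i = 0; i <= index; i++) {
            bit = highBits.nextSetBit(bit + 1);
        }
        return ((long) (bit - index) << numLowBits) | lowBits[index];
    }

    /** Returns the index of the first value >= target, or size if there is none (linear scan). */
    int advanceTo(long target) {
        int bit = -1;
        for (int i = 0; i < size; i++) {
            bit = highBits.nextSetBit(bit + 1);
            long value = ((long) (bit - i) << numLowBits) | lowBits[i];
            if (value >= target) {
                return i;
            }
        }
        return size;
    }

    public static void main(String[] args) {
        EliasFanoSketch ends = new EliasFanoSketch(new long[] {2, 3, 5, 7, 11, 13, 24});
        System.out.println(ends.get(4));
        System.out.println(ends.advanceTo(8));
    }
}
```

In this sketch `advanceTo` plays the role of a lookup by value (fragment position to label), and `get` the role of a lookup by position (label to fragment boundary).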

This is a prototype because I don't expect high performance for larger payloads. All code javadocs are marked experimental.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

The name "label" is already used in the facet module in Lucene, e.g. FacetLabel.java. I don't think this is problematic, but in case this causes confusion another name could be used here.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

The pull request also changes the javadocs of the join module to use "document join" instead of just "join".

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

The commit for the PR also contains LeafFragmentsQuery.java, which is actually not needed here now. It slipped in from an extension that puts the labels in a tree. I'll open an issue for that later...

asfimport commented 10 years ago

ASF GitHub Bot (migrated from JIRA)

GitHub user PaulElschot opened a pull request:

https://github.com/apache/lucene-solr/pull/51

Labeledtree 201405a1

LUCENE-5627

This closes `#46`

Major overhaul:

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/PaulElschot/lucene-solr labeledtree-201405a1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/51.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes `#51`

commit d35d6ceb669d30dc169dd3e2fbb993833e2c1a82 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-05-13T16:13:24Z

efbytesref 201405a1

commit 130cf392b6080b72ff671e463d59742d5bce7f23 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-05-13T16:26:10Z

labeledtree 201405a1

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

When #6749 is committed, the PrefillTokenStream class should be used from the analysis common module and removed here.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

The javadocs here contain some references on what was used to make this. Meanwhile I had another look around and found two somewhat similar implementations:

Luxdb: https://github.com/msokolov/lux This uses a TaggedTokenStream for the XML tags, see http://www.slideshare.net/lucenerevolution/querying-rich-text-with-xquery

Fangorn: https://code.google.com/p/fangorn/ This indexes each tag by adding a payload with four position numbers (left, right, depth, parent). Its target is large treebanks of linguistically parsed text.

A first impression: Both are based on Lucene and add a tree of XML tags like the label tree here. They have a query language implementation which is not available here. They do not have labeled fragments in the sense of having 0..n tokens in more than one field that can form a single leaf in the tag tree.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

Created #6820 to extend SpanQueryParser with positional joins.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

I have started on code for a field schema for the positional joins. So far this affects only the test code here; it involves replacing a lot of constants with references to the schema.

The idea is to post this schema here when it can also provide positional join queries to the extended SpanQueryParser.

asfimport commented 10 years ago

ASF GitHub Bot (migrated from JIRA)

GitHub user PaulElschot opened a pull request:

https://github.com/apache/lucene-solr/pull/61

Labeledtree 201406a

LUCENE-5627

This closes `#51`

Add LabelFieldSchema and PositionalJoinQueryFactory for use in a query parser.

Improved module dependencies for label module.

Based on recent efbytesref, #6587, which is based on recent trunk.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/PaulElschot/lucene-solr labeledtree-201406a

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/61.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes `#61`

commit 5ef98c1a655d9b79a01f8ffc884676076dbbec47 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-06-29T09:40:04Z

efbytesref as of 20140629

commit f7f5dd63f55bf54c38bf3535df58d4fec557d626 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-06-29T09:48:59Z

labeledtree as of 20140629

asfimport commented 10 years ago

ASF GitHub Bot (migrated from JIRA)

GitHub user PaulElschot opened a pull request:

https://github.com/apache/lucene-solr/pull/63

Labeledtree 201407a

Labeledtree as of 20140717

LUCENE-5627

This closes `#61`

Use eliasfano package (#6587) and move LongsInBytes to new o.a.l.packed.label package.
Otherwise no changes to pull request 61.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/PaulElschot/lucene-solr labeledtree-201407a

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/63.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes `#63`

commit 51a76c7789f38c2644f4019552a41c8199884557 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-07-17T19:59:29Z

efbytesref of 20140717, move Elias-Fano code to eliasfano package

commit 1e476b37a6b6df8f47426285dd82c9b93cd75643 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-07-17T20:04:37Z

labeledtree of 20140717, use eliasfano package, move LongsInBytes to new o.a.l.packed.label package

asfimport commented 10 years ago

ASF GitHub Bot (migrated from JIRA)

GitHub user PaulElschot opened a pull request:

https://github.com/apache/lucene-solr/pull/87

Labeledtree 201408a

LUCENE-5627

This closes `#63`

Update to recent trunk.
Move LongsInBytes to the correct package; this was caught by building javadocs.
In the test code remove version arguments to IndexWriterConfig and WhitespaceTokenizer.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/PaulElschot/lucene-solr labeledtree-201408a

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/87.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes `#87`

commit 0af9647ae410688219367e1e445aa27abf485e4e Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-08-17T20:21:42Z

efbytesref of 20140817

commit 218e5ba30cbbcd1e5eda70869a3252eb4b525c55 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-08-17T20:23:28Z

labeledtree of 20140817

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

Update to trunk of today, depends on #6587 of today

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

Patch of 20150525 against 5.2 branch. Includes PrefillTokenStream from #6749 and an EliasFano sequence of #6587.

asfimport commented 9 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot commented on the pull request:

https://github.com/apache/lucene-solr/pull/87#issuecomment-105224253

Superseded by today's patch at LUCENE-5627

asfimport commented 9 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot closed the pull request at:

https://github.com/apache/lucene-solr/pull/87

asfimport commented 9 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot commented on the pull request:

https://github.com/apache/lucene-solr/pull/86#issuecomment-105224437

Superseded by today's patch at LUCENE-5627

asfimport commented 9 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot commented on the pull request:

https://github.com/apache/lucene-solr/pull/63#issuecomment-105225140

See LUCENE-5627

asfimport commented 9 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot closed the pull request at:

https://github.com/apache/lucene-solr/pull/63

asfimport commented 9 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot commented on the pull request:

https://github.com/apache/lucene-solr/pull/61#issuecomment-105225260

See LUCENE-5627

asfimport commented 9 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot closed the pull request at:

https://github.com/apache/lucene-solr/pull/61

asfimport commented 9 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot commented on the pull request:

https://github.com/apache/lucene-solr/pull/51#issuecomment-105225333

See LUCENE-5627

asfimport commented 9 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot closed the pull request at:

https://github.com/apache/lucene-solr/pull/51

asfimport commented 9 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot commented on the pull request:

https://github.com/apache/lucene-solr/pull/46#issuecomment-105225447

See LUCENE-5627

asfimport commented 9 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot closed the pull request at:

https://github.com/apache/lucene-solr/pull/46

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

The patch of 20150525 against the 5.2 branch has:

The util.eliasfano package has a new BitSelect class which is based on the recently removed BitUtil.select method. I'll move the util.eliasfano package into the label module here later; in the patch it is still in the core module.

The patch was prepared on an svn checkout of the 5.2 branch. I'd rather use git, but the 5.2 branch is not yet available in the git mirror, see INFRA-9182.