ASF GitHub Bot (migrated from JIRA)
GitHub user PaulElschot opened a pull request:
https://github.com/apache/lucene-solr/pull/46
Labeledfragments 201404a
LUCENE-5627
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/PaulElschot/lucene-solr labeledfragments-201404a
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/lucene-solr/pull/46.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes `#46`
commit a214913ac2143277d9539a3e9e3d1cd1662b754a Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-03-12T21:21:13Z
Squashed commit of efbytesref, 20140312
commit 4c3db731b634365fb50df35f3eea562c9b51015a Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-04-22T23:06:06Z
Squashed commit of labeled fragments code.
Paul Elschot (migrated from JIRA)
This adds a module called "label" as a prototype for index-time positional joins by labeled text fragments.
This provides a 1 : 0..n positional join. It generalizes FieldMaskingSpanQuery, which provides a 1 : 1 positional join.
At indexing time labeled text fragments for a document are analysed from a TokenStream.
In package org.apache.lucene.analysis.label such a labeled fragments stream is split into a label stream, and into pairs of streams for fragments and fragment positions. A fragment is a series of tokens, possibly empty. The fragments in each fragment stream will be contiguous; the labels and the other fragment streams have no influence on their positions.
The output streams can be used to provide documents with different fields per stream. It is up to the user to associate the output streams with fields in documents to be indexed for search.
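To make the split concrete, here is a minimal sketch in plain Java that does not use Lucene at all. The class LabeledFragmentSplit and its methods are hypothetical names for illustration, not code from the patch, and only a single fragment field is modeled. It shows how fragments of one field are laid end to end and how a separate list of start positions keeps track of which tokens belong to which label, including an empty fragment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, Lucene-free model of splitting labeled fragments into streams. */
public class LabeledFragmentSplit {

    /** Tokens of all fragments of one fragment field, laid end to end. */
    public static List<String> flatten(List<List<String>> fragments) {
        List<String> tokens = new ArrayList<>();
        for (List<String> fragment : fragments) {
            tokens.addAll(fragment);
        }
        return tokens;
    }

    /**
     * Start position of each fragment in the flattened stream, plus one
     * final entry holding the total token count. Fragment i covers
     * positions starts[i] (inclusive) to starts[i + 1] (exclusive).
     */
    public static int[] fragmentStarts(List<List<String>> fragments) {
        int[] starts = new int[fragments.size() + 1];
        int position = 0;
        for (int i = 0; i < fragments.size(); i++) {
            starts[i] = position;
            position += fragments.get(i).size();
        }
        starts[fragments.size()] = position;
        return starts;
    }

    public static void main(String[] args) {
        // Three labels; the second one has an empty fragment in this field.
        List<String> labels = Arrays.asList("title", "note", "body");
        List<List<String>> fragments = Arrays.asList(
            Arrays.asList("positional", "joins"),
            Arrays.<String>asList(),
            Arrays.asList("labeled", "text", "fragments"));

        System.out.println(labels);
        System.out.println(flatten(fragments));
        System.out.println(Arrays.toString(fragmentStarts(fragments)));
    }
}
```

For this input the start positions are 0, 2, 2, 5: the label at index 1 joins to zero tokens, which is the "0" case of the 1 : 0..n join. The start positions depend only on the fragments of this one field, not on the labels or on any other fragment field.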
Labels and fragments are represented at query time by Spans. Querying labeled fragments with positional joins is supported in package org.apache.lucene.search.spans.label.
This implementation uses EliasFanoBytes (#6587) to compress a payload with start/end positions. These have a value index, which allows for fast fragment-to-label associations. Currently these have no position index, so label-to-fragment associations will be somewhat slower. Since payloads need to be loaded completely during searches, this will not have high performance for larger payloads.
This is a prototype because I don't expect high performance for larger payloads. All code javadocs are marked experimental.
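For readers unfamiliar with Elias-Fano encoding, the following is a minimal sketch of the general technique applied to a non-decreasing list of positions. It is not the EliasFanoBytes code from #6587: the class EliasFanoSketch is a hypothetical name, it stores bits in java.util.BitSet rather than in a payload byte array, and it finds the i-th entry by a linear scan where a real implementation would use an index.

```java
import java.util.BitSet;

/** Hypothetical minimal Elias-Fano encoding of a non-decreasing int sequence. */
public class EliasFanoSketch {
    private final int size;
    private final int lowBitCount;            // number of low bits stored verbatim per value
    private final BitSet low = new BitSet();  // low bits of all values, packed back to back
    private final BitSet high = new BitSet(); // high parts, unary coded: bit (high + index) is set

    /** Encodes values that are non-negative and sorted in non-decreasing order. */
    public EliasFanoSketch(int[] sorted) {
        size = sorted.length;
        int upper = size == 0 ? 0 : sorted[size - 1];
        int ratio = size == 0 ? 0 : upper / size;
        // floor(log2(upper / size)), or 0 when the ratio is below 1
        lowBitCount = ratio == 0 ? 0 : 31 - Integer.numberOfLeadingZeros(ratio);
        for (int i = 0; i < size; i++) {
            int value = sorted[i];
            for (int b = 0; b < lowBitCount; b++) {
                if (((value >>> b) & 1) != 0) {
                    low.set(i * lowBitCount + b);
                }
            }
            high.set((value >>> lowBitCount) + i);
        }
    }

    /** Decodes the value at the given index (0-based). */
    public int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        // Linear "select": walk to the index-th set bit in the high bits.
        int pos = high.nextSetBit(0);
        for (int k = 0; k < index; k++) {
            pos = high.nextSetBit(pos + 1);
        }
        int value = (pos - index) << lowBitCount;
        for (int b = 0; b < lowBitCount; b++) {
            if (low.get(index * lowBitCount + b)) {
                value |= 1 << b;
            }
        }
        return value;
    }

    public static void main(String[] args) {
        int[] starts = {0, 3, 3, 7, 40, 41, 100};
        EliasFanoSketch encoded = new EliasFanoSketch(starts);
        for (int i = 0; i < starts.length; i++) {
            System.out.println(i + " -> " + encoded.get(i));
        }
    }
}
```

Repeated values, which occur when a fragment is empty, encode without any special handling because the index is added to the high part before the bit is set. The linear scan in get is the part that an index over the encoded sequence would speed up.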
Paul Elschot (migrated from JIRA)
The name "label" is already used in the facet module in Lucene, e.g. FacetLabel.java. I don't think this is problematic, but in case this causes confusion another name could be used here.
Paul Elschot (migrated from JIRA)
The pull request also changes the javadocs of the join module to use "document join" instead of just "join".
Paul Elschot (migrated from JIRA)
The commit for the PR also contains LeafFragmentsQuery.java, which is actually not needed here now. It slipped in from an extension that puts the labels in a tree. I'll open an issue for that later...
ASF GitHub Bot (migrated from JIRA)
GitHub user PaulElschot opened a pull request:
https://github.com/apache/lucene-solr/pull/51
Labeledtree 201405a1
LUCENE-5627
This closes `#46`
Major overhaul:
Added an analyzer for XML: tags become labels; attributes and text become fragments.
Based on recent efbytesref, #6587, which is based on recent trunk.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/PaulElschot/lucene-solr labeledtree-201405a1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/lucene-solr/pull/51.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes `#51`
commit d35d6ceb669d30dc169dd3e2fbb993833e2c1a82 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-05-13T16:13:24Z
efbytesref 201405a1
commit 130cf392b6080b72ff671e463d59742d5bce7f23 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-05-13T16:26:10Z
labeledtree 201405a1
Paul Elschot (migrated from JIRA)
When #6749 is committed the PrefillTokenStream class should be used from the analysis common module, and removed here.
Paul Elschot (migrated from JIRA)
The javadocs here contain some references to what was used to make this. Meanwhile I had another look around and found two somewhat similar implementations:
Luxdb: https://github.com/msokolov/lux This uses a TaggedTokenStream for the XML tags, see http://www.slideshare.net/lucenerevolution/querying-rich-text-with-xquery
Fangorn: https://code.google.com/p/fangorn/ This indexes each tag by adding a payload with four position numbers (left, right, depth, parent). Its target is large treebanks of linguistically parsed text.
A first impression: Both are based on Lucene and add a tree of XML tags like the label tree here. They have a query language implementation which is not available here. They do not have labeled fragments in the sense of having 0..n tokens in more than one field that can form a single leaf in the tag tree.
Paul Elschot (migrated from JIRA)
Created #6820 to extend SpanQueryParser with positional joins.
Paul Elschot (migrated from JIRA)
I have started on code for a field schema for the positional joins. So far this affects only the test code here; it involves replacing a lot of constants with references to the schema.
The idea is to post this schema here when it can also provide positional join queries to the extended SpanQueryParser.
ASF GitHub Bot (migrated from JIRA)
GitHub user PaulElschot opened a pull request:
https://github.com/apache/lucene-solr/pull/61
Labeledtree 201406a
LUCENE-5627
This closes `#51`
Add LabelFieldSchema and PositionalJoinQueryFactory for use in a query parser.
Improved module dependencies for label module.
Based on recent efbytesref, #6587, which is based on recent trunk.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/PaulElschot/lucene-solr labeledtree-201406a
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/lucene-solr/pull/61.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes `#61`
commit 5ef98c1a655d9b79a01f8ffc884676076dbbec47 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-06-29T09:40:04Z
efbytesref as of 20140629
commit f7f5dd63f55bf54c38bf3535df58d4fec557d626 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-06-29T09:48:59Z
labeledtree as of 20140629
ASF GitHub Bot (migrated from JIRA)
GitHub user PaulElschot opened a pull request:
https://github.com/apache/lucene-solr/pull/63
Labeledtree 201407a
Labeledtree as of 20140717
LUCENE-5627
This closes `#61`
Use eliasfano package (#6587) and move LongsInBytes to new o.a.l.packed.label package.
Otherwise no changes to pull request 61.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/PaulElschot/lucene-solr labeledtree-201407a
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/lucene-solr/pull/63.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes `#63`
commit 51a76c7789f38c2644f4019552a41c8199884557 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-07-17T19:59:29Z
efbytesref of 20140717, move Elias-Fano code to eliasfano package
commit 1e476b37a6b6df8f47426285dd82c9b93cd75643 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-07-17T20:04:37Z
labeledtree of 20140717, use eliasfano package, move LongsInBytes to new o.a.l.packed.label package
ASF GitHub Bot (migrated from JIRA)
GitHub user PaulElschot opened a pull request:
https://github.com/apache/lucene-solr/pull/87
Labeledtree 201408a
LUCENE-5627
This closes `#63`
Update to recent trunk.
Move LongsInBytes to the correct package; this was caught by building javadocs.
In the test code remove version arguments to IndexWriterConfig and WhitespaceTokenizer.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/PaulElschot/lucene-solr labeledtree-201408a
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/lucene-solr/pull/87.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes `#87`
commit 0af9647ae410688219367e1e445aa27abf485e4e Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-08-17T20:21:42Z
efbytesref of 20140817
commit 218e5ba30cbbcd1e5eda70869a3252eb4b525c55 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-08-17T20:23:28Z
labeledtree of 20140817
Paul Elschot (migrated from JIRA)
Update to trunk of today, depends on #6587 of today
Paul Elschot (migrated from JIRA)
Patch of 20150525 against 5.2 branch. Includes PrefillTokenStream from #6749 and an EliasFano sequence of #6587.
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot commented on the pull request:
https://github.com/apache/lucene-solr/pull/87#issuecomment-105224253
Superseded by today's patch at LUCENE-5627
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot closed the pull request at:
https://github.com/apache/lucene-solr/pull/87
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot commented on the pull request:
https://github.com/apache/lucene-solr/pull/86#issuecomment-105224437
Superseded by today's patch at LUCENE-5627
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot commented on the pull request:
https://github.com/apache/lucene-solr/pull/63#issuecomment-105225140
See LUCENE-5627
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot closed the pull request at:
https://github.com/apache/lucene-solr/pull/63
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot commented on the pull request:
https://github.com/apache/lucene-solr/pull/61#issuecomment-105225260
See LUCENE-5627
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot closed the pull request at:
https://github.com/apache/lucene-solr/pull/61
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot commented on the pull request:
https://github.com/apache/lucene-solr/pull/51#issuecomment-105225333
See LUCENE-5627
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot closed the pull request at:
https://github.com/apache/lucene-solr/pull/51
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot commented on the pull request:
https://github.com/apache/lucene-solr/pull/46#issuecomment-105225447
See LUCENE-5627
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot closed the pull request at:
https://github.com/apache/lucene-solr/pull/46
Paul Elschot (migrated from JIRA)
In the patch of 20150525 against the 5.2 branch, the util.eliasfano package has a new BitSelect class, which is based on the recently removed BitUtil.select method. I'll move the util.eliasfano package into the label module here later; in the patch it is in the core module.
The patch was prepared on an svn checkout of the 5.2 branch. I'd rather use git, but the 5.2 branch is not yet available in the git mirror, see INFRA-9182.
Prototype of analysis and search for labeled fragments
Migrated from LUCENE-5627 by Paul Elschot, resolved Apr 20 2016 Attachments: LUCENE-5627-20141126.patch, LUCENE-5627-20150525.patch