Open asfimport opened 9 years ago
Michael McCandless (@mikemccand) (migrated from JIRA)
Trunk (6.0) only fix version ...
Robert Muir (@rmuir) (migrated from JIRA)
Is there some reason why you would serialize this in the commit? FieldInfos is a much better place imo
Ryan Ernst (@rjernst) (migrated from JIRA)
+1 overall. This is sorely needed. I also think we should "level the playing field" for trunk: start by getting trunk back to the same state as 5x with the document api, so that if this is ready in time for 5.0, it can be much more easily backported.
ASF subversion and git services (migrated from JIRA)
Commit 1633312 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1633312
LUCENE-6005: make branch
ASF subversion and git services (migrated from JIRA)
Commit 1633314 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1633314
LUCENE-6005: work in progress
Michael McCandless (@mikemccand) (migrated from JIRA)
I committed the current work-in-progress to a new branch (https://svn.apache.org/repos/asf/lucene/dev/branches/lucene6005).
I added a new FieldTypes class (holds the optional write-once schema) and Document2 (to replace Document eventually).
Net/net I think the approach can work well: it's a minimally intrusive API to optionally build up the write-once schema. You can skip the API entirely and it will "learn" your schema by seeing which Java types you are adding to your documents and setting sensible defaults accordingly. It's quite a bit simpler than the current oal.document API: no more separate XXXField nor FieldType classes.
Indexed binary tokens work, via Document2.addAtom(...) (#7051).
You can turn on/off sorting for a field, and this "translates" to the appropriate DV type; I want to improve this by letting you specify the default sort order, and also [eventually] specify collator. I plan to similarly enable highlighting.
I also added search-time APIs, e.g. newSort, newTermQuery, newRangeQuery. These methods throw clear exceptions if the field name is unknown, or it wasn't indexed with a type that "matches" that method.
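The kind of check those search-time methods perform can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code on the branch: the class `SearchTimeChecksSketch`, its `ValueType` enum, the `register` method, and the string returned in place of a real query object are all hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of search-time validation; none of these names are Lucene API. */
final class SearchTimeChecksSketch {

    enum ValueType { ATOM, NUMBER }

    // Field name -> the value type the field was indexed with.
    private final Map<String, ValueType> fieldTypes = new HashMap<>();

    /** Stand-in for the schema learning or being told about a field. */
    void register(String field, ValueType type) {
        fieldTypes.put(field, type);
    }

    /** Validates the field, then returns a textual stand-in for a range query. */
    String newRangeQuery(String field, long min, long max) {
        ValueType type = fieldTypes.get(field);
        if (type == null) {
            throw new IllegalArgumentException("field \"" + field + "\" is unknown");
        }
        if (type != ValueType.NUMBER) {
            throw new IllegalArgumentException(
                "field \"" + field + "\": range queries need NUMBER, but this field is " + type);
        }
        return field + ":[" + min + " TO " + max + "]";
    }
}
```

The point of the sketch is only that the schema is consulted before any query is built, so a typo in a field name or a range query against a non-numeric field fails immediately with a message naming the field.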
There are still many issues and nocommits:
Analyzer is passed to FieldTypes now; I would like to remove it from IndexWriterConfig. To do this, I think I need to push multi-valued field handling out of IndexWriter up into "user space"... I already removed IndexableFieldType.tokenized as a first step.
Analyzers can't be serialized, so the app will have to re-initialize them on startup (like they must do anyway today with PFAW). Same for Similarity.
You can only set per-field DocValuesFormat (DVF) and PostingsFormat (PF).
I only cutover a couple tests, but they lose randomness since FieldTypes provides the default IWC, vs LTC.newIWC().
I had to suck in a fork of KeywordTokenizer.
ASF subversion and git services (migrated from JIRA)
Commit 1633597 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1633597
LUCENE-6005: add default sort order; don't use polymorphism with native types; add pos/offset gap; add highlighting; break out query and index analyzer
ASF subversion and git services (migrated from JIRA)
Commit 1634820 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1634820
LUCENE-6005: checkpoint current state
ASF subversion and git services (migrated from JIRA)
Commit 1634823 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1634823
LUCENE-6005: merge trunk
ASF subversion and git services (migrated from JIRA)
Commit 1635000 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1635000
LUCENE-6005: fix test failures
ASF subversion and git services (migrated from JIRA)
Commit 1635002 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1635002
LUCENE-6005: cutover to auto-prefix
ASF subversion and git services (migrated from JIRA)
Commit 1635898 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1635898
LUCENE-6005: StoredDocument -> Document2
ASF subversion and git services (migrated from JIRA)
Commit 1635908 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1635908
LUCENE-6005: merge trunk
ASF subversion and git services (migrated from JIRA)
Commit 1635912 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1635912
LUCENE-6005: add sort missing first/last
ASF subversion and git services (migrated from JIRA)
Commit 1636293 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1636293
LUCENE-6005: add Date, InetAddress types; add min/maxTokenLength; add maxTokenCount; use ValueType.NONE not null; each FieldType now stores the Lucene version it was created by
ASF subversion and git services (migrated from JIRA)
Commit 1636528 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1636528
LUCENE-6005: fix sneaky auto-prefix bug, cutover more tests
ASF subversion and git services (migrated from JIRA)
Commit 1637540 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1637540
LUCENE-6005: add UNIQUE_ATOM type (for primary key fields), which IW and CheckIndex enforce; add IW.getReaderManager(); add exists filter support (enabled by default); cutover some more tests / fix nocommits
ASF subversion and git services (migrated from JIRA)
Commit 1637544 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1637544
LUCENE-6005: merge trunk
ASF subversion and git services (migrated from JIRA)
Commit 1638066 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1638066
LUCENE-6005: cutover more tests
ASF subversion and git services (migrated from JIRA)
Commit 1638204 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1638204
LUCENE-6005: merge trunk
ASF subversion and git services (migrated from JIRA)
Commit 1640053 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1640053
LUCENE-6005: checkpoint current changes
ASF subversion and git services (migrated from JIRA)
Commit 1640099 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1640099
LUCENE-6005: merge trunk
ASF subversion and git services (migrated from JIRA)
Commit 1642110 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1642110
LUCENE-6005: checkpoint
ASF subversion and git services (migrated from JIRA)
Commit 1642229 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1642229
LUCENE-6005: merge trunk
ASF subversion and git services (migrated from JIRA)
Commit 1642230 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1642230
LUCENE-6005: merge trunk
ASF subversion and git services (migrated from JIRA)
Commit 1642535 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1642535
LUCENE-6005: checkpoint
ASF subversion and git services (migrated from JIRA)
Commit 1642537 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1642537
LUCENE-6005: merge trunk
ASF subversion and git services (migrated from JIRA)
Commit 1643659 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1643659
LUCENE-6005: checkpoint
ASF subversion and git services (migrated from JIRA)
Commit 1643662 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1643662
LUCENE-6005: merge trunk
ASF subversion and git services (migrated from JIRA)
Commit 1649347 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1649347
LUCENE-6005: merge trunk
ASF subversion and git services (migrated from JIRA)
Commit 1656281 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1656281
LUCENE-6005: checkpoint
ASF subversion and git services (migrated from JIRA)
Commit 1658277 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1658277
LUCENE-6005: merge trunk
Auto-prefix terms (#6941) is blocked because it's impossible in Lucene today to add a simple API to use it, and I don't think we should commit features that only super-experts can figure out how to use: that's evil.
The only realistic "workaround" for such new features is to instead add them directly to the various servers on top of Lucene, since they all already have nice schema APIs.
I opened #7051 to try to take at least a baby step towards making it easier to use auto-prefix terms, so you can easily add singleton binary tokens, but even that has proven controversial.
Net/net I think we have to solve the root cause of this by fixing the Document/Field/FieldType API so that new index-level features can have a usable API, properly defaulted for the right types of fields.
Towards that, I'm exploring a replacement for Document/Field/FieldType. The idea is to expose simple methods on the document class (no more separate Field and FieldType classes):
And then expose a separate FieldTypes class, that you pass to ctor of the new document class, which lets you set all the various per-field settings (stored, doc values, etc.). E.g.:
FieldTypes is a write-once schema, and it throws exceptions if you try to make invalid changes once a given setting is already written (e.g. enabling norms after having disabled them). It will (I haven't implemented this yet) save its state into IndexWriter's commitData, so it's available when you open a new IndexWriter for append and when you open a reader.
It has methods to set all the per-field settings (analyzer, stored, term vectors, norms, index options, doc values type), and chooses "reasonable" defaults based on the value's type when it suddenly sees a new field. For example, when you add a number, it's indexed for range querying and sorting (numeric doc values) by default.
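To make the write-once behaviour and the type-based defaults concrete, here is a minimal, self-contained sketch. It is not the proposed FieldTypes class: `WriteOnceSchemaSketch`, `FieldSettings`, `observe`, and the two-value `ValueType` enum are hypothetical names, and the only rules modelled are the ones described above (numbers default to range querying and sorting; norms cannot be re-enabled once disabled), plus an assumed rule that a field's value type cannot change after it is first seen.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the FieldTypes idea; none of these names are Lucene API. */
final class WriteOnceSchemaSketch {

    enum ValueType { TEXT, NUMBER }

    /** Settings recorded for one field. */
    static final class FieldSettings {
        final ValueType valueType;   // fixed the first time the field is seen
        boolean sortable;            // would translate to a doc values type
        boolean rangeQueries;        // numbers are indexed for range querying
        boolean normsDisabled;       // once true, it can never go back to false

        FieldSettings(ValueType valueType) {
            this.valueType = valueType;
        }
    }

    private final Map<String, FieldSettings> fields = new HashMap<>();

    /** Records a field the first time a value is added; rejects a later type change (assumed rule). */
    FieldSettings observe(String name, Object value) {
        ValueType seen = (value instanceof Number) ? ValueType.NUMBER : ValueType.TEXT;
        FieldSettings settings = fields.get(name);
        if (settings == null) {
            settings = new FieldSettings(seen);
            if (seen == ValueType.NUMBER) {
                // Default described in the issue: numbers support range queries and sorting.
                settings.rangeQueries = true;
                settings.sortable = true;
            }
            fields.put(name, settings);
        } else if (settings.valueType != seen) {
            throw new IllegalStateException(
                "field \"" + name + "\": already " + settings.valueType + ", cannot change to " + seen);
        }
        return settings;
    }

    void disableNorms(String name) {
        require(name).normsDisabled = true;
    }

    /** Enabling norms is an invalid change once they have been disabled. */
    void enableNorms(String name) {
        if (require(name).normsDisabled) {
            throw new IllegalStateException(
                "field \"" + name + "\": norms were already disabled and cannot be re-enabled");
        }
    }

    private FieldSettings require(String name) {
        FieldSettings settings = fields.get(name);
        if (settings == null) {
            throw new IllegalArgumentException("unknown field \"" + name + "\"");
        }
        return settings;
    }
}
```

For example, `observe("price", 17)` would record `price` as a sortable, range-queryable number without any explicit configuration, while calling `disableNorms("title")` followed by `enableNorms("title")` would throw.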
FieldTypes provides the analyzer and codec (a little messy) that you pass to IndexWriterConfig. Since it's effectively a persistent schema, it knows all about the available fields at search time, so we could use it to create queries (checking if they are valid given that field's type). Query parsers and highlighters could consult it. Default UIs (above Lucene) could use it, etc. This is all in the future; I think for this issue the goal should be to "just" provide a "better" index-time API but not yet make use of it at search time.
So with this change, for auto-prefix terms, we could add an "enable range queries/filters" option, but then validate that the selected postings format supports such an option.
I know this exploration will be horribly controversial, but realistically I don't think Lucene can move on much further if we can't finally address this schema problem head on.
This is long overdue.
Migrated from LUCENE-6005 by Michael McCandless (@mikemccand), 1 vote, updated May 09 2016