Explore alternative to Document/Field/FieldType API [LUCENE-6005]

asfimport commented 9 years ago

Auto-prefix terms (#6941) is blocked because it's impossible in Lucene today to add a simple API to use it, and I don't think we should commit features that only super-experts can figure out how to use: that's evil.

The only realistic "workaround" for such new features is to instead add them directly to the various servers on top of Lucene, since they all already have nice schema APIs.

I opened #7051 to try to take at least a baby step towards making auto-prefix terms easier to use, so you can easily add singleton binary tokens, but even that has proven controversial.

Net/net I think we have to solve the root cause of this by fixing the Document/Field/FieldType API so that new index-level features can have a usable API, properly defaulted for the right types of fields.

Towards that, I'm exploring a replacement for Document/Field/FieldType. The idea is to expose simple methods on the document class (no more separate Field and FieldType classes):

    doc.addLargeText("body", "some text");
    doc.addShortText("title", "a title");
    doc.addAtom("id", "29jafnn");
    doc.addBinary("bytes", new byte[7]);
    doc.addNumber("number", 17);

And then expose a separate FieldTypes class, which you pass to the constructor of the new document class and which lets you set all the various per-field settings (stored, doc values, etc.). E.g.:

    types.enableStored("id");

FieldTypes is a write-once schema, and it throws exceptions if you try to make invalid changes once a given setting is already written (e.g. enabling norms after having disabled them). It will (I haven't implemented this yet) save its state into IndexWriter's commitData, so it's available when you open a new IndexWriter for append and when you open a reader.
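To make the write-once behavior concrete, here is a minimal sketch. The class `WriteOnceSchemaSketch` and its methods are hypothetical names chosen for illustration, not the FieldTypes class on the branch, and it models only the stored and norms settings. Whether norms may still be disabled after being enabled is an assumption here; the text above only names the reverse change as invalid.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a write-once schema; not the FieldTypes class itself. */
class WriteOnceSchemaSketch {

  /** Per-field settings; a null norms value means "not decided yet". */
  private static final class Settings {
    boolean stored;
    Boolean norms;
  }

  private final Map<String, Settings> fields = new HashMap<>();

  private Settings settingsFor(String field) {
    return fields.computeIfAbsent(field, name -> new Settings());
  }

  /** Assumption: storing a field may be turned on at any time. */
  void enableStored(String field) {
    settingsFor(field).stored = true;
  }

  boolean isStored(String field) {
    Settings s = fields.get(field);
    return s != null && s.stored;
  }

  /** Assumption: disabling norms is always allowed, even after enabling them. */
  void disableNorms(String field) {
    settingsFor(field).norms = Boolean.FALSE;
  }

  /** Rejects the invalid change named above: enabling norms after they were disabled. */
  void enableNorms(String field) {
    Settings s = settingsFor(field);
    if (Boolean.FALSE.equals(s.norms)) {
      throw new IllegalStateException(
          "field \"" + field + "\": cannot enable norms after they were disabled");
    }
    s.norms = Boolean.TRUE;
  }
}
```

With this sketch, calling `disableNorms("body")` followed by `enableNorms("body")` throws an `IllegalStateException`, while `enableStored("id")` simply records the setting.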

It has methods to set all the per-field settings (analyzer, stored, term vectors, norms, index options, doc values type), and chooses "reasonable" defaults based on the value's type when it suddenly sees a new field. For example, when you add a number, it's indexed for range querying and sorting (numeric doc values) by default.
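A rough sketch of choosing defaults the first time a field is seen might look like the following. All names (`DefaultsSketch`, `Kind`, `Defaults`, `seen`) are hypothetical. Only the defaults for numbers come from the description above; the defaults for the other kinds are placeholders, and rejecting a field that later shows up with a different kind is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: pick per-field defaults the first time a field is seen. */
class DefaultsSketch {

  /** Illustrative value kinds mirroring the proposed addXXX methods. */
  enum Kind { LARGE_TEXT, SHORT_TEXT, ATOM, BINARY, NUMBER }

  /** Illustrative subset of the per-field settings. */
  static final class Defaults {
    final Kind kind;
    final boolean rangeQueries;
    final boolean sortable;

    Defaults(Kind kind, boolean rangeQueries, boolean sortable) {
      this.kind = kind;
      this.rangeQueries = rangeQueries;
      this.sortable = sortable;
    }
  }

  private final Map<String, Defaults> fields = new LinkedHashMap<>();

  /** Returns the settings for a field, choosing defaults on first sight. */
  Defaults seen(String field, Kind kind) {
    Defaults existing = fields.get(field);
    if (existing == null) {
      // Numbers default to range querying and sorting, as described above.
      // The values for every other kind are placeholders, not real defaults.
      boolean isNumber = kind == Kind.NUMBER;
      existing = new Defaults(kind, isNumber, isNumber);
      fields.put(field, existing);
    } else if (existing.kind != kind) {
      // Assumption: a field keeps the kind it was first seen with.
      throw new IllegalStateException(
          "field \"" + field + "\" was first seen as " + existing.kind + ", not " + kind);
    }
    return existing;
  }
}
```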

FieldTypes provides the analyzer and codec (a little messy) that you pass to IndexWriterConfig. Since it's effectively a persistent schema, it knows all about the available fields at search time, so we could use it to create queries (checking if they are valid given that field's type). Query parsers and highlighters could consult it. Default UIs (above Lucene) could use it, etc. This is all in the future... I think for this issue the goal should be to "just" provide a "better" index-time API but not yet make use of it at search time.

So with this change, for auto-prefix terms, we could add an "enable range queries/filters" option, but then validate that the selected postings format supports such an option.
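One way such a validation could work is sketched below. This is only an illustration: `RangeOptionSketch` is a hypothetical class, the postings format is reduced to a single capability flag rather than a real PostingsFormat, and treating an unset format as supporting the option is an assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: validate a per-field option against the chosen postings format. */
class RangeOptionSketch {

  // Stand-in for the per-field postings format: only whether it supports auto-prefix terms.
  private final Map<String, Boolean> formatSupportsAutoPrefix = new HashMap<>();
  private final Set<String> rangeEnabled = new HashSet<>();

  /** Records the capability of the postings format selected for a field. */
  void setPostingsFormat(String field, boolean supportsAutoPrefix) {
    if (!supportsAutoPrefix && rangeEnabled.contains(field)) {
      throw new IllegalStateException(
          "field \"" + field + "\": range queries are enabled but this format cannot support them");
    }
    formatSupportsAutoPrefix.put(field, supportsAutoPrefix);
  }

  /** Enables range queries/filters unless the field's format is known not to support them. */
  void enableRangeQueries(String field) {
    // Assumption: a field with no explicit format uses a default that supports the option.
    if (Boolean.FALSE.equals(formatSupportsAutoPrefix.get(field))) {
      throw new IllegalStateException(
          "field \"" + field + "\": selected postings format does not support range queries");
    }
    rangeEnabled.add(field);
  }

  boolean rangeQueriesEnabled(String field) {
    return rangeEnabled.contains(field);
  }
}
```

The check runs in both directions, so the order in which the format and the option are set does not matter.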

I know this exploration will be horribly controversial, but realistically I don't think Lucene can move on much further if we can't finally address this schema problem head on.

This is long overdue.

Migrated from LUCENE-6005 by Michael McCandless (@mikemccand), 1 vote, updated May 09 2016

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Trunk (6.0) only fix version ...

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Is there some reason why you would serialize this in the commit? FieldInfos is a much better place imo.

asfimport commented 9 years ago

Ryan Ernst (@rjernst) (migrated from JIRA)

+1 overall. This is sorely needed. I also think we should "level the playing field" for trunk: start by getting trunk back to the same state as 5x with the document api, so that if this is ready in time for 5.0, it can be much more easily backported.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1633312 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1633312

LUCENE-6005: make branch

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1633314 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1633314

LUCENE-6005: work in progress

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I committed the current work-in-progress to a new branch (https://svn.apache.org/repos/asf/lucene/dev/branches/lucene6005).

I added a new FieldTypes class (holds the optional write-once schema) and Document2 (to replace Document eventually).

Net/net I think the approach can work well: it's a minimally intrusive API to optionally build up the write-once schema. You can skip the API entirely and it will "learn" your schema by seeing which Java types you are adding to your documents and setting sensible defaults accordingly. It's quite a bit simpler than the current oal.document API: no more separate XXXField nor FieldType classes.

Indexed binary tokens work, via Document2.addAtom(...) (#7051).

You can turn on/off sorting for a field, and this "translates" to the appropriate DV type; I want to improve this by letting you specify the default sort order, and also [eventually] specify collator. I plan to similarly enable highlighting.
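As an illustration of what "translating" sorting into a doc values type could mean, here is a small sketch. The mapping is an assumption, not necessarily what the branch does; `SortDocValuesSketch`, `Kind`, and `DvType` are hypothetical, with the `DvType` constant names mirroring Lucene's DocValuesType enum.

```java
/** Hypothetical sketch of translating "sortable" into a doc values type. */
class SortDocValuesSketch {

  /** Illustrative value kinds; only two are modeled here. */
  enum Kind { ATOM, NUMBER }

  /** Constant names mirror Lucene's DocValuesType enum. */
  enum DvType { NUMERIC, SORTED, SORTED_NUMERIC, SORTED_SET }

  /** Picks a doc values type for a sortable field (assumed mapping). */
  static DvType docValuesForSort(Kind kind, boolean multiValued) {
    switch (kind) {
      case NUMBER:
        return multiValued ? DvType.SORTED_NUMERIC : DvType.NUMERIC;
      case ATOM:
        return multiValued ? DvType.SORTED_SET : DvType.SORTED;
      default:
        throw new IllegalArgumentException("no sort mapping for " + kind);
    }
  }
}
```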

I also added search-time APIs, e.g. newSort, newTermQuery, newRangeQuery. These methods throw clear exceptions if the field name is unknown, or it wasn't indexed with a type that "matches" that method.
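The kind of checking these search-time methods do can be sketched as follows. `SearchSchemaSketch` is a hypothetical class; it returns a plain string instead of a Lucene Query to stay self-contained, and its `register` method and `Kind` enum are stand-ins for whatever the schema actually records.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of schema-checked query creation; returns strings, not Lucene queries. */
class SearchSchemaSketch {

  /** Illustrative value kinds. */
  enum Kind { TEXT, ATOM, NUMBER }

  private final Map<String, Kind> kinds = new HashMap<>();

  /** Stand-in for the schema learning a field at index time. */
  void register(String field, Kind kind) {
    kinds.put(field, kind);
  }

  /** Looks up a field, failing with a clear message if it is unknown. */
  private Kind require(String field) {
    Kind kind = kinds.get(field);
    if (kind == null) {
      throw new IllegalArgumentException("unknown field \"" + field + "\"");
    }
    return kind;
  }

  String newTermQuery(String field, String token) {
    require(field);
    return field + ":" + token;
  }

  String newRangeQuery(String field, long min, long max) {
    Kind kind = require(field);
    if (kind != Kind.NUMBER) {
      throw new IllegalArgumentException(
          "field \"" + field + "\" was indexed as " + kind + "; range queries need NUMBER");
    }
    return field + ":[" + min + " TO " + max + "]";
  }
}
```

The point of the design is that a mistyped field name or a mismatched type fails loudly when the query is built, rather than silently matching nothing.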

There are still many issues and nocommits:

* Analyzer is passed to FieldTypes now; I would like to remove it from IndexWriterConfig. To do this, I think I need to push multi-valued field handling out of IndexWriter up into "user space"... I already removed IndexableFieldType.tokenized as a first step.
* Analyzers can't be serialized, so the app will have to re-initialize them on startup (like they must do anyway today with PFAW). Same for Similarity.
* You can only set per-field DVF and PF.
* I only cutover a couple tests, but they lose randomness since FieldTypes provides the default IWC, vs LTC.newIWC().
* I had to suck in a fork of KeywordTokenizer.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1633597 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1633597

LUCENE-6005: add default sort order; don't use polymorphism with native types; add pos/offset gap; add highlighting; break out query and index analyzer

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1634820 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1634820

LUCENE-6005: checkpoint current state

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1634823 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1634823

LUCENE-6005: merge trunk

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1635000 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1635000

LUCENE-6005: fix test failures

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1635002 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1635002

LUCENE-6005: cutover to auto-prefix

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1635898 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1635898

LUCENE-6005: StoredDocument -> Document2

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1635908 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1635908

LUCENE-6005: merge trunk

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1635912 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1635912

LUCENE-6005: add sort missing first/last

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1636293 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1636293

LUCENE-6005: add Date, InetAddress types; add min/maxTokenLength; add maxTokenCount; use ValueType.NONE not null; each FieldType now stores the Lucene version it was created by

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1636528 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1636528

LUCENE-6005: fix sneaky auto-prefix bug, cutover more tests

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1637540 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1637540

LUCENE-6005: add UNIQUE_ATOM type (for primary key fields), which IW and CheckIndex enforce; add IW.getReaderManager(); add exists filter support (enabled by default); cutover some more tests / fix nocommits

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1637544 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1637544

LUCENE-6005: merge trunk

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1638066 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1638066

LUCENE-6005: cutover more tests

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1638204 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1638204

LUCENE-6005: merge trunk

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1640053 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1640053

LUCENE-6005: checkpoint current changes

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1640099 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1640099

LUCENE-6005: merge trunk

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1642110 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1642110

LUCENE-6005: checkpoint

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1642229 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1642229

LUCENE-6005: merge trunk

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1642230 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1642230

LUCENE-6005: merge trunk

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1642535 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1642535

LUCENE-6005: checkpoint

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1642537 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1642537

LUCENE-6005: merge trunk

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1643659 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1643659

LUCENE-6005: checkpoint

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1643662 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1643662

LUCENE-6005: merge trunk

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1649347 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1649347

LUCENE-6005: merge trunk

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1656281 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1656281

LUCENE-6005: checkpoint

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1658277 from @mikemccand in branch 'dev/branches/lucene6005' https://svn.apache.org/r1658277

LUCENE-6005: merge trunk

Explore alternative to Document/Field/FieldType API [LUCENE-6005] #7067