apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Fully decouple IndexWriter from analyzers [LUCENE-2309] #3385

Closed by asfimport 13 years ago

asfimport commented 14 years ago

IndexWriter only needs an AttributeSource to do indexing.

Yet today it interacts with Field instances, holds a private analyzer, invokes analyzer.reusableTokenStream, and has to deal with a wide variety of cases (the field is not analyzed; it is analyzed but the value is a Reader or a String; it is pre-analyzed).

I'd like to have IW only interact with attr sources that already arrived with the fields. This would be a powerful decoupling – it means others are free to make their own attr sources.

They need not even use any of Lucene's analysis impls; eg they can integrate to other things like OpenPipeline. Or make something completely custom.

#3378 is already a big step towards this: it makes IW agnostic about which attr is "the term", and only requires that it provide a BytesRef (for flex).

Then I think #3384 would get us most of the remaining way – ie, if the FieldType knows the analyzer to use, then we could simply create a getAttrSource() method (say) on it and move all the logic IW has today onto there. (We'd still need existing IW code for back-compat).


Migrated from LUCENE-2309 by Michael McCandless (@mikemccand), 1 vote, resolved Sep 23 2011 Attachments: LUCENE-2309.patch, LUCENE-2309-analyzer-based.patch, LUCENE-2309-getTSFromField.patch

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I think Robert has stated here that he's comfortable continuing to use TokenStream as the API for IW to get the terms it indexes. Is that what others feel too? I agree the inverted API I proposed is a little convoluted and I'm sure we can come up with a simple Consumable-like abstraction (which Robert also suggested above). But if people are content with TokenStream then there's no need.

I feel the same. The API of TokenStream is so stupid-simple, why replace it with another push-like API that is neither simpler nor more complicated, just different? I see no reason for this. IW should simply request a TokenStream from the field and consume it.

Likewise, for multi-valued fields, IW shouldn't "see" the separate values; it should just receive a single token stream, and under the hood (in Document/Field impl) it's concatenating separate token streams, adding posIncr/offset gaps, etc. This too is now hardwired in indexer but shouldn't be. Maybe an app wants to insert custom "separator" tokens between the values...

I agree with that, too. There is one problem with this: concatenating TokenStreams is not easy to do, as they have different attribute instances, so IW, having fetched all attributes at the start, would somehow have to switch to different attributes in the middle of the TS.

To implement this fast (without wrapping and copying), we would need some notification telling the consumer of a TokenStream to "request" the attribute instances again, but this is a "bad" idea. For me the only simple solution to this problem is to make the Field return an iterator of TokenStreams; IW consumes them one after another, calling addAttribute before each separate instance.

About the posIncr gap: the field can change the final offsets/posIncr in end() before handing over to a new TokenStream. IW would only consume TokenStreams one by one.
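As a rough illustration of the "iterator of TokenStreams" idea, here is a minimal sketch. The types `SimpleTokenStream`, `ListTokenStream` and `StreamConsumer` are made-up stand-ins for this example, not Lucene's TokenStream/AttributeSource API. The consumer re-reads each stream's state when it starts on that stream, and the gap reported at the end of one value is applied before the next value begins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical pull-style stream; a simplified stand-in, NOT Lucene's TokenStream.
interface SimpleTokenStream {
    boolean incrementToken();  // advance to the next token; false when exhausted
    String term();             // current term (stands in for a term attribute)
    int positionIncrement();   // distance from the previous token
    int endPositionGap();      // extra gap reported after the last token (the "end()" step)
}

// Trivial stream over a fixed list of terms, used only to drive the sketch.
final class ListTokenStream implements SimpleTokenStream {
    private final Iterator<String> terms;
    private final int gap;
    private String current;

    ListTokenStream(int gap, String... terms) {
        this.terms = Arrays.asList(terms).iterator();
        this.gap = gap;
    }

    public boolean incrementToken() {
        if (!terms.hasNext()) {
            return false;
        }
        current = terms.next();
        return true;
    }

    public String term() { return current; }
    public int positionIncrement() { return 1; }
    public int endPositionGap() { return gap; }
}

final class StreamConsumer {
    // Consumes the streams one by one and returns "term@position" entries.
    static List<String> consume(Iterator<SimpleTokenStream> streams) {
        List<String> postings = new ArrayList<>();
        int position = -1;
        while (streams.hasNext()) {
            // A new stream: its state is looked up afresh from here on.
            SimpleTokenStream ts = streams.next();
            while (ts.incrementToken()) {
                position += ts.positionIncrement();
                postings.add(ts.term() + "@" + position);
            }
            // Gap chosen by the field before handing over to the next stream.
            position += ts.endPositionGap();
        }
        return postings;
    }
}
```

With two values, the first ending with a gap of 10, the tokens of the second value start 10 positions further on than they otherwise would.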

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

I feel the same. The API of TokenStream is so stupid-simple, why replace it with another push-like API that is neither simpler nor more complicated, just different? I see no reason for this. IW should simply request a TokenStream from the field and consume it.

But you do favour a pull-like API as an alternative?

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

But you do favour a pull-like API as an alternative?

TokenStream is pull and I do favour this one.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Err yes sorry you're right.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I agree with that, too. There is one problem with this: concatenating TokenStreams is not easy to do, as they have different attribute instances, so IW, having fetched all attributes at the start, would somehow have to switch to different attributes in the middle of the TS.

I don't think the attributes should be allowed to change here. This is why I already said above that we should enforce reusability. Then there is no problem.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Getting back on this after the Analyzer work.

The new patch is far more traditional and adds tokenStream(Analyzer) to IndexableField. This replaces tokenStreamValue(). Consumers wishing to index a field now call tokenStream(Analyzer), which is responsible for creating the appropriate TokenStream for the field.
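To make the shape of that change concrete, here is a small sketch under stated assumptions. `SketchAnalyzer`, `SketchIndexableField`, `SketchTextField` and `SketchIndexer` are hypothetical, simplified types invented for this example, and a plain `Iterator<String>` stands in for a TokenStream. The point is only that the field decides how its tokens are produced, while the indexer just asks for them and consumes them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for an analyzer; NOT the real Lucene Analyzer.
interface SketchAnalyzer {
    Iterator<String> analyze(String fieldName, String text);
}

// Simplified stand-in for IndexableField with a tokenStream(Analyzer)-style method.
interface SketchIndexableField {
    String name();
    Iterator<String> tokenStream(SketchAnalyzer analyzer);
}

final class SketchTextField implements SketchIndexableField {
    private final String name;
    private final String value;
    private final boolean tokenized;

    SketchTextField(String name, String value, boolean tokenized) {
        this.name = name;
        this.value = value;
        this.tokenized = tokenized;
    }

    public String name() { return name; }

    public Iterator<String> tokenStream(SketchAnalyzer analyzer) {
        if (!tokenized) {
            // Not analyzed: the whole value is indexed as a single token.
            return Collections.singletonList(value).iterator();
        }
        // Analyzed: delegate to the analyzer that was passed in.
        return analyzer.analyze(name, value);
    }
}

final class SketchIndexer {
    // The indexer no longer branches on the kind of field; it only consumes tokens.
    static List<String> invert(SketchIndexableField field, SketchAnalyzer analyzer) {
        List<String> terms = new ArrayList<>();
        Iterator<String> tokens = field.tokenStream(analyzer);
        while (tokens.hasNext()) {
            terms.add(tokens.next());
        }
        return terms;
    }
}
```

In this sketch the branching on "tokenized or not" lives in the field, which is the kind of logic the patch moves out of the indexer.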

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Looks much more straightforward now. I like this implementation.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Looks much more straightforward now. I like this implementation.

+1 looks good though. much simpler too!

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This looks great! I love all the -'d code from DocInverterPerField ;)

In Field.java do we already check that if the field is not tokenized then it has a non-null stringValue()?

I would like for IW to not have to pass through the Analyzer here (ie FieldType should know the Analyzer for that field), but let's save that for another issue/time.

Likewise, multi-valued field should ideally be "under the hood" from IW's standpoint, ie we should have a MultiValuedField and you append to a List inside it, and then IW gets a single TokenStream from that, which does its own concatenating of the separate TokenStreams, but we should tackle that under a separate issue.
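A sketch of what such a field could look like, purely as an illustration: `MultiValuedFieldSketch` and `GapToken` are hypothetical names, the analyzer is modelled as a plain function from a value to its terms, and a list of tokens stands in for the single TokenStream. The first token of each later value carries the position gap, so the consumer never sees the separate values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical token carrying a term and its position increment.
final class GapToken {
    final String term;
    final int positionIncrement;

    GapToken(String term, int positionIncrement) {
        this.term = term;
        this.positionIncrement = positionIncrement;
    }
}

// Hypothetical multi-valued field; NOT an existing Lucene class.
final class MultiValuedFieldSketch {
    private final List<String> values = new ArrayList<>();
    private final int positionGap;

    MultiValuedFieldSketch(int positionGap) {
        this.positionGap = positionGap;
    }

    MultiValuedFieldSketch add(String value) {
        values.add(value);
        return this;
    }

    // Returns one token sequence for all values, concatenated "under the hood".
    List<GapToken> tokenStream(Function<String, List<String>> analyzer) {
        List<GapToken> out = new ArrayList<>();
        for (String value : values) {
            boolean firstToken = true;
            for (String term : analyzer.apply(value)) {
                // The first token after an earlier value absorbs the gap.
                int inc = (firstToken && !out.isEmpty()) ? 1 + positionGap : 1;
                out.add(new GapToken(term, inc));
                firstToken = false;
            }
        }
        return out;
    }
}
```

An app that wants custom separator tokens between values could emit them at the same point where this sketch adds the gap.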

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

In Field.java do we already check that if the field is not tokenized then it has a non-null stringValue()?

I don't think we do. It's always been implied (which could cause a bug). I'll add the appropriate checks, but we really need to revisit the constructors of Field at some stage.

I would like for IW to not have to pass through the Analyzer here (ie FieldType should know the Analyzer for that field), but let's save that for another issue/time.

I totally agree. Theoretically FieldType could have Analyzer added to it now and it could make use of it. But removing the Analyzer from IW seems controversial, alas :)

Likewise, multi-valued field should ideally be "under the hood" from IW's standpoint, ie we should have a MultiValuedField and you append to a List inside it, and then IW gets a single TokenStream from that, which does its own concatenating of the separate TokenStreams, but we should tackle that under a separate issue.

It's nearly possible. We're almost there on the reusable Analyzers. This can already begin actually for non-tokenized fields and for NumericFields.

I'll add the non-null stringValue() checks and then commit.
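The kind of check meant here might look roughly like the following. This is only a sketch: `FieldValueCheck` and `checkIndexable` are hypothetical names, and where the committed patch actually places the check is not shown in this thread.

```java
// Hypothetical helper illustrating the check; NOT the code from the committed patch.
final class FieldValueCheck {
    // A field that is not tokenized is indexed as its string value,
    // so that value must be present.
    static void checkIndexable(String name, boolean tokenized, String stringValue) {
        if (!tokenized && stringValue == null) {
            throw new IllegalArgumentException(
                "field \"" + name + "\" is not tokenized, so it needs a non-null string value");
        }
    }
}
```

Failing fast like this turns the previously implied requirement into an explicit error instead of a later, harder-to-trace failure during indexing.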

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Committed revision 1174506.

I think this issue is wrapped and we can spin the other improvements off?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Yeah I think we are done here! Nice work.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Change has been committed. We'll spin the multiValued fields work off as a separate issue.