apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Fully decouple IndexWriter from analyzers [LUCENE-2309] #3385

Closed asfimport closed 13 years ago

asfimport commented 14 years ago

IndexWriter only needs an AttributeSource to do indexing.

Yet, today, it interacts with Field instances, holds a private analyzer, invokes analyzer.reusableTokenStream, and has to deal with a wide variety of cases (the field is not analyzed; it is analyzed but arrives as a Reader or String; it is pre-analyzed).

I'd like to have IW only interact with attr sources that already arrived with the fields. This would be a powerful decoupling – it means others are free to make their own attr sources.

They need not even use any of Lucene's analysis impls; e.g. they can integrate with other things like OpenPipeline. Or make something completely custom.

#3378 is already a big step towards this: it makes IW agnostic about which attr is "the term", and only requires that it provide a BytesRef (for flex).

Then I think #3384 would get us most of the remaining way – ie, if the FieldType knows the analyzer to use, then we could simply create a getAttrSource() method (say) on it and move all the logic IW has today onto there. (We'd still need existing IW code for back-compat).
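
To make the idea concrete, here is a loose sketch of that getAttrSource() idea (class and method shapes are hypothetical, not a patch; it assumes the FieldType carries the analyzer, as #3384 proposes):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.util.AttributeSource;

// Hypothetical FieldType that owns its analyzer; IndexWriter would only
// ever see the returned attribute source, never the analyzer itself.
class FieldTypeSketch {
  private final Analyzer analyzer;

  FieldTypeSketch(Analyzer analyzer) {
    this.analyzer = analyzer;
  }

  // Works because TokenStream extends AttributeSource.
  AttributeSource getAttrSource(String fieldName, Reader value) throws IOException {
    return analyzer.reusableTokenStream(fieldName, value);
  }
}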


Migrated from LUCENE-2309 by Michael McCandless (@mikemccand), 1 vote, resolved Sep 23 2011 Attachments: LUCENE-2309.patch, LUCENE-2309-analyzer-based.patch, LUCENE-2309-getTSFromField.patch

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

We can't use attr source directly – we'd need to factor out the minimal API from TokenStream (.incrToken & .end?) and use that (thanks Robert!).

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Would this mean that after that we can move all of core Analyzers to contrib/analyzers, making one step towards getting them completely out of Lucene and into their own Apache project?

That way, we can keep in core only the AttributeSource and accompanying classes, and really allow people to pass an AttributeSource which is not even an Analyzer (like you said). We can move the specific Analyzer tests to contrib/analyzers as well. The other tests in core, which don't care about analysis, can use a src/test specific AttributeSource, like TestAttributeSourceImpl ...

I'm thinking - it's ok for contrib to depend on core but not the other way around. It will, however, take out of core a useful feature that lets new users bootstrap quickly. That won't be the case when analyzers move out of Lucene entirely, but while they are in Lucene, we'll force everyone to download contrib/analyzers as well. So maybe we keep in core only Standard, or maybe even something simpler, again, for easy bootstrapping (like Whitespace + lowercase).

This is just a thought.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Would this mean that after that we can move all of core Analyzers to contrib/analyzers

Yes, though, I think that's orthogonal (can and should be separately done, anyway).

making one step towards getting them completely out of Lucene and into their own Apache project?

We may simply "standardize" on contrib/analyzers as the one place, instead of a new [sub-]project. To be discussed... but we really do need one place.

That way, we can keep in core only the AttributeSource and accompanying classes, and really allow people to pass an AttributeSource which is not even an Analyzer (like you said). We can move the specific Analyzer tests to contrib/analyzers as well. The other tests in core, which don't care about analysis, can use a src/test specific AttributeSource, like TestAttributeSourceImpl ...

Right.

I'm thinking - it's ok for contrib to depend on core but not the other way around.

I agree.

It will, however, take out of core a useful feature that lets new users bootstrap quickly.

Well.. I suspect with this change users would not typically use lucene-core alone. Ie, they'd get analyzers and queryparser (if we also move it out as its own module).

That won't be the case when analyzers move out of Lucene entirely, but while they are in Lucene, we'll force everyone to download contrib/analyzers as well.

I think a single source for all analyzers will be a great step forwards for users.

So maybe we keep in core only Standard, or maybe even something simpler, again, for easy bootstrapping (like Whitespace + lowercase).

Or remove them entirely (but, then, core tests will need to use contrib analyzers for their testing)...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Or remove them entirely (but, then, core tests will need to use contrib analyzers for their testing)...

I agree, let's not get caught up on how our tests run from build.xml! We should decouple analysis from IW as much as possible, at least to support more flexible analysis: e.g. someone who doesn't want to use the TokenStream concept at all.

I don't really have any opinion on where all the analyzers practically go, but I do agree it would be nice if they were in one place. For example, in contrib/analyzers we now have analyzers by language, and in most cases users should really be looking at EnglishAnalyzer as their "default" instead of StandardAnalyzer for English-language text, as it does Porter stemming, too.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Or remove them entirely (but, then, core tests will need to use contrib analyzers for their testing)...

For that I proposed to have a default TestAttributeSourceImpl, which does whitespace tokenization or something. If other 'core' tests need something else, we can write specific AttributeSources for them. I hope we can avoid introducing any dependency of core on contrib.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

For that I proposed to have a default TestAttributeSourceImpl

We need a bit more than AttributeSource: if the text has more than one token, it must at least support incrementToken()

We could try factoring out incrementToken() and end() from TokenStream to create a "more-generic" interface, but really, there isn't much more to TokenStream (except close and reset)

At the same time, while I really like the decorator API of TokenStream, it should be easier for someone to use a completely different API, perhaps one that feels less like you are writing a finite-state machine by hand (capture/restoreState, etc)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Or remove them entirely (but, then, core tests will need to use contrib analyzers for their testing)...

For that I proposed to have a default TestAttributeSourceImpl, which does whitespace tokenization or something. If other 'core' tests need something else, we can write specific AttributeSources for them. I hope we can avoid introducing any dependency of core on contrib.

Only tests would rely on the analyzers module. I think that's OK? core itself would have no dependence.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Today when I run "ant test-core", contrib is not built, and I like that. Also "ant test-backwards" will be affected, I think ... if core does not depend on contrib, its tests shouldn't either. It would be weird if they did.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

The IndexWriter, or rather DocInverterPerField, is simply an attribute consumer. Neither of them needs to know about Analyzer or TokenStream at all, nor needs the analyzer to iterate over tokens. The IndexWriter should instead implement an interface, or use a class, that is called for each successful "incrementToken()", no matter how this increment is implemented.

I could imagine a really simple interface like

interface AttributeConsumer {
  void setAttributeSource(AttributeSource src);
  void next();
  void end();
}

IW would then pass itself or an instance it uses (DocInverterPerField) to an API expecting such a consumer, like:

field.consume(this);

or something similar. That way we have no dependency on whatever Attribute producer is used. The default implementation is for sure based on an analyzer / tokenstream, and alternatives can be exposed via an expert API. Even backwards compatibility could be solved that way easily.
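
For illustration, a sketch of that default implementation, adapting a plain TokenStream to the consumer interface above (nothing here is committed API; the field class name is invented):

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical field impl that drives an AttributeConsumer from a
// TokenStream; a custom producer could do the same without any stream.
class TokenStreamField {
  private final TokenStream stream;

  TokenStreamField(TokenStream stream) {
    this.stream = stream;
  }

  void consume(AttributeConsumer consumer) throws IOException {
    consumer.setAttributeSource(stream); // one shared set of attribute instances
    while (stream.incrementToken()) {
      consumer.next();                   // push each token to the indexer
    }
    consumer.end();
  }
}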

Only tests would rely on the analyzers module. I think that's OK? core itself would have no dependence.

+1, test dependencies should not block modularization, it's just about configuring the classpath though!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The IndexWriter, or rather DocInverterPerField, is simply an attribute consumer. Neither of them needs to know about Analyzer or TokenStream at all, nor needs the analyzer to iterate over tokens.

[Carrying over discussions on IRC with Chris Male & Uwe...]

Actually, TokenStream is already AttrSource + incrementing, so it seems like the right start...

However, the .reset() method is redundant from the indexer's standpoint – ie when indexer calls Field.getTokenStream (say), whatever init'ing / reset'ing is needed should already have been done by that method by the time it returns the TokenStream.

Also, .close and .end are redundant – seems like we should only have .end (few token streams do anything in .close...). But coalescing those two would be a good chunk of work at this point :) Or maybe we make a .finish that simply does both by default ;)

Finally, indexer doesn't really need a Document; it just needs something abstract that provides an iterator over all fields that need indexing (and separately, storing).

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

[Carrying over discussions on IRC with Chris Male & Uwe...]

That makes it very hard to participate. I cannot afford to read through all the IRC stuff, and I don't get the chance to participate directly unless I watch IRC constantly. We should really move back to JIRA / devlist for such discussions. There is too much going on in IRC.

Actually, TokenStream is already AttrSource + incrementing, so it seems like the right start...

But that binds the Indexer to a tokenstream, which is unnecessary IMO. What if I want to implement something besides the TokenStream delegator API?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hello, I commented yesterday but did not receive much feedback, so I want to elaborate some more:

I suppose what I was trying to mention in my earlier comment here: https://issues.apache.org/jira/browse/LUCENE-2309?focusedCommentId=12844189&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12844189

is that while I really like the new TokenStream API, I would prefer it if we thought about making this flexible enough to support "different paradigms", including perhaps something that looks a lot like the old TokenStream API.

The reason is, I notice a lot of existing code still under this old API, and I think that in some cases, perhaps it's easier to work with, even if you aren't a new user. I definitely think for newer users the old API might have some advantages.

I think it's useful to consider supporting such an API, perhaps as an extension in contrib/analyzers, even if it's not as fast or flexible as the new API; perhaps the tradeoff of speed and flexibility would be worth the ease for newer users.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I could imagine a really simple interface like

During lunch an idea evolved:

If you look at the current DocInverter code, it does not use a consumer-like API. The code just has an add/accept method that accepts tokens. The idea is, as Simon proposed, to let the DocInverter implement something like AttributeAcceptor. But we must still have the attribute API, and the acceptor (DocInverter) must always "see" the same attribute instances (else much time would be spent calling getAttribute(...) for each token, if the accept method took an AttributeSource).

The current TokenStream API could get a method taking an AttributeAcceptor and simply do a while-incrementToken() loop, calling accept() on the DocInverter (the AttributeAcceptor). Another approach for users would be to not use the TokenStream API at all and simply call the accept() method for each token on the Acceptor.

But both approaches still have the problem with the shared attributes. If you want to "record" tokens you have to implement something like my proxy attributes. Else, as mentioned above, most time would be spent in getAttribute() calls.
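
A tiny sketch of the second approach (the AttributeAcceptor interface is invented here, following the description above; note how the same attribute instances are reused for every token, which is exactly the shared-attributes requirement):

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeSource;

// Hypothetical acceptor interface, per the description above.
interface AttributeAcceptor {
  void setAttributeSource(AttributeSource source);
  void accept();
}

// A producer that never builds a TokenStream: it fills the shared
// attributes and calls accept() once per token.
class WhitespacePusher {
  void push(String text, AttributeAcceptor acceptor) {
    AttributeSource source = new AttributeSource();
    CharTermAttribute term = source.addAttribute(CharTermAttribute.class);
    acceptor.setAttributeSource(source); // acceptor caches attribute instances once
    for (String token : text.split("\\s+")) {
      term.setEmpty().append(token);     // same instance, new value per token
      acceptor.accept();
    }
  }
}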

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The idea is, as Simon proposed, to let the DocInverter implement something like AttributeAcceptor.

This is interesting! It inverts the stack/control flow, but would continue to use shared attrs.

So then somehow the indexer would pass its AttrAcceptor to the field? And the field would have whatever control logic it wants to feed the tokens...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Actually, TokenStream is already AttrSource + incrementing, so it seems like the right start...

But that binds the Indexer to a tokenstream, which is unnecessary IMO. What if I want to implement something besides the TokenStream delegator API?

True, but we need at least some way to increment? AttrSource doesn't have that.

But I don't think we need reset or close from TokenStream.

Maybe we could factor out an abstract class / interface that TokenStream implements, minus the reset & close methods?

Then people could freely use Lucene to index off a foreign analysis chain...
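
Something like this, maybe (the name is invented here; it is just TokenStream minus reset() and close()):

import java.io.IOException;
import org.apache.lucene.util.AttributeSource;

// Hypothetical minimal contract: attributes plus iteration, nothing else.
abstract class TokenIterator extends AttributeSource {
  abstract boolean incrementToken() throws IOException;
  void end() throws IOException {
    // no-op by default, like TokenStream.end()
  }
}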

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

We should really move back to JIRA / devlist for such discussions

+1 !! I also find it very hard to track so many sources of discussion (JIRA, java-dev, java-user, general, and now IRC). Also IRC is not logged/archived and searchable (I think?) which makes it impossible to trace back a discussion, and/or randomly stumble upon it in Google.

I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field so the Field will call back to IW, seem too complicated to me. Users that write Analyzers/TokenStreams/AttributeSources should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like real unnecessary overhead and complexity.

And having the Field control the flow of indexing also seems dangerous ... it might expose Lucene to lots of bugs by users. Today, when IW controls it, there's one place to look; but tomorrow, when the Field controls it, where do we look? In the app's custom Field code? In IW's atts-consuming methods?

Will the Field also control how stored fields are added? Or only AttributeSourced ones?

Maybe I need to get used to this change, but currently it looks wrong to reverse the control flow. Maybe in principle the DocInverter now accepts tokens from IW, and so it looks as if we can pass it to the Field (as IW's AttAcceptor), but still the concept is different. We (IW) control the indexing flow, and not the user.

I also may not understand what this will give users. Shouldn't users get enough flexibility w/ the current API and the Flex (once out) stuff? Do they really need to be bothered w/ pushing tokens to IW?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field so the Field will call back to IW, seem too complicated to me. Users that write Analyzers/TokenStreams/AttributeSources should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like real unnecessary overhead and complexity.

The idea was not to change this behaviour, but to also give the user the possibility to reverse it. For some tokenstreams that would simplify things a lot. The current IndexWriter code works exactly like that:

  1. DocInverter gets TokenStream
  2. DocInverter calls reset() – to be left out and moved to field/analyzer
  3. DocInverter does while-loop on incrementToken - it iterates. On each Token it calls add() on the field consumer
  4. DocInverter calls end() and updates end offset
  5. DocInverter calls close() – to be left out and moved to field/analyzer

The change is simply that step (3) is removed from DocInverter, which only provides the add() method for accepting tokens. The while loop is simply done in the current TokenStream/Field code, so nobody needs to change their code. But somebody who actively wants to push tokens can now do so; currently, they have no chance without heavy buffering.

So the push API will be very expert, and the current TokenStream is just one user of this API.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Also IRC is not logged/archived and searchable (I think?) which makes it impossible to trace back a discussion, and/or randomly stumble upon it in Google.

Apache's rule is: if it didn't happen on the lists, it didn't happen. IRC is a great way for people to communicate and hash stuff out, but it's not necessary that you follow it. If you have questions or want further elaboration, just ask. No one can expect you to follow IRC, nor is it a valid reference for where something was decided. IRC is great - I think it has really benefited us having devs discuss there - but the official position is: if it didn't happen on the list, it didn't actually happen.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Then people could freely use Lucene to index off a foreign analysis chain...

That is what I was talking about!

I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field so the Field will call back to IW, seem too complicated to me. Users that write Analyzers/TokenStreams/AttributeSources should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like real unnecessary overhead and complexity.

We can surely hide this implementation completely from Field. I consider this being similar to Collector, where you pass it explicitly to the search method if you want a different behavior. Maybe something like an AttributeProducer. I don't think adding this to Field makes a lot of sense at all, and it is not worth the complexity.

Will the Field also control how stored fields are added? Or only AttributeSourced ones?

IMO this is only about inverted fields.

We (IW) control the indexing flow, and not the user.

The user only gets the possibility to exchange the analysis chain, but not the control flow. The user can already mess around with stuff in incrementToken(); the only thing we change / invert is that the indexer does not know about TokenStreams anymore. It does not change the control flow though.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

There is one problem that cannot be easily solved (for all proposals here), if we want to provide an old-style API that does not require reuse of tokens: the problem with AttributeProvider is that if we want to support something (like rmuir proposed before) that looks like the old "Token next()", we need an AttributeProvider that passes the AttributeSource to the indexer on each token! And that would lead to lots of getAttribute() calls, which would slow down indexing! So with the current APIs we cannot get around the requirement to reuse the same Attribute instances during the whole indexing without a major speed impact. This can only be solved with my nice BCEL proxy Attributes, so you can exchange the inner attribute impl. Or do it like TokenWrapper in 2.9 (yes, we can reactivate that API somehow as an easy-use addendum).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

So with the current APIs we cannot get around the requirement to reuse the same Attribute instances during the whole indexing without a major speed impact.

I agree. I guess I'll try to simplify my concern: maybe we don't necessarily need something that looks like the old TokenStream API, but I feel it would be worth our time to think about supporting 'some alternative API' that makes it easier to work with lots of context across different tokens.

I personally do not mind how this is done with the capture/restoreState API, but I feel that it's pretty unnatural for many developers, and in the future folks might want to do more complex analysis (maybe even light pos-tagging, etc) that requires said context, and we should plan for this.

I feel this wasn't such an issue with the old TokenStream API, but maybe there is another way to address this potential problem.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

I'd like to pick this issue up and run with it. Anyone have any new thoughts?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Great!

This will overlap w/ the field type work (we have a branch for this now), where we have already decoupled the indexer from concrete Field/Document impls by adding a minimal IndexableField.

I think this issue should further that, ie pare back IndexableField so that there's only a getTokenStream for indexing (ie indexer will no longer try for String then Reader then tokenStream), and Analyzer must move to the FieldType and not be passed to IndexWriterConfig. Multi-valued fields will be tricky, since IW now asks analyzer for the gaps...
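
A rough sketch of what such a pared-back IndexableField could look like (the shape is hypothetical; the real interface on the branch carries more, e.g. storage details):

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical: the only thing the indexer needs for inverting is a
// ready-to-consume token stream; String vs Reader vs pre-analyzed
// values stay the field's own business.
interface IndexableFieldSketch {
  String name();
  TokenStream tokenStream() throws IOException;
}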

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Once Analyzer moves onto FieldTypes, I think we can deprecate/remove PerFieldAnalyzerWrapper.

I'm not sure what to do about multi-valued fields; we may have to move the getPosIncr/OffsetGap onto IndexableField, since it's not easy to ask a TokenStream to do this shifting for us?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm, with this cutover comes a new spooky degree of freedom: the ability to specify a different analyzer for each value of a multi-valued field.

Maybe we need an explicit FieldType that is "multi-valued", and so the document would have only one instance of Field for that field name, and within that Field are multiple values? Problem is, this would lose a degree of freedom we have today, ie the ability for different values of the same field name to have wildly different types...

asfimport commented 13 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

<<<Hmm, with this cutover comes a new spooky degree of freedom: the ability to specify a different analyzer for each value of a multi-valued field.>>>

Call me a coward, but this scares me! Answering the question "why didn't I get the results I was expecting" would become...er...somewhat more difficult (I'm studying the understatement thing). Although I guess this wouldn't be accessible from Solr, so maybe it would be another one of those "expert" features?

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Would there be any valid use for that "feature" that couldn't be accomplished in some more straightforward manner? It would be like a union, maybe: a field that is sometimes fish and other times fowl, or just a menagerie? But I can always create a group of fields of various types (in an application) that work together?

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Note, this patch is against the FieldType branch and is a very, very, VERY POC patch.

Very POC!

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

haven't addressed the OffSetGap issue

I actually think these gaps are what we should address first. Here's a rough idea:

To do this, I think there could be problems if the analyzer does not reuse, as there should be one set of attributes going to the indexer across the multi-valued field.

so first to solve this problem: I think first we should remove Analyzer.tokenStream so all analyzers are reusable, and push ReusableAnalyzerBase's API down into Analyzer. We want to do this improvement anyway to solve that trap.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

While I digest your suggestion on handling the gap values,

so first to solve this problem: I think first we should remove Analyzer.tokenStream so all analyzers are reusable, and push ReusableAnalyzerBase's API down into Analyzer. We want to do this improvement anyway to solve that trap.

Beyond the heavy lifting of doing this (which I'm fine with doing), do you know off hand whether this is going to be a problem for any of the Analyzers/Tokenizer/TokenFilter impls we have? I seem to recall an issue where you looked into this.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yes, so my idea is basically that Analyzer gets ReusableAnalyzerBase's API completely, so its reusableTokenStream() is final.

In order for this to work, we need to first fix the limitation of ReusableAnalyzerBase:

 * ReusableAnalyzerBase is a simplification of Analyzer that supports easy reuse
 * for the most common use-cases. Analyzers such as
 * {@link PerFieldAnalyzerWrapper} that behave differently depending upon the
 * field name need to subclass Analyzer directly instead.

this looks easy to do: the current implementation always puts a TokenStreamComponents into the 'previousTokenStream'. Instead, ReusableAnalyzerBase can have an optional ctor with something like ReuseStrategy, of which there are two:

then the problem is solved, and we could name this thing Analyzer, make all analyzers extend it directly, and remove the non-reusable tokenStream(); it would only have a final reusableTokenStream() for the consumers.
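
The two strategies are not spelled out in the migrated text, but given the PerFieldAnalyzerWrapper limitation quoted above they are presumably a global one (one cached TokenStreamComponents, today's behavior) and a per-field one. A sketch, with invented names:

import java.util.HashMap;
import java.util.Map;

// Invented names; components are kept as Object here for brevity
// (in reality they would be TokenStreamComponents).
abstract class ReuseStrategy {
  abstract Object getReusableComponents(String fieldName);
  abstract void setReusableComponents(String fieldName, Object components);
}

class GlobalReuseStrategy extends ReuseStrategy {
  private Object components; // one cached instance regardless of field
  Object getReusableComponents(String fieldName) { return components; }
  void setReusableComponents(String fieldName, Object components) { this.components = components; }
}

class PerFieldReuseStrategy extends ReuseStrategy {
  private final Map<String, Object> byField = new HashMap<String, Object>();
  Object getReusableComponents(String fieldName) { return byField.get(fieldName); }
  void setReusableComponents(String fieldName, Object components) { byField.put(fieldName, components); }
}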

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Fortunately, the patch I put up means PerFieldAnalyzerWrapper can be removed (since Analyzer becomes per FieldType). But I take your point and it will be necessary to implement what you describe to support SolrAnalyzer.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Fortunately, the patch I put up means PerFieldAnalyzerWrapper can be removed (since Analyzer becomes per FieldType).

How does this work with QueryParser and other consumers of Analyzer though?

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Good point. I guess it will continue to have a place in those instances, yes.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm just asking the question because maybe I have a different idea of "fully decouple IndexWriter from Analyzer".

If we set up an API where it's different at index-time versus query-time, I'm not going to be happy, because this is setting up users to fail: in general I should be able to pass my 'analysis configuration' to all the various consumers, and if it's the same at index-time versus query-time, things should work.

I don't think we should expose a different per-field configuration at index-time versus query-time versus this or that; I think that's a step backwards.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

You make good points and I don't have answers for you at this stage.

What I'm exploring with this direction currently is how best to consume the terms of a Field while minimizing the exposure of Analyzer / TokenStream in the indexing process. What has felt natural to me is having Analyzer at the Field level. This is already kind of implemented anyway - if a Field returns a tokenStreamValue() then that's used for indexing, no matter what the 'analysis configuration' is.

Do you have any suggestions for other directions to follow?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

What I'm exploring with this direction currently is how best to consume the terms of a Field while minimizing the exposure of Analyzer / TokenStream in the indexing process.

I thought about this a lot, I'm not sure we need to minimize TokenStream, I think it might be fine! There is really nothing much more minimal than this, it's just attributesource + incrementToken() + reset() + end() + close()...

This issue is a little out of date, I actually think we have been decoupling indexwriter from analysis even more... for example the indexwriter just pulls bytes from the tokenstream, etc.

I think we have been going in the right direction and should just be factoring out the things that don't belong in the indexer or in analyzer (like this gap manipulation)

What has felt natural to me is having Analyzer at the Field level. This is already kind of implemented anyway - if a Field returns a tokenStreamValue() then that's used for indexing, no matter what the 'analysis configuration' is.

Yeah but that's currently an expert option, not the normal case. In general if we are saying IndexWriter doesn't take Analyzer but has some schema that is a list of Fields, then QueryParser etc need to take this list of fields also, so that queryparsing is consistent with analysis. But I'm not sure I like this either: I think it only couples QueryParser with IndexWriter.

Do you have any suggestions for other directions to follow?

I don't think we should try to do a huge mega-change right now given that there are various "problems" we should fix... some of these, I know, are my pet peeves, but:

I've looked at doing e.g. the first of these and I know it's a huge mondo pain in the ass; these "smaller" changes are really hard and a ton of work.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I like the idea here, that is: "hey analyzer, give me your tokens!". For lots of use-cases this is much easier to implement than TokenStreams. It's just a change from pulling tokens to pushing them. The impl in TokenStream is just consuming itself and pushing the tokens. From the abstraction point of view that's much easier to understand.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

And that might be a good way, my point is if this is how we want to go, then Analyzer should instead provide AttributeConsumer, and the queryparsers etc should consume it with the same API (and tokenstream is an implementation detail behind the scenes).

But I don't think queryparser should take PerFieldAnalyzerWrapper while IndexWriter takes per-Field analyzers; I think that's confusing.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I think Chris only started with the indexer as an example to show that it works. Of course we can rewrite all other consumers to use this new API. Also BaseTSTestCase :-)

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

well if that's the direction here, then we should describe the JIRA issue differently: something like "abstract away TokenStream API".

because it just looks to me as if IndexWriter works off a different analysis API than other analysis consumers and I don't like that.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

I just want to re-iterate the point that I'm just exploring options here. I'm really glad to be getting feedback but no direction has been set in stone.

I thought about this a lot, I'm not sure we need to minimize TokenStream, I think it might be fine! There is really nothing much more minimal than this, it's just attributesource + incrementToken() + reset() + end() + close()...

Okay interesting. I admit I don't really have any concrete thoughts on an 'alternative' to TokenStream, but wouldn't you agree that exposing a more limited API to the Indexer is beneficial here? I agree we're only talking about a handful of methods currently.

Yeah but that's currently an expert option, not the normal case. In general if we are saying IndexWriter doesn't take Analyzer but has some schema that is a list of Fields, then QueryParser etc need to take this list of fields also, so that queryparsing is consistent with analysis. But I'm not sure I like this either: I think it only couples QueryParser with IndexWriter.

Two things jump out at me here. Firstly, IndexWriter does take a list of Fields - the Fields it indexes in each Document. We've also identified that people might want to apply different analysis per field; that's why we have PerFieldAnalyzerWrapper, isn't it?

Secondly, we now have multiple QueryParsing implementations consolidated together. I hope over time we'll add more / different implementations which do things differently. So while I totally agree with the sentiment that we shouldn't make this confusing for users and that searching and indexing should work together, I'm certainly open to exploring other ways to do that.

I've looked at doing e.g. the first of these and I know it's a huge mondo pain in the ass; these "smaller" changes are really hard and a ton of work.

Independent of what's decided here, I'm definitely happy to do the heavy lifting on the improvements you've suggested.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

But I don't think queryparser should take PerFieldAnalyzerWrapper while IndexWriter takes per-Field analyzers; I think that's confusing.

Yes it is.

I think Chris only started with the indexer as an example to show that it works. Of course we can rewrite all other consumers to use this new API. Also BaseTSTestCase.

Absolutely.

well if that's the direction here, then we should describe the JIRA issue differently: something like "abstract away TokenStream API".

I don't think what I've implemented in the patch is so different from what has been discussed in this issue earlier. I did consider opening another issue, but I thought this JIRA issue captured the conceptual issue quite well.

because it just looks to me as if IndexWriter works off a different analysis API than other analysis consumers and I don't like that.

I'm happy to explore those other consumers and strive to provide a user-friendly API to limit bugs. But I'm not getting the impression you like the concept at all.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm happy to explore those other consumers and strive to provide a user-friendly API to limit bugs. But I'm not getting the impression you like the concept at all.

That's totally not it, but I do like having an "Analyzer" that is the central API for analysis that anything uses: IndexWriter, QueryParser, MoreLikeThis, Synonyms parsing, SpellChecking, or wherever we need it...

I think if we want this to take AttributesConsumer or whatever, then that's cool: Analyzer returns this instead of TokenStream and we fix all these consumers to consume the more general API.

I just want to make sure that all consumers, not just IndexWriter, use the consistent API. This way, like today, someone declares FooAnalyzer, uses it everywhere, and stuff is consistent everywhere.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

I think if we want this to take AttributesConsumer or whatever, then that's cool: Analyzer returns this instead of TokenStream and we fix all these consumers to consume the more general API.

I just don't see how this would work. As it is in the patch, AttributeConsumer is a callback mechanism where the consumer provides their logic. It's nothing to do with Analyzers really, and will be implemented differently depending on what the consumer wants to do in that instance.

I just want to make sure that all consumers, not just IndexWriter, use the consistent API. This way, like today, someone declares FooAnalyzer, uses it everywhere, and stuff is consistent everywhere.

Absolutely desirable. AttributeConsumer isn't changing the Analyzer concept, it's just changing how we consume from Analyzer. With that in mind, I very much agree with your assertion that this shouldn't change the Analyzer used in search and indexing. What's prompted that concern here is the shift to per-Field Analyzer. I'll reassess that change while waiting for other feedback.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I just don't see how this would work.

As long as it implements an interface that provides: void consume(AttributeConsumer attributeConsumer)

?

then, this could be what Analyzer returns, Consumable or whatever you want to call this :)

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

To come back to decoupling: With the new API, we no longer need NumericTokenStream, as NumericField can simply push the tokens to DocInverter. So TokenStream can move out of core, but NumericField & NumericRangeQuery can stay - nice!

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Okay, I thought about this overnight and have tried to come up with a middle ground. Again, very proof-of-concept.

The overall idea is that the Fields now control how terms are given to DocInverter.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think there are two rather separate ideas here?

First, IW should not have to "know" how to get a TokenStream from an IndexableField; it should only ask the Field for the token stream, get it back, and iterate its tokens.

Under the hood (in the IndexableField impl) is where the logic for tokenized or not, Reader vs String vs pre-created token stream, etc. should live, instead of being hardwired inside the indexer. Maybe an app has a fully custom way to make a token stream for the field...

Likewise, for multi-valued fields, IW shouldn't "see" the separate values; it should just receive a single token stream, and under the hood (in Document/Field impl) it's concatenating separate token streams, adding posIncr/offset gaps, etc. This too is now hardwired in indexer but shouldn't be. Maybe an app wants to insert custom "separator" tokens between the values...

(And I agree: as a pre-req we need to fix Analyzer to not allow non-reused token streams; else we can't concatenate w/o attr proxying/copying).
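
For illustration, a hand-rolled sketch of such a concatenation (it leans on that pre-req: every value's stream must share one set of attribute instances; offsets are ignored for brevity, and a real version would shift them by the offset gap):

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

final class ConcatenatingTokenStream extends TokenStream {
  private final TokenStream[] parts; // assumed to share one AttributeSource
  private final int posIncGap;       // gap to insert between values
  private final PositionIncrementAttribute posIncAtt;
  private int current = 0;
  private boolean pendingGap = false;

  ConcatenatingTokenStream(TokenStream[] parts, int posIncGap) {
    super(parts[0]); // reuse the shared attribute instances
    this.parts = parts;
    this.posIncGap = posIncGap;
    this.posIncAtt = addAttribute(PositionIncrementAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (current < parts.length) {
      if (parts[current].incrementToken()) {
        if (pendingGap) { // first token after a value boundary
          posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + posIncGap);
          pendingGap = false;
        }
        return true;
      }
      parts[current].end();
      current++;
      pendingGap = true;
    }
    return false;
  }
}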

If IW still receives analyzer and simply passes it through when asking for the tokenStream I think that's fine for now. In the future, I think IW should not receive analyzer (ie, it should be agnostic to how the app creates token streams); rather, each FieldType would hold the analyzer for that field. However, that sounds contentious, so let's leave it for another day.

Second, this new idea to "invert" TokenStream into an AttrConsumer, which I think is separate? I'm actually not sure I like such an approach... it seems more confusing for simple usage? Ie, if I want to analyze some text and iterate over the tokens... suddenly, instead of a few lines of local code, I have to make a class instance with a method that receives each token? It seems more convoluted? I mean, for Lucene's limited internal usage of token stream, this is fine, but for others who consume token streams... it seems more cumbersome.

Anyway, I think we should open a separate issue for "invert TokenStream into AttrConsumer"?

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

I've thought about this issue some more and I feel there's a middle ground to be had.

First, IW should not have to "know" how to get a TokenStream from an IndexableField; it should only ask the Field for the token stream, get it back, and iterate its tokens.

You're absolutely right and this should be our first step. It should be up to the Field to produce its terms, IW should just iterate through them.

Likewise, for multi-valued fields, IW shouldn't "see" the separate values; it should just receive a single token stream, and under the hood (in Document/Field impl) it's concatenating separate token streams, adding posIncr/offset gaps, etc. This too is now hardwired in indexer but shouldn't be. Maybe an app wants to insert custom "separator" tokens between the values...

I also totally agree. We should strive to remove as much hardwiring as we can and make it as flexible as possible. But again I see this as a step in the process.

Second, this new idea to "invert" TokenStream into an AttrConsumer, which I think is separate? I'm actually not sure I like such an approach... it seems more confusing for simple usage? Ie, if I want to analyze some text and iterate over the tokens... suddenly, instead of a few lines of local code, I have to make a class instance with a method that receives each token? It seems more convoluted? I mean, for Lucene's limited internal usage of token stream, this is fine, but for others who consume token streams... it seems more cumbersome.

I don't agree that this is separate. For me, the purpose of this issue is to fully decouple IndexWriter from analyzers :) As such, how IW consumes the terms it indexes is at the heart of the issue. The inversion approach is a suggestion for how we might tackle this in a flexible and extensible way, so I don't see any reason to push it to another issue. It's a way of fulfilling this issue.

I think there is also some confusion here. I'm not suggesting we change all usage of analysis. If someone wants to consume TokenStream as is, so be it. What I'm looking at changing here is how IW gets the terms it indexes, that's all. We've introduced abstractions like IndexableField to be flexible and extensible. I don't think there's anything wrong with examining the same thing with TokenStream here.

I think Robert has stated here that he's comfortable continuing to use TokenStream as the API for IW to get the terms it indexes; is that what others feel too? I agree the inverted API I proposed is a little convoluted, and I'm sure we can come up with a simple Consumable-like abstraction (which Robert did also suggest above). But if people are content with TokenStream then there's no need.