asfimport opened 11 years ago
Michael McCandless (@mikemccand) (migrated from JIRA)
Initial patch showing the approach. This patch also includes write-once attr bindings (#3524). There are some big changes here:
All the changes from #3524: no global attribute bindings; instead each stage owns/controls what the next stage can see.
Instead of PosInc/LenAtt, there is ArcAttribute, which has from and to nodes (ints). Tokens are arcs, and are free to have arbitrary to/from.
An "adapter" StageAnalyzer that takes the write-once Stage and creates an analyzer, converting the ArcAttribute into PosIncAtt at the end. So, positions are only "assigned" in the final adapter stage (StageToTokenStream).
A SynonymFilterStage that fixes the above two issues (and is also quite a bit simpler).
A SplitOnDashStage that shows how a decompounder works.
Holes are done with a new DeletedAttribute, i.e. the token still runs through the entire chain, but it's marked as deleted so that stages along the way know to ignore it. E.g. this would make it possible for a tokenizer to produce punctuation tokens that are skipped for indexing but prevent a SynonymFilter from matching "over" the punctuation.
There is some added tracking of nodes that are not "done yet", necessary to allow incremental consumption of the graph by all stages. It adds some hair to graph stages but I don't see how to simplify it while keeping incrementality...
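The arcs-instead-of-positions idea above can be sketched roughly like this. This is an illustrative toy only; the Arc class, node numbering, and the sample graph are invented here, not the patch's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a token modeled as an arc between two
// integer graph nodes, as described above. Not the patch's real API.
public class ArcSketch {
    static class Arc {
        final String term;
        final int from; // node the token leaves
        final int to;   // node the token arrives at
        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    // "wifi network" where "wifi" also expands to "wireless fidelity":
    // the one-token and two-token alternatives share their endpoint
    // nodes, so the synonym is a true alternate path in the graph.
    static List<Arc> sampleGraph() {
        List<Arc> g = new ArrayList<>();
        g.add(new Arc("wifi", 0, 2));
        g.add(new Arc("wireless", 0, 1));
        g.add(new Arc("fidelity", 1, 2));
        g.add(new Arc("network", 2, 3));
        return g;
    }

    public static void main(String[] args) {
        for (Arc a : sampleGraph()) {
            System.out.println(a.term + ": " + a.from + " -> " + a.to);
        }
    }
}
```

Because a token is just an arc, a decompounder or synonym stage can add alternate paths without renumbering anything downstream.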
One nice side effect of this change is it's no longer possible to create a first token with position=-1, since the mapping of node id -> position is done for you.
Also, the graph is intact throughout the chain, until the very end where it is "cast" to a sausage (what the indexer requires), vs today where SynonymFilter does its own sausagizing.
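One way the final "cast" might work is sketched below: only the adapter at the end maps node ids to positions. All names here are hypothetical, and the sketch assumes node ids increase in graph order, which a real implementation would have to guarantee or handle:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative sketch of assigning positions only at the very end:
// each distinct "from" node gets a dense position, and a token's
// position is the position of its from node. Assumes node ids grow
// in graph order; not the patch's actual implementation.
public class SausageSketch {
    static class Arc {
        final String term; final int from; final int to;
        Arc(String term, int from, int to) { this.term = term; this.from = from; this.to = to; }
    }

    static List<Arc> sample() {
        List<Arc> g = new ArrayList<>();
        g.add(new Arc("wifi", 0, 2));
        g.add(new Arc("wireless", 0, 1));
        g.add(new Arc("fidelity", 1, 2));
        g.add(new Arc("network", 2, 3));
        return g;
    }

    static Map<String, Integer> positions(List<Arc> arcs) {
        TreeSet<Integer> fromNodes = new TreeSet<>();
        for (Arc a : arcs) fromNodes.add(a.from);
        Map<Integer, Integer> nodeToPos = new LinkedHashMap<>();
        int pos = 0;
        for (int node : fromNodes) nodeToPos.put(node, pos++);
        Map<String, Integer> tokenPos = new LinkedHashMap<>();
        for (Arc a : arcs) tokenPos.put(a.term, nodeToPos.get(a.from));
        return tokenPos;
    }

    public static void main(String[] args) {
        // wifi and wireless both leave node 0, so they share position
        // 0; since positions are derived, none can come out as -1.
        System.out.println(positions(sample()));
    }
}
```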
While the patch is a just a prototype and there's still tons to do (long, long ways from committing, very much "exploratory"), I think it's far enough along that it shows the promise of both write-once attr bindings and an easier API for graph-based analysis components. Tricky cases that don't work with TokenStream today, e.g. a decompounder followed by a syn filter, do work in the patch.
Jack Krupansky (migrated from JIRA)
Will this Jira include some test code that query parsers can use so that they can retrieve the graph for a stream containing multiple multi-term synonyms, so that they can then individually sausage the term sequences as well as generate "OR" operators for strings of sausages?
Michael McCandless (@mikemccand) (migrated from JIRA)
Will this Jira include some test code that query parsers can use so that they can retrieve the graph for a stream containing multiple multi-term synonyms, so that they can then individually sausage the term sequences as well as generate "OR" operators for strings of sausages?
I think so ... the test case (TestStages) sort of does that already: it turns the tokens into an automaton, to check that the automaton accepts the specified sequence of tokens (and ONLY those sequences of tokens).
But, I think ideally we'd have a WordAutomatonQuery, which could take the Automaton directly and match documents using that, instead of enumerating all phrases an OR'ing them (which would be equivalent but presumably slower...). I would do WordAutomatonQuery on a separate issue ...
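Enumerating every phrase a token graph accepts, as the TestStages-style check or a naive query parser might, can be sketched like this. The class and sample graph are invented for illustration; a real WordAutomatonQuery would match the automaton directly rather than expanding paths:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: enumerate every term sequence a token graph
// accepts by walking all paths from the start node to the end node.
public class PathsSketch {
    static class Arc {
        final String term; final int from; final int to;
        Arc(String term, int from, int to) { this.term = term; this.from = from; this.to = to; }
    }

    static List<Arc> sample() {
        return Arrays.asList(
            new Arc("wifi", 0, 2), new Arc("wireless", 0, 1),
            new Arc("fidelity", 1, 2), new Arc("network", 2, 3));
    }

    static List<List<String>> paths(List<Arc> arcs, int start, int end) {
        List<List<String>> out = new ArrayList<>();
        walk(arcs, start, end, new ArrayList<String>(), out);
        return out;
    }

    private static void walk(List<Arc> arcs, int node, int end,
                             List<String> prefix, List<List<String>> out) {
        if (node == end) {
            out.add(new ArrayList<>(prefix));
            return;
        }
        for (Arc a : arcs) {
            if (a.from == node) {
                prefix.add(a.term);
                walk(arcs, a.to, end, prefix, out);
                prefix.remove(prefix.size() - 1);
            }
        }
    }

    public static void main(String[] args) {
        // Prints both accepted phrases for the wifi/wireless-fidelity
        // sample graph: [wifi, network] and [wireless, fidelity, network].
        System.out.println(paths(sample(), 0, 3));
    }
}
```

Expanding paths like this is exponential in the worst case, which is exactly why matching the automaton directly would presumably be faster.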
Adrien Grand (@jpountz) (migrated from JIRA)
This looks very promising! I've been looking at a few TokenFilters recently and anything that would make working with graphs easier is very welcome! Maybe we should create a branch to make it easier to collaborate and to track incremental updates?
Michael McCandless (@mikemccand) (migrated from JIRA)
Good idea Adrien! I'll cut a branch and commit the patch ...
Commit Tag Bot (migrated from JIRA)
[lucene5012 commit] mikemccand http://svn.apache.org/viewvc?view=revision&revision=1484977
LUCENE-5012: create branch
Commit Tag Bot (migrated from JIRA)
[lucene5012 commit] mikemccand http://svn.apache.org/viewvc?view=revision&revision=1484980
LUCENE-5012: initial prototype
Michael McCandless (@mikemccand) (migrated from JIRA)
OK I committed the initial patch to this branch: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5012
Jack Krupansky (migrated from JIRA)
WordAutomatonQuery
Sounds quite promising.
Back to the query parsers... So, they would present a term or quoted string - and eventually hopefully a sequence of terms if the query parser sees that there is only white space between them (an issue Robert filed long ago) - and invoke analysis. Then what? Sometimes a single term or a clean sausage string comes out and a TermQuery or simple BooleanQuery or PhraseQuery needs to be generated, but if synonym-like filtering has generated a graph, then the query parser would hand "it" directly to WordAutomatonQuery, if I understand correctly. Then the question becomes how to tell that a WordAutomatonQuery graph is needed - unless WordAutomatonQuery automatically detects the cases for TermQuery and BooleanQuery/PhraseQuery as internal optimizations. (Well, I don't expect that WordAutomatonQuery would know how to do BooleanQuery vs. PhraseQuery, unless it has a "phrase" flag.)
In short, it would be nice if this issue directly or at least partially produced enough logic for that Term vs. Boolean vs. Phrase vs. WordAutomaton Query generation. Either to actually generate the final query, or at least some example code that documents the design pattern that a query parser needs for consumption of a "query phrase" graph.
In other words, the query parsers should not simply do a "next" for the entire output of query term analysis. A new design pattern is needed.
Also, at index time, the output of analysis is consumed as a single sausage stream, using "next" and token position increment, but any multiple multi-word synonyms would traditionally get somewhat mangled. There may not be a clean solution for the current index term posting format, but at a minimum we should reconsider how the output of index-time term analysis is consumed and flag potential improvements for the future for posting of multiple multi-term phrases at the same token position.
In any case, thanks for moving the multiple multi-term synonym ball forward!
Artem Lukanin (migrated from JIRA)
Great! I was asking several people about it at Lucene/Solr Revolution 2013. Just don't forget that if you set your default operator to AND, you should still use ORs for synonyms. I was trying to solve this problem partly in SOLR-4533. I'm not sure how the combination of ORs and ANDs is done in an FSA if you are going to use WordAutomaton right away.
Commit Tag Bot (migrated from JIRA)
[lucene5012 commit] mikemccand http://svn.apache.org/viewvc?view=revision&revision=1486483
LUCENE-5012: add CharFilter, fix some bugs with SynFilter, add new InsertDeletedPunctuationStage
Michael McCandless (@mikemccand) (migrated from JIRA)
I committed some changes:
Got charFilter working
Fixed a few bugs in SynFilterStage not clearing its state on end/reset
Created a new fun stage: InsertDeletedPunctuationStage. This stage detects when a tokenizer has skipped over punctuation chars and inserts a deleted token representing the punctuation, e.g. to prevent a synonym or phrase query from matching over the punctuation. I had previously thought we would need to modify Tokenizers to do this but now I think maybe this Stage could do it for any Tokenizer ...
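The gap-detection idea behind such a stage can be sketched as follows. Everything here (class names, the deleted flag's representation) is invented for illustration, not the committed code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the InsertDeletedPunctuationStage idea:
// compare token offsets against the original text, and when the gap
// between two tokens contains non-whitespace characters, insert a
// token for them that is marked as deleted.
public class PunctGapSketch {
    static class Token {
        final String term; final int start; final int end; final boolean deleted;
        Token(String term, int start, int end, boolean deleted) {
            this.term = term; this.start = start; this.end = end; this.deleted = deleted;
        }
    }

    static List<Token> insertPunctuation(String text, List<Token> in) {
        List<Token> out = new ArrayList<>();
        int last = 0;
        for (Token t : in) {
            String gap = text.substring(last, t.start).trim();
            if (!gap.isEmpty()) {
                // The tokenizer skipped these chars; keep them as a
                // deleted token so later stages can see the break.
                out.add(new Token(gap, last, t.start, true));
            }
            out.add(t);
            last = t.end;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "game, over";
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("game", 0, 4, false));
        tokens.add(new Token("over", 6, 10, false));
        for (Token t : insertPunctuation(text, tokens)) {
            System.out.println(t.term + (t.deleted ? " (deleted)" : ""));
        }
    }
}
```

A downstream synonym stage that refuses to match across any token, live or deleted, would then not treat "game over" as adjacent here, which is the behavior described above.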
Artem Lukanin (migrated from JIRA)
I guess WordDelimiterFilter is a good candidate for transforming into a graph-based filter (see LUCENE-5051).
ASF subversion and git services (migrated from JIRA)
Commit 1597118 from @mikemccand in branch 'dev/branches/lucene5012' https://svn.apache.org/r1597118
LUCENE-5012: merge trunk, but some tests are failing
ASF subversion and git services (migrated from JIRA)
Commit 1597427 from @mikemccand in branch 'dev/branches/lucene5012' https://svn.apache.org/r1597427
LUCENE-5012: get tests passing again
ASF subversion and git services (migrated from JIRA)
Commit 1694511 from @mikemccand in branch 'dev/branches/lucene5012' https://svn.apache.org/r1694511
LUCENE-5012: don't separate interface from impl for attributes
David Smiley (@dsmiley) (migrated from JIRA)
Seems very promising. Is #3524 a dependency of this issue? There's no dependency JIRA issue link, but the first comment suggests it is.
Matt Weber (@mattweber) (migrated from JIRA)
@mikemccand So I was looking into supporting incoming graphs in SynonymGraphFilter and found this when you mentioned it in #8689. What do you think the state of this patch is? Would it be best to look into advancing this instead of just SynonymGraphFilter itself?
Michael McCandless (@mikemccand) (migrated from JIRA)
I think we really should advance this: the synonym filter here already handles incoming graphs correctly, and the token stream API improvements here make it much easier to consume and create graph token streams. I think this will only get more important with time, e.g. the new WordDelimiterGraphFilter now creates correct graphs, but if you want to run the current SynonymGraphFilter after it, it won't work.
That said, there is still a lot of work to make this committable. I think we need to find a way to make the stage-based analysis components interchangeable with the current API to give us the freedom to gradually cutover the many tokenizers and tokenfilters Lucene has today.
Matt Weber (@mattweber) (migrated from JIRA)
Patch for current master. testSynAfterDecompoundStageAnalyzer and testSynStageAnalyzer randomly fail, so I will need to dig into this when I find some more time.
@mikemccand Can you take a look and make sure I didn't miss anything from the original? The attached patch wasn't up to date since you were working from a branch. Here is what I ran to get the latest:
svn diff --ignore-properties --old https://svn.apache.org/repos/asf/lucene/dev/trunk@1597052 --new https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5012@1694511 > LUCENE-5012.patch
Michael McCandless (@mikemccand) (migrated from JIRA)
Wow, thanks for modernizing the patch @mattweber; I'll push this to branch on my github account for easier iterating...
Michael McCandless (@mikemccand) (migrated from JIRA)
@mattweber, I realized I had more private changes that I never pushed to that old branch, so I recovered them, fixed to apply to current master, and pushed here: https://github.com/mikemccand/lucene-solr/commits/graph_token_filters
I also removed the controversial InsertDeletedPunctuationStage.
Some tests are still failing ... I'll try to fix them.
I think the ideas here are very promising. The write-once attributes (LUCENE-2450, folded into this branch) is cleaner than what Lucene has today, and the ease of making new positions without having to re-number previous ones makes graph token streams much easier.
I tried to add the equivalent of CharFilter here, by using a new TextAttribute that stages before tokenization can use to read from a Reader or a String, and remap; I like that this makes offset correction more local than what correctOffset exposes today. And it means char filtering is simply another stage, not a separate class.
I also added int[] parts to OffsetAttribute; the idea here is to empower token filters (not just tokenizers) to properly correct offsets, so that e.g. WDGF could work "correctly", but I'm not sure it's worth the hassle: I haven't fully implemented it, and doing so is surprisingly tricky.
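The offset problem a splitting filter faces can be sketched like this. The class and method names are invented; this only illustrates the kind of per-part bookkeeping the int[] parts idea would enable, under the simplifying assumption that no char filter has remapped the text:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a filter that splits a token can only emit
// correct offsets if it tracks where each part began inside the
// original token. Names here are invented for illustration.
public class SplitOffsetsSketch {
    static class Token {
        final String term; final int start; final int end;
        Token(String term, int start, int end) { this.term = term; this.start = start; this.end = end; }
    }

    static List<Token> splitOnDash(Token t) {
        List<Token> out = new ArrayList<>();
        int off = 0; // offset of the current part within the token
        for (String part : t.term.split("-")) {
            out.add(new Token(part, t.start + off, t.start + off + part.length()));
            off += part.length() + 1; // +1 skips the dash
        }
        return out;
    }

    public static void main(String[] args) {
        // "wi-fi" starting at offset 10 of the source text:
        for (Token p : splitOnDash(new Token("wi-fi", 10, 15))) {
            System.out.println(p.term + " [" + p.start + "," + p.end + ")");
        }
    }
}
```

The tricky part alluded to above is that once char filters have remapped text, these per-part offsets must themselves be corrected, which is where the simple arithmetic breaks down.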
Matt Weber (@mattweber) (migrated from JIRA)
Thanks @mikemccand, there were a lot of additional changes! I am going to start getting familiar with this and hopefully will be able to help move it forward as I get time.
Ryan Pedela (migrated from JIRA)
I just have a question. I am using the word delimiter graph set to both split and concat followed by a shingle filter set to bigrams. I am getting what appears to be incorrect results. Is that expected and is the aim of this issue to fix that?
Michael McCandless (@mikemccand) (migrated from JIRA)
This issue should make it easier to fix the bug you're seeing, but we can also fix the bug (in ShingleFilter, I'm guessing?) before doing this more ambitious change. It sounds like ShingleFilter is not looking at PositionLengthAttribute?
SynonymFilter has two limitations today:
It cannot create positions, so e.g. dns -> domain name service creates blatantly wrong highlights (SOLR-3390, #5565 and others).
It cannot consume a graph, so e.g. if you try to apply synonyms after Kuromoji tokenizer I'm not sure what will happen.
I've thought about how to fix these issues but it's really quite difficult with the current PosInc/PosLen graph representation, so I'd like to explore an alternative approach.
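Part of why the current representation is hard to edit can be seen in a small sketch of how positions are recovered from increments today; the Tok class below is invented, but the running-sum rule is how TokenStream positions work:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: in the current TokenStream model a token's
// position is never stored directly; it is the running sum of the
// position increments before it. Creating a genuinely new position
// mid-stream therefore shifts every later token, which is part of
// what makes graph edits with PosInc/PosLen awkward.
public class PosIncSketch {
    static class Tok {
        final String term; final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    static int[] positions(List<Tok> stream) {
        int[] pos = new int[stream.size()];
        int p = -1; // position before the first token
        for (int i = 0; i < stream.size(); i++) {
            p += stream.get(i).posInc;
            pos[i] = p;
        }
        return pos;
    }

    public static void main(String[] args) {
        // "wifi network" with "wireless" stacked on "wifi" (posInc 0):
        List<Tok> stream = Arrays.asList(
            new Tok("wifi", 1), new Tok("wireless", 0), new Tok("network", 1));
        System.out.println(Arrays.toString(positions(stream)));
        // Any token inserted with a nonzero increment before "network"
        // would push it (and everything after it) to a later position.
    }
}
```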
Migrated from LUCENE-5012 by Michael McCandless (@mikemccand), 5 votes, updated Dec 17 2017 Attachments: LUCENE-5012.patch (versions: 2)