apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

add sentence boundary charfilter [LUCENE-2498] #3572

Open asfimport opened 14 years ago

asfimport commented 14 years ago

From the discussion of #3243:

It would be nice to have a CharFilter to mark sentence boundaries; such functionality would be useful in several ways.

For sentence boundary detection we could use JFlex's support for the Unicode Sentence_Break property, etc., and the UAX#29 definition as a default grammar.

One idea is to just mark the boundaries with a user-provided String.

As a simple use-case, a user could then add this string to a StopFilter; removing it would introduce a position increment, which would inhibit phrase queries from matching across sentence boundaries.

Alternatively, a user could use the sentence markers to do more advanced processing downstream.
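
As a rough illustration of the chain this would enable (a sketch only, written against the modern Analyzer API: the `_sb_` marker string is an assumption, and the regex-based CharFilter below is a crude stand-in for the proposed UAX#29-based one):

```java
import java.io.Reader;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class SentenceBoundaryAnalyzer extends Analyzer {
  // Assumed marker string; StandardTokenizer keeps it as a single token.
  private static final String MARKER = "_sb_";
  // Crude stand-in for the proposed UAX#29-based CharFilter: insert the
  // marker after sentence-final punctuation.
  private static final Pattern EOS = Pattern.compile("(?<=[.!?])\\s+");

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    return new PatternReplaceCharFilter(EOS, " " + MARKER + " ", reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    // Removing the marker with StopFilter leaves a position-increment hole,
    // which keeps phrase queries from matching across the boundary.
    TokenStream sink = new StopFilter(source, new CharArraySet(List.of(MARKER), false));
    return new TokenStreamComponents(source, sink);
  }
}
```

The point is the wiring, not the regex: mark boundaries as text in the CharFilter, tokenize with any tokenizer the user picks, then drop the marker to leave a positional gap.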


Migrated from LUCENE-2498 by Robert Muir (@rmuir), updated May 16 2011

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I wonder if it would be possible/make sense to make this a tokenizer instead of a charfilter: one token per sentence. Then token production would be in a filter stage.
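
A rough sketch of that one-token-per-sentence idea, using the JDK's java.text.BreakIterator in place of the JFlex/UAX#29 grammar discussed above (the class name is hypothetical, and a real implementation would stream rather than buffer the whole input):

```java
import java.io.IOException;
import java.text.BreakIterator;
import java.util.Locale;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/** Emits one token per sentence; word-level token production is left to a later filter stage. */
public final class SentenceTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private BreakIterator sentences; // lazily initialized after each reset()
  private String text;
  private int start;

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    if (sentences == null) {
      // Toy approach: buffer the whole input up front. A real implementation
      // would stream, as a JFlex-generated scanner does.
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[1024];
      for (int n; (n = input.read(buf)) != -1; ) {
        sb.append(buf, 0, n);
      }
      text = sb.toString();
      sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
      sentences.setText(text);
      start = sentences.first();
    }
    int end = sentences.next();
    if (end == BreakIterator.DONE) {
      return false;
    }
    termAtt.append(text, start, end);
    offsetAtt.setOffset(correctOffset(start), correctOffset(end));
    start = end;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    sentences = null; // re-read from the (new) input on the next call
  }
}
```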

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

FWIW, I've implemented sentence boundary support by returning a special token (Type.EOF) from the Tokenizer and creating an EOSFilter which increments the posIncr attribute (setting it to 100). I haven't been following the UAX#29 issue, but I'm sure it's a great thing :).

I've used the following list to detect EOS, plus \u0085 (Next Line, NEL): http://www.fileformat.info/info/unicode/category/Po/list.htm. I'm sure, though, that there are other markers as well.
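
A sketch of that approach (not Shai's actual code; the "EOS" type name and the size of the gap are assumptions), as such a filter might look:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/** Drops special end-of-sentence tokens and widens the position gap instead. */
public final class EOSFilter extends TokenFilter {
  private static final String EOS_TYPE = "EOS"; // assumed type set by the tokenizer
  private static final int SENTENCE_GAP = 100;

  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);

  public EOSFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    int pendingGap = 0;
    while (input.incrementToken()) {
      if (EOS_TYPE.equals(typeAtt.type())) {
        pendingGap += SENTENCE_GAP; // remember the sentence break...
      } else {
        // ...and apply it to the next real token, inhibiting phrase matches.
        posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + pendingGap);
        return true;
      }
    }
    return false;
  }
}
```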

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> I wonder if it would be possible/make sense to make this a tokenizer instead of a charfilter: one token per sentence. Then token production would be in a filter stage.

Well, maybe we should reword the issue; it doesn't have to be a charfilter, or even use a special string to mark the sentence boundaries. But I thought that as a charfilter it would let you use your own tokenizer, such as StandardTokenizer, along with sentence boundaries.

> FWIW, I've implemented sentence boundary support by returning a special token (Type.EOF) from the Tokenizer and creating an EOSFilter which increments the posIncr attribute (setting it to 100).

This sounds similar to what we are proposing here... did you integrate this into your tokenizer yourself?

> I've used the following list to detect EOS, plus \u0085 (Next Line, NEL): http://www.fileformat.info/info/unicode/category/Po/list.htm. I'm sure, though, that there are other markers as well.

Well, there is nothing wrong with this approach. The advantage of using the Unicode segmentation standard for sentences is that it can give better handling of corner cases, since it has a grammar.

Some examples, quoted directly from the spec (http://unicode.org/reports/tr29/#Sentence_Boundaries):

> Rules SB6-8 are designed to forbid breaks within strings such as
>
> c.d
> 3.4
> U.S.
> ... the resp. leaders are ...
> ... etc.)' '(the ...
>
> They permit breaks in strings such as
>
> She said "See spot run." John shook his head. ...
> ... etc. 它们指...
> ...理数字. 它们指...
>
> They cannot detect cases such as "...Mr. Jones..."; more sophisticated tailoring would be required to detect such cases.
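
The default rules are easy to try against these cases with ICU4J's sentence BreakIterator (a quick standalone demo, not part of any patch here; the sample text merely adapts the spec's examples):

```java
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

public class SentenceBreakDemo {
  public static void main(String[] args) {
    // Per rules SB6-8: no break inside "3.4" or "U.S.",
    // but a break is permitted after the quoted sentence.
    String text = "She said \"See spot run.\" John shook his head. It cost $3.4 in the U.S. today.";
    BreakIterator bi = BreakIterator.getSentenceInstance(ULocale.ROOT);
    bi.setText(text);
    int start = bi.first();
    for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
      System.out.println("[" + text.substring(start, end) + "]");
    }
  }
}
```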

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I also have an abbreviations filter which, when it encounters an EOS token, checks its table for the previous word; if there's a match, it does not consider this a true EOS token. So cases like "Mr." are covered.

There are false negatives too, though, when the abbreviation does end a sentence, but you've got to make some trade-offs...

Like I said, I'm sure the standard defines things better than I did...
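
A sketch of such a filter, with hypothetical names (AbbreviationFilter, the "EOS" type from the earlier sketch) and a toy abbreviations table:

```java
import java.io.IOException;
import java.util.Locale;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/** Demotes an EOS token to an ordinary one when the preceding word is a known abbreviation. */
public final class AbbreviationFilter extends TokenFilter {
  private static final String EOS_TYPE = "EOS";
  // Toy table; a real one would be user-supplied.
  private final Set<String> abbreviations = Set.of("mr", "dr", "prof", "etc");

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
  private String previousWord = "";

  public AbbreviationFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (EOS_TYPE.equals(typeAtt.type())) {
      if (abbreviations.contains(previousWord)) {
        // "Mr." etc.: not a true sentence end, so drop the EOS marking.
        typeAtt.setType(TypeAttribute.DEFAULT_TYPE);
      }
    } else {
      previousWord = termAtt.toString().toLowerCase(Locale.ROOT);
    }
    return true;
  }
}
```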

asfimport commented 14 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

> I wonder if it would be possible/make sense to make this a tokenizer instead of a charfilter: one token per sentence. Then token production would be in a filter stage.

> Well, maybe we should reword the issue; it doesn't have to be a charfilter, or even use a special string to mark the sentence boundaries. But I thought that as a charfilter it would let you use your own tokenizer, such as StandardTokenizer, along with sentence boundaries.

Using "special" tokens to carry token stream metadata feels like a fragile hack to me.

Maybe we need a new kind of analysis component that chunks input and then calls a tokenizer for each chunk. It could be called a segmenter. For example, a SentenceSegmenter would detect sentence boundaries, then send each sentence to the user's choice of tokenizer, fixing up offsets and maybe also adding an attribute for beginning/end of sentence. Analysis chains could use a segmenter+tokenizer in the same way that a tokenizer is used now. I think segmenters could be nested, too; e.g., a ParagraphSegmenter could take a SentenceSegmenter as its tokenizer.
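
A structural sketch of the segmenter idea (a purely hypothetical API; the attribute plumbing and offset fix-up described above are omitted for brevity):

```java
import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;

/**
 * Hypothetical segmenter: locate chunk boundaries (sentences, paragraphs, ...)
 * and feed each chunk to a wrapped tokenizer of the user's choice. Omitted
 * here: sharing the wrapped tokenizer's attributes, and shifting its offsets
 * by the chunk start so they stay correct relative to the whole input.
 */
public abstract class Segmenter extends Tokenizer {
  protected final Tokenizer wrapped; // user's choice, e.g. StandardTokenizer

  protected Segmenter(Tokenizer wrapped) {
    this.wrapped = wrapped;
  }

  /**
   * Advance to the next chunk and re-point the wrapped tokenizer at it
   * (setReader + reset); return false at end of input.
   */
  protected abstract boolean nextChunk() throws IOException;

  @Override
  public boolean incrementToken() throws IOException {
    // Drain the wrapped tokenizer for the current chunk; when it is
    // exhausted, move on to the next chunk. Nesting falls out naturally:
    // a ParagraphSegmenter could wrap a SentenceSegmenter, which in turn
    // wraps the word-level tokenizer.
    while (!wrapped.incrementToken()) {
      if (!nextChunk()) {
        return false;
      }
    }
    return true;
  }
}
```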