
When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole [LUCENE-3940] #5013

Closed: asfimport closed this issue 12 years ago

asfimport commented 12 years ago

I modified BaseTokenStreamTestCase to assert that the start/end offsets match for graph (posLen > 1) tokens, and this caught a bug in Kuromoji: when the decompounded form of a compound token contains a punctuation token, that token is silently dropped.

In this case we should leave hole(s) so that the graph stays intact, i.e., the graph should look the same as if the punctuation tokens had not been removed by the tokenizer but a StopFilter had then removed them.

This also affects tokens that have no compound over them, i.e., today we fail to leave a hole when we remove punctuation tokens.
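
(To make "leaving a hole" concrete: a hole is simply a position gap, i.e. the next kept token gets a position increment greater than 1. Below is a rough, hypothetical sketch of that StopFilter-like behavior, assuming a recent Lucene where FilteringTokenFilter preserves the position increments of dropped tokens; the class name and the simplified punctuation test are made up for illustration and are not part of the actual patch.)

```java
import java.io.IOException;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Drops tokens that consist only of punctuation, but leaves a hole (position gap),
 * exactly as a StopFilter would, so phrase queries cannot match across them.
 */
public final class PunctuationHoleFilter extends FilteringTokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PunctuationHoleFilter(TokenStream in) {
    super(in); // FilteringTokenFilter takes care of adjusting position increments
  }

  @Override
  protected boolean accept() throws IOException {
    // Keep the token unless every code point is punctuation.
    for (int i = 0; i < termAtt.length(); ) {
      int cp = Character.codePointAt(termAtt.buffer(), i, termAtt.length());
      if (!isPunctuation(cp)) {
        return true;
      }
      i += Character.charCount(cp);
    }
    return false;
  }

  // Simplified punctuation test; Kuromoji's own isPunctuation() covers more categories.
  private static boolean isPunctuation(int cp) {
    switch (Character.getType(cp)) {
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }
}
```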

I'm not sure this is serious enough to warrant fixing in 3.6 at the last minute...


Migrated from LUCENE-3940 by Michael McCandless (@mikemccand), resolved Apr 08 2012. Attachments: LUCENE-3940.patch (versions: 4)

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch attached; I think it's ready, except I need to make a simpler test case than that new curious string...

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, fixing a bug in the last one and adding a few more test cases. I also made the "print curious string on exception" output from BTSTC more ASCII-friendly.

I think it's ready.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think we should do this. StandardTokenizer doesn't leave holes when it drops punctuation; I think holes should only be left for real 'words', for the most part.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

StandardTokenizer doesn't leave holes when it drops punctuation

But is that really good?

This means a PhraseQuery will match across end-of-sentence (.), semicolon, colon, comma, etc. (English examples..).
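
(A hedged illustration of this claim, assuming a recent Lucene with the lucene-memory module available; the field name and sample text are invented.)

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.PhraseQuery;

public class CrossSentencePhrase {
  public static void main(String[] args) {
    MemoryIndex index = new MemoryIndex();

    // StandardAnalyzer (StandardTokenizer + LowerCaseFilter) silently drops the '.'
    // without leaving a hole, so "wishes" and "happy" end up at adjacent positions.
    index.addField("body", "Best wishes. Happy birthday.", new StandardAnalyzer());

    // A phrase straddling the sentence boundary still matches (score > 0).
    float score = index.search(new PhraseQuery("body", "wishes", "happy"));
    System.out.println("cross-sentence phrase score: " + score);
  }
}
```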

I think tokenizers should throw away as little information as possible... we can always filter out such tokens in a later stage?

For example, if a tokenizer created punctuation tokens (instead of silently discarding them), other token filters could make use of them in the meantime, e.g. a synonym rule for "u.s.a. -> usa" or maybe a dedicated English "acronyms" filter. We could then filter them out later, even without leaving holes, and have the same behavior that we have now?

Are there non-English examples where you would want a PhraseQuery to match over punctuation...? E.g., for Japanese, I assume we don't want PhraseQuery applying across periods/commas, like it will now? (Not sure about the middle dot...? Others...?)

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New test-only patch, breaking out the non-controversial (I think!) part of the patch.

With this new patch, Kuromoji still silently discards punctuation (just like StandardAnalyzer), but at least we get better test coverage in BTSTC to verify graph tokens are not messing up their offsets.

I had to turn it off when testing Kuromoji with punctuation removal... but it's still tested without punctuation removal, so I think it'd likely catch any bugs in how Kuromoji sets the offsets of compound tokens... at least it's better than not checking at all (i.e., today).

The only non-tests-only change is I uncommented an assert in Kuromoji; I think it's a valid assert.
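
(For concreteness, here is a standalone sketch of the kind of offset invariant being checked: every token starting at a given position should share that position's start offset, and every token ending at a given position, including graph tokens with posLen > 1, should share that position's end offset. This is not the actual BaseTokenStreamTestCase code; the class and method names are made up.)

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

/** Consumes a TokenStream and fails if offsets are inconsistent across the token graph. */
public final class GraphOffsetChecker {

  public static void check(TokenStream ts) throws IOException {
    PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
    PositionLengthAttribute posLenAtt = ts.addAttribute(PositionLengthAttribute.class);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);

    Map<Integer, Integer> startOffsetByPos = new HashMap<>();
    Map<Integer, Integer> endOffsetByPos = new HashMap<>();
    int pos = -1;

    ts.reset();
    while (ts.incrementToken()) {
      pos += posIncAtt.getPositionIncrement();
      int endPos = pos + posLenAtt.getPositionLength();

      // All tokens leaving this position must agree on the start offset...
      Integer knownStart = startOffsetByPos.putIfAbsent(pos, offsetAtt.startOffset());
      if (knownStart != null && knownStart != offsetAtt.startOffset()) {
        throw new AssertionError("inconsistent startOffset at position " + pos);
      }
      // ...and all tokens arriving at endPos (including posLen > 1 compound/graph
      // tokens) must agree on the end offset.
      Integer knownEnd = endOffsetByPos.putIfAbsent(endPos, offsetAtt.endOffset());
      if (knownEnd != null && knownEnd != offsetAtt.endOffset()) {
        throw new AssertionError("inconsistent endOffset at position " + endPos);
      }
    }
    ts.end();
    ts.close();
  }
}
```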

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think Kuromoji has the problem: here it creates 'fake' intermediate 'tokens' and then deletes them, and this somehow screws up the decompounding graph???

It should never create these tokens in the first place! I think it's well accepted that words carry the information content of a doc; punctuation really has no information content here, it doesn't tell me what the doc is about. I don't think this is controversial, I just think your view on this is extreme...

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Here's an example where we create a compound token with punctuation.

I got this from the Japanese Wikipedia export, with our MockCharFilter sometimes doubling characters: we are at a position that has the characters 〇〇'''、''' after it... that 〇 is this Unicode character: http://www.fileformat.info/info/unicode/char/3007/index.htm

When Kuromoji extends from this position, both 〇 and 〇〇 are KNOWN, but then we also extend by the unknown 〇〇'''、''' (i.e., 〇〇 plus only punctuation):

Note that 〇 is not considered punctuation by Kuromoji's isPunctuation method...

  + UNKNOWN word 〇〇'''、''' toPos=41 cost=21223 penalty=3400 toPos.idx=0
  + KNOWN word 〇〇 toPos=34 cost=9895 penalty=0 toPos.idx=0
  + KNOWN word 〇 toPos=33 cost=2766 penalty=0 toPos.idx=0
  + KNOWN word 〇 toPos=33 cost=5256 penalty=0 toPos.idx=1

And then on backtrace we make a compound token (UNKNOWN) for all of "〇〇'''、'''", while the decompounded path keeps two separate "〇" tokens but drops the '''、''' since it's all punctuation, thus creating inconsistent offsets.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I think it's well accepted that words carry the information content of a doc; punctuation really has no information content here, it doesn't tell me what the doc is about. I don't think this is controversial, I just think your view on this is extreme...

I disagree with you, Robert. (If punctuation has no information content, why does it exist?) IMHO Mike's examples are not at all extreme, e.g. some punctuation tokens could be used to trigger position increment gaps.
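
(To make that idea concrete: a hedged sketch that assumes the tokenizer actually emits punctuation tokens rather than discarding them; the class name, the gap size, and the tiny set of sentence-ending characters are invented for illustration.)

```java
import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Removes sentence-ending punctuation tokens but turns each one into an extra
 * position gap, discouraging phrase/sloppy queries from matching across sentences.
 */
public final class SentenceGapFilter extends TokenFilter {
  private static final int SENTENCE_GAP = 10; // arbitrary example gap
  private static final Set<String> SENTENCE_ENDERS = Set.of(".", "!", "?", "。");

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

  public SentenceGapFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    int pendingGap = 0;
    while (input.incrementToken()) {
      if (SENTENCE_ENDERS.contains(termAtt.toString())) {
        // Swallow the punctuation token: keep its own increment (the hole)
        // and add an extra sentence gap on top of it.
        pendingGap += posIncAtt.getPositionIncrement() + SENTENCE_GAP;
        continue;
      }
      posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + pendingGap);
      return true;
    }
    return false; // a trailing gap at end-of-stream is simply dropped in this sketch
  }
}
```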

StandardTokenizer doesn't leave holes when it drops punctuation; I think holes should only be left for real 'words', for the most part.

"Standard"Tokenizer is drawn from Unicode UAX#29, which only describes word boundaries. Lucene has grafted onto these boundary rules an assumption that only alphanumeric "words" should be tokens - this assumption does not exist in the standard itself.

My opinion is that we should have both types of things: a tokenizer that discards non-alphanumeric characters between word boundaries, and a different type of analysis component that discards nothing. I think of the discard-nothing process as segmentation rather than tokenization, and I've argued for it previously.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK here's one possible fix...

Right now, when we are glomming up an UNKNOWN token, we glom only as long as the character class of each character is the same as that of the first character.

What if we also require that isPunct-ness is the same? That way we would never create an UNKNOWN token mixing punct and non-punct...

I implemented that and the tests seem to pass w/ offset checking fully turned on again...
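
(A runnable sketch of the proposed rule. The character-class and punctuation lookups below are deliberately simplified stand-ins; the real lookups live in Kuromoji's CharacterDefinition and the tokenizer's isPunctuation method, and the class and method names here are made up for illustration.)

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the proposed rule: when extending an UNKNOWN token, only glom on
 * characters whose character class AND punctuation-ness match the first character.
 */
public class UnknownRunSketch {

  // Deliberately coarse stand-in for Kuromoji's character-class lookup (the real
  // classes come from unk.def); with a class this coarse, only the added isPunct
  // requirement below keeps punctuation out of mixed UNKNOWN runs.
  static int charClass(char ch) {
    return Character.isWhitespace(ch) ? 0 : 1;
  }

  // Simplified stand-in for the tokenizer's isPunctuation check.
  static boolean isPunct(char ch) {
    switch (Character.getType(ch)) {
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  /** Splits text into the maximal runs a single UNKNOWN token would be allowed to cover. */
  static List<String> unknownRuns(String text) {
    List<String> runs = new ArrayList<>();
    int pos = 0;
    while (pos < text.length()) {
      int firstClass = charClass(text.charAt(pos));
      boolean firstPunct = isPunct(text.charAt(pos));
      int len = 1;
      while (pos + len < text.length()
          && charClass(text.charAt(pos + len)) == firstClass
          && isPunct(text.charAt(pos + len)) == firstPunct) { // the added requirement
        len++;
      }
      runs.add(text.substring(pos, pos + len));
      pos += len;
    }
    return runs;
  }

  public static void main(String[] args) {
    // 〇〇 followed by pure punctuation now splits into two runs instead of one
    // mixed UNKNOWN token, so dropping the punctuation cannot corrupt offsets.
    System.out.println(unknownRuns("〇〇'''、'''")); // [〇〇, '''、''']
  }
}
```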

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, implementing the idea above. I think it's ready...

It also means we can unconditionally check for correct offsets in graph tokens, which is nice.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I disagree with you, Robert. (If punctuation has no information content, why does it exist?) IMHO Mike's examples are not at all extreme, e.g. some punctuation tokens could be used to trigger position increment gaps.

Punctuation simply doesn't tell you anything about the document: this is a fact. If we start indexing punctuation we just create useless terms that go to every document.

Because of this, nobody wastes their time trying to figure out how to index "punctuation tokens". Mike's problem is basically the fact that he is creating a compound token of '??'.

Furthermore, the idea that 'if we don't leave a hole for anything removed, we are losing information' is totally arbitrary, confusing, and inconsistent anyway. How come we leave holes for definiteness in English but not for plurals in English, while in Arabic or Bulgarian we don't leave holes for definiteness, because it happens to be attached to the word and stemmed?

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

I'm not familiar with the various considerations that were made with StandardTokenizer, but please allow me to share some comments anyway.

Perhaps it's useful to distinguish between analysis for information retrieval and analysis for information extraction here?

I like Michael's and Steven's idea of doing tokenization that doesn't discard any information. This is certainly useful in the case of information extraction. For example, if we'd like to extract noun-phrases based on part-of-speech tags, we don't want to conjoin tokens in case there's a punctuation character between two nouns (unless that punctuation character is a middle dot).

Robert is of course correct that we generally don't want to index punctuation characters that occur in every document, so from an information retrieval point of view, we'd like punctuation characters removed.

If there's an established convention that Tokenizer variants discard punctuation and produce the terms that are meant to be directly searchable, it sounds like a good idea that we stick to that convention here as well.

If there's no established convention, it seems useful for a Tokenizer to provide as much detail as possible about the input text and leave it to downstream Filters/Analyzers to remove whatever is suitable for a particular processing purpose. We can provide common ready-to-use Analyzers with reasonable defaults that users can look to, e.g. to process a specific language or do another common high-level task with text.

Hence, perhaps each Tokenizer can decide what makes the most sense to do based on that particular tokenizer's scope of processing?

To Robert's point, this would leave processing totally arbitrary and inconsistent, but this would be by design, as it wouldn't be the Tokenizer's role to enforce any overall consistency (i.e. with regard to punctuation); higher-level Analyzers would provide that.

Thoughts?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Perhaps it's useful to distinguish between analysis for information retrieval and analysis for information extraction here?

Yes; since we are an information retrieval library, there is no sense in adding traps and complexity to our analysis API.

I like Michael's and Steven's idea of doing tokenization that doesn't discard any information.

For IR, this is definitely not information... calling it data is a stretch.

If there's an established convention that Tokenizer variants discard punctuation and produce the terms that are meant to be directly searchable, it sounds like a good idea that we stick to that convention here as well.

That's what the tokenizers do today: they find tokens (in the IR sense). So yeah, there is an established convention already. Changing this would be a monster trap, because suddenly tons of people would be indexing tons of useless punctuation. I would strongly oppose such a change.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This is certainly useful in the case of information extraction. For example, if we'd like to extract noun-phrases based on part-of-speech tags, we don't want to conjoin tokens in case there's a punctuation character between two nouns (unless that punctuation character is a middle dot).

The option still exists in Kuromoji (discardPunctuation=false) if you want to use it for this. I added this option because it originally kept the punctuation (for downstream filters to remove).
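
(A hedged sketch of using that expert option directly, assuming a recent Lucene where the tokenizer is named JapaneseTokenizer, takes discardPunctuation in its constructor, and receives its input via setReader; the 3.6-era KuromojiTokenizer API differed, and the sample text is invented.)

```java
import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KeepPunctuationDemo {
  public static void main(String[] args) throws Exception {
    // discardPunctuation=false: punctuation tokens are emitted so that downstream
    // filters (e.g. for information extraction) can use or remove them.
    try (JapaneseTokenizer tok = new JapaneseTokenizer(null, false, Mode.SEARCH)) {
      tok.setReader(new StringReader("今日は晴れ。明日は雨。"));
      CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
      tok.reset();
      while (tok.incrementToken()) {
        System.out.println(term);
      }
      tok.end();
    }
  }
}
```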

lucene-gosen worked the same way, and after some time I saw numerous examples across the internet where people simply configured the tokenizer without any filters, which meant huge amounts of punctuation being indexed by default. People don't pay attention to documentation or details, so all these people were getting bad performance.

Based on this experience, I didn't want keeping punctuation to be the default, nor even achievable via things like the Solr factories here. But I added the (expert) option to Kuromoji because it's really a more general-purpose thing for Japanese analysis, because it's already being used for other things, and because allowing a boolean was not expensive or complex to support.

But I don't consider this a bona fide option of the Lucene APIs; I would be strongly against adding this to the Solr factories, or as an option to KuromojiAnalyzer. And I don't think we should add such a thing to other tokenizers either.

Our general mission is search; if we want to decide that we are expanding our analysis API to be generally useful outside of information retrieval, that's a much bigger decision with more complex tradeoffs that everyone should vote on (e.g. moving analyzers outside of lucene.apache.org and everything).

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Bulk close for 3.6.1