BreakIteratorSegmenter turns hyphens to separate tokens

codeaudit / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl


BreakIteratorSegmenter turns hyphens to separate tokens #98

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

Situation: a text with a lot of words which have been hyphenated just for 
line-wrapping reasons, e.g.

We, therefore, the represen-
tatives of the United States

As it turned out, the BreakIteratorSegmenter turns these hyphens into separate
tokens, which was not what I expected.

Is there a use case where separate tokens for hyphens are desired?

See the attached test case for reproducing the issue.

Original issue reported on code.google.com by eckle.kohler on 18 Oct 2012 at 6:22

Attachments:

PedocsCleanerTest.java

GoogleCodeExporter commented 9 years ago

The BreakIteratorSegmenter is more technically than linguistically motivated.
This is just how the underlying Java BreakIterator behaves.
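
To see that behavior in isolation, here is a minimal sketch that calls the plain java.text.BreakIterator word instance on the example input used later in this comment. It is not the BreakIteratorSegmenter code itself: the class name, the tokenize helper, and the choice to drop whitespace-only pieces are assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorHyphenDemo {

    // Hypothetical helper: returns every non-whitespace piece found between
    // two consecutive word boundaries reported by the JDK BreakIterator.
    public static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String piece = text.substring(start, end);
            // The iterator also reports runs of whitespace; skip those.
            if (!piece.trim().isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Example input from this thread: a hyphen left over from line-wrapping.
        System.out.println(tokenize("ihre Negativbei- spiele immer", Locale.GERMAN));
    }
}
```

Consistent with what is reported in this issue, the hyphen should show up as a token of its own rather than attached to "Negativbei".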

We could consider adding a parameter to merge a word token and a hyphen
token if they are directly adjacent and the token following the hyphen is not
directly adjacent.
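
A rough sketch of what such a merge step could look like, assuming tokens are available as begin/end character offsets. The Span class, the method names, and the treatment of a hyphen at the very end of the text (it is merged too) are hypothetical and not part of DKPro Core.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HyphenMergeSketch {

    /** Hypothetical token span: [begin, end) character offsets into the text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Merges "word" + "-" into one token when the hyphen touches the word and
    // the token after the hyphen (if any) does not touch the hyphen.
    public static List<Span> mergeTrailingHyphens(String text, List<Span> tokens) {
        List<Span> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Span word = tokens.get(i);
            if (i + 1 < tokens.size()) {
                Span hyphen = tokens.get(i + 1);
                boolean isHyphen = text.substring(hyphen.begin, hyphen.end).equals("-");
                boolean touchesWord = word.end == hyphen.begin;
                boolean nextDetached = i + 2 >= tokens.size()
                        || tokens.get(i + 2).begin > hyphen.end;
                if (isHyphen && touchesWord && nextDetached) {
                    merged.add(new Span(word.begin, hyphen.end));
                    i += 2; // consume both the word and the hyphen
                    continue;
                }
            }
            merged.add(word);
            i++;
        }
        return merged;
    }

    // Convenience helper: the covered text of each span.
    public static List<String> texts(String text, List<Span> spans) {
        List<String> out = new ArrayList<>();
        for (Span s : spans) {
            out.add(text.substring(s.begin, s.end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "ihre Negativbei- spiele immer";
        // Spans as a segmenter that splits off the hyphen would produce them.
        List<Span> tokens = Arrays.asList(
                new Span(0, 4), new Span(5, 15), new Span(15, 16),
                new Span(17, 23), new Span(24, 29));
        // prints [ihre, Negativbei-, spiele, immer]
        System.out.println(texts(text, mergeTrailingHyphens(text, tokens)));
    }
}
```

With this rule a hyphen inside a compound such as "a-b" is left alone, because the token after the hyphen is directly adjacent to it.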

I did a little test with the other segmenters on the following example:

Input: "ihre Negativbei- spiele immer"
Expected: "ihre", "Negativbei-", "spiele", "immer" 

BreakIteratorSegmenter: "-" is separate token
OpenNlpSegmenter: as expected
StanfordSegmenter: "-" is separate token
LanguageToolSegmenter: as expected

Original comment by richard.eckart on 21 Oct 2012 at 11:28

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 16 Feb 2013 at 10:56

Changed title: [tokit] BreakIteratorSegmenter turns hyphens to separate tokens

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 16 Feb 2013 at 11:04

Changed title: BreakIteratorSegmenter turns hyphens to separate tokens
Added labels: Module-tokit

GoogleCodeExporter commented 9 years ago

So I'm not sure what to do here. The BreakIterator does split like this, and
we have other segmenters that are smarter. I'd just close this issue as
WontFix...

Original comment by richard.eckart on 18 Mar 2013 at 4:44

GoogleCodeExporter commented 9 years ago

StanfordSegmenter also splits the hyphen into a separate token. 
OpenNlpSegmenter and LanguageToolSegmenter do not.

Original comment by richard.eckart on 19 Mar 2013 at 6:52

GoogleCodeExporter commented 9 years ago

Closing this. It's how the BreakIterator works. Use a different segmenter if
this behavior is not good for a certain use case.

Original comment by richard.eckart on 24 Jun 2013 at 10:45

Changed state: WontFix