fnl / segtok

Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.
http://fnl.es/segtok-a-segmentation-and-tokenization-library.html
MIT License

Currency and percentages #5

Closed dineshbvadhia closed 9 years ago

dineshbvadhia commented 9 years ago

Ran the simple

segmenter data | tokenizer

and noticed that currency values are split from the currency symbol, e.g. $4.12 -> $ 4.12. The same happens for percentages, e.g. %7.1 -> % 7.1.

fnl commented 9 years ago

Yes, and that's what you would expect from a correct tokenizer. Check the Penn PoS tags; this tokenization strategy is the de facto standard.

Florian

fnl commented 9 years ago

The Penn PoS tag set treats these as separate tokens: numbers are tagged CD, symbol tokens for things like percentages are tagged SYM, $ is the tag for currency tokens/symbols, and so forth.

In other words, the current behavior of segtok is the correct one and I am therefore closing this issue.
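
For anyone who still needs the symbol attached to the number downstream, the split tokens can be merged again in a post-processing step. Below is a minimal sketch, assuming the tokenizer output is available as a plain list of strings. The helper name `rejoin_symbols`, the `NUMBER` pattern, and the `CURRENCY` set are hypothetical and not part of segtok.

```python
import re

# Assumption: a "number" token looks like 4.12, 7, or 1,000.50.
NUMBER = re.compile(r"^\d+(?:[.,]\d+)*$")
# Assumption: a small, non-exhaustive set of currency symbols.
CURRENCY = {"$", "€", "£", "¥"}


def rejoin_symbols(tokens):
    """Re-attach currency and percent symbols to an adjacent number token.

    Hypothetical helper, not part of segtok.
    """
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in CURRENCY | {"%"} and nxt is not None and NUMBER.match(nxt):
            # Symbol followed by a number: "$", "4.12" becomes "$4.12"
            merged.append(tok + nxt)
            i += 2
        elif tok == "%" and merged and NUMBER.match(merged[-1]):
            # Trailing percent sign: "7.1", "%" becomes "7.1%"
            merged.append(merged.pop() + tok)
            i += 1
        else:
            merged.append(tok)
            i += 1
    return merged


if __name__ == "__main__":
    print(rejoin_symbols(["It", "costs", "$", "4.12", "."]))
    print(rejoin_symbols(["up", "%", "7.1"]))
```

This keeps the tokenizer output itself in the Penn-style form and only merges tokens in the consumer that needs the joined form.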