Closed dineshbvadhia closed 9 years ago
Yes, and that's what you would expect from a correct tokenizer. Check the Penn PoS tags; this tokenization strategy is the de facto standard.
Florian
On Jun 14, 2015 10:14 AM, "dineshbvadhia" notifications@github.com wrote:
Ran the simple
segmenter data | tokenizer
and noticed that currency values are split from the currency symbol, e.g. $4.12 -> $ 4.12. Similarly for percentages, e.g. %7.1 -> % 7.1.
The Penn PoS tags treat these as separate tokens: numbers get the tag CD, symbol tokens such as the percent sign are tagged SYM, $ is the tag for currency tokens/symbols, and so forth.
In other words, the current behavior of segtok is the correct one, and I am therefore closing this issue.
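For anyone who wants to see the splitting in isolation, here is a minimal sketch using only the Python standard library. It is not segtok's implementation: the `TOKEN` pattern and the `simple_tokenize` and `rough_tag` helpers are hypothetical names made up for illustration, and the tag mapping simply follows the description given in this thread.

```python
import re

# Hypothetical pattern, NOT segtok's actual rules: a number with an optional
# decimal part, a run of word characters, or any single non-space symbol.
TOKEN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text so that symbols such as $ and % become their own tokens."""
    return TOKEN.findall(text)


def rough_tag(token):
    """Assign the tags mentioned in this thread; anything else stays untagged."""
    if re.fullmatch(r"\d+(?:\.\d+)?", token):
        return "CD"   # number
    if token == "$":
        return "$"    # currency symbol
    if token == "%":
        return "SYM"  # symbol, as described in the comment above
    return None


for example in ("$4.12", "%7.1"):
    tokens = simple_tokenize(example)
    print(example, "->", " ".join(tokens), [rough_tag(t) for t in tokens])
```

With this sketch, `$4.12` comes out as the two tokens `$` and `4.12`, and `%7.1` as `%` and `7.1`, so each token can carry its own tag. If you need the amount and its symbol kept together, that is probably easiest to do as a post-processing step on the token list.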