emorynlp / nlp4j-tokenization

Tokenize raw texts into tokens and sentences.

Fixed offsets in addSymbol tokenization method #4

Closed by spraynasal 8 years ago

spraynasal commented 8 years ago

An invalid bIndex caused subsequent addSymbols tokenization to produce invalid start and end offsets.
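
To illustrate the kind of offset invariant at stake, here is a minimal sketch. It is not the NLP4J implementation: the `Token` class, `splitSymbols`, and `offsetsAreConsistent` are hypothetical names made up for this example, and the splitting rule (runs of letters or digits versus single symbol characters) is an assumption. It only shows how a wrong base index shifts every offset produced afterwards.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {
    /** Hypothetical token: surface form plus [start, end) offsets into the raw text. */
    static final class Token {
        final String form;
        final int start;
        final int end;

        Token(String form, int start, int end) {
            this.form = form;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Splits a chunk into runs of letters/digits and single symbol characters.
     * bIndex is assumed to be the offset of the chunk's first character in the raw text.
     */
    static List<Token> splitSymbols(String chunk, int bIndex) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < chunk.length()) {
            int j = i + 1;
            if (Character.isLetterOrDigit(chunk.charAt(i))) {
                // Extend the run while the characters are still letters or digits.
                while (j < chunk.length() && Character.isLetterOrDigit(chunk.charAt(j))) {
                    j++;
                }
            }
            // Offsets are relative to the raw text, so the base index is added to both ends.
            tokens.add(new Token(chunk.substring(i, j), bIndex + i, bIndex + j));
            i = j;
        }
        return tokens;
    }

    /** Checks that every token's offsets point back at its own surface form. */
    static boolean offsetsAreConsistent(String raw, List<Token> tokens) {
        for (Token t : tokens) {
            if (t.start < 0 || t.end > raw.length() || t.start > t.end) {
                return false;
            }
            if (!raw.substring(t.start, t.end).equals(t.form)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String raw = "Hi (there)!";
        String chunk = "(there)!";
        int bIndex = raw.indexOf(chunk);

        // Correct base index: offsets map back onto the raw text.
        System.out.println(offsetsAreConsistent(raw, splitSymbols(chunk, bIndex)));
        // Wrong base index: every subsequent offset is shifted.
        System.out.println(offsetsAreConsistent(raw, splitSymbols(chunk, 0)));
    }
}
```

A check of this shape, run over the inputs that already pass and over the new edge cases, would show whether a change to the base index fixes some cases while breaking others.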

jdchoi77 commented 8 years ago

Thank you!

jdchoi77 commented 8 years ago

Actually, this interpretation is not correct; I'll revert the code back. Please see our unit test EnglishTokenizerOffsetTest.

spraynasal commented 8 years ago

You are absolutely right, I missed the unit test! I had come across some edge cases that I was trying to solve. While this change fixes some of them, it also breaks others that worked before. I'll review everything and then submit another pull request!