emorynlp / nlp4j

NLP framework for JVM languages.
http://emorynlp.github.io/nlp4j/

Tokenizer is failing for some use cases #32

Open rakeshsubrahmanyam opened 6 years ago

rakeshsubrahmanyam commented 6 years ago

Query: 03-Aug-06
Expected output after Tokenization: 03-Aug-06
Actual output after Tokenization: 03 - Aug-06

Query: 03-Aug-2006
Expected output after Tokenization: 03-Aug-2006
Actual output after Tokenization: 03 - Aug-2006

Query: 03-Aug-2006 18:55:30.35
Expected output after Tokenization: 03-Aug-2006 18:55:30.35
Actual output after Tokenization: 03-Aug-2006 18:55:30.35

Query: 03Aug06 18:55:30.35
Expected output after Tokenization: 03 Aug 06 18:55:30.35
Actual output after Tokenization: 03Aug06

Query: 03Aug2006 18:55:30.35
Expected output after Tokenization: 03 Aug 2006 18:55:30.35
Actual output after Tokenization: 03Aug2006

Query: Jan 21, '97
Expected output after Tokenization: Jan 21 , ' 97
Actual output after Tokenization: Jan 21, '97

Query: 02/03/2000-03/03/2000
Expected output after Tokenization: 02/03/2000 - 03/03/2000
Actual output after Tokenization: 02/03/2000-03/03/2000

Query: 1990's
Expected output after Tokenization: 1990 ' s
Actual output after Tokenization: 1990's
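
In case it helps with reproducing this, the cases above could be collected into a small regression table along these lines. This is only a sketch: `TokenizerRegressionCases` and `failures` are names made up for illustration, and the tokenizer is passed in as a plain `Function<String, List<String>>` so that no particular NLP4J API is assumed. The whitespace splitter in `main` is a stand-in to be replaced with a call to the real tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical harness: query -> expected tokens, joined by single spaces.
class TokenizerRegressionCases {
    static final Map<String, String> EXPECTED = new LinkedHashMap<>();
    static {
        // Expected outputs copied verbatim from the report.
        EXPECTED.put("03-Aug-06", "03-Aug-06");
        EXPECTED.put("03-Aug-2006", "03-Aug-2006");
        EXPECTED.put("03-Aug-2006 18:55:30.35", "03-Aug-2006 18:55:30.35");
        EXPECTED.put("03Aug06 18:55:30.35", "03 Aug 06 18:55:30.35");
        EXPECTED.put("03Aug2006 18:55:30.35", "03 Aug 2006 18:55:30.35");
        EXPECTED.put("Jan 21, '97", "Jan 21 , ' 97");
        EXPECTED.put("02/03/2000-03/03/2000", "02/03/2000 - 03/03/2000");
        EXPECTED.put("1990's", "1990 ' s");
    }

    // Runs every query through the given tokenizer and returns one
    // description per case whose joined tokens differ from the expectation.
    static List<String> failures(Function<String, List<String>> tokenizer) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, String> e : EXPECTED.entrySet()) {
            String actual = String.join(" ", tokenizer.apply(e.getKey()));
            if (!actual.equals(e.getValue())) {
                failed.add(e.getKey() + " -> expected [" + e.getValue()
                        + "] but got [" + actual + "]");
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Stand-in tokenizer (assumption): plain whitespace splitting.
        // Swap this lambda for a call to the tokenizer under test.
        Function<String, List<String>> whitespace =
                q -> Arrays.asList(q.trim().split("\\s+"));
        failures(whitespace).forEach(System.out::println);
    }
}
```

An empty list from `failures` would mean every case in the report tokenizes as expected.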