tokenizing '20th' to '2','0','th'

cbaziotis / ekphrasis

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).

MIT License

661 stars, 90 forks

tokenizing '20th' to '2','0','th' #30

Open KavishBhatia opened 3 years ago

KavishBhatia commented 3 years ago

How can I keep this as one token instead of splitting it? Where does this tokenization happen?

AzharSultan commented 2 years ago

It happens in the default pipeline of the tokenizer here. You can pass a custom pipeline to the tokenizer; removing "EMOJI" from that pipeline fixes the problem.
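
To illustrate why dropping one entry from an ordered pipeline can change how '20th' is split, here is a minimal, self-contained sketch. It is a toy model, not ekphrasis's actual code: the `PATTERNS` table, the `build_tokenizer` helper, and the regexes are all hypothetical. It assumes the "EMOJI" pattern also matches keycap base characters such as digits (Unicode lists 0-9, '#' and '*' as emoji characters), which would explain the behaviour reported above.

```python
import re

# Toy patterns for illustration only; these are NOT the regexes ekphrasis ships.
# Assumption: the "EMOJI" pattern also matches keycap base characters
# (digits, '#', '*'), so it can claim single digits before "WORD" sees them.
PATTERNS = {
    "EMOJI": r"[0-9#*\U0001F300-\U0001FAFF]",
    "WORD": r"[A-Za-z0-9]+",
    "OTHER": r"\S",
}


def build_tokenizer(pipeline):
    """Join the named patterns, in pipeline order, into one alternation regex."""
    regex = re.compile("|".join("(?:%s)" % PATTERNS[name] for name in pipeline))
    # Earlier alternatives win, so the order of the pipeline matters.
    return regex.findall


default_tokenize = build_tokenizer(["EMOJI", "WORD", "OTHER"])
custom_tokenize = build_tokenizer(["WORD", "OTHER"])  # same pipeline without "EMOJI"

print(default_tokenize("the 20th"))  # ['the', '2', '0', 'th']
print(custom_tokenize("the 20th"))   # ['the', '20th']
```

In ekphrasis itself the equivalent step would be passing your own pipeline list, without "EMOJI", to the tokenizer as described above. Note that doing so presumably also stops emojis from being recognised as their own tokens.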