GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License
93 stars 15 forks

Feature/emoticon handler [resolves #264] #289

ertugrul-dmr closed 2 years ago

ertugrul-dmr commented 2 years ago

Basic Usage:

from sadedegel.bblock.tokenizers import ICUTokenizer
text = "komik:))"
tokenizer = ICUTokenizer(emoticon=False)
tokenizer(text)
>>output ['komik', ':', ')', ')']

### If emoticon is set to True:

tokenizer = ICUTokenizer(emoticon=True)
tokenizer(text)
>>output ['komik', ':))']
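
For readers without sadedegel installed, the behaviour above can be approximated with a small regex-based sketch. This is only an illustration of the idea and not the ICUTokenizer implementation from this PR: the function name `sketch_tokenize` and the emoticon pattern are hypothetical, and the pattern covers only a small subset of emoticons.

```python
import re

# Hypothetical emoticon pattern (an assumption, not sadedegel's own):
# eyes, an optional "-" nose, then one or more mouth characters.
EMOTICON = r"[:;=]-?[()\[\]DPp]+"

# Alternation order matters: emoticons are tried before words and
# single punctuation characters, so ":))" is kept as one token.
EMOTICON_RE = re.compile(rf"{EMOTICON}|\w+|[^\w\s]")

# Without emoticon handling, every punctuation character is its own token.
PLAIN_RE = re.compile(r"\w+|[^\w\s]")


def sketch_tokenize(text, emoticon=False):
    """Split text into word and punctuation tokens, optionally keeping emoticons whole."""
    pattern = EMOTICON_RE if emoticon else PLAIN_RE
    return pattern.findall(text)


print(sketch_tokenize("komik:))", emoticon=False))
print(sketch_tokenize("komik:))", emoticon=True))
```

With `emoticon=False` the sketch splits `:))` into three punctuation tokens, and with `emoticon=True` it keeps `:))` as a single token, mirroring the two outputs shown above. A pattern this simple will also match sequences such as `:P` inside ordinary text, so a real handler would need a more careful rule set.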

dafajon commented 2 years ago

Thanks for the contribution. In addition, could you report, later in this PR, the new results of the social media/comment-based prebuilt models optimized with this feature?

ertugrul-dmr commented 2 years ago

> Thanks for the contribution. In addition, could you report, later in this PR, the new results of the social media/comment-based prebuilt models optimized with this feature?

Done. The gains are really small F1-wise, about 0.005, but the feature might be more useful for text data that has not been preprocessed...