GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License
92 stars 15 forks

Hashtag, Emoji and Mention Handler not Working #273

Open ertugrul-dmr opened 3 years ago

ertugrul-dmr commented 3 years ago

While working on some preprocessing steps (#270), we noticed that the arguments passed to the tokenizers (including Text2Doc) have no effect on the outcome. In other words, the enable/disable options are not working.

For this issue we might need to create test cases and find the root cause of the problem.
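
As a rough idea of what such a test case could look like, here is a minimal sketch that checks whether toggling a single option changes the tokens at all. Note that `option_changes_output` and `make_toy_tokenizer` are hypothetical names made up for this illustration; the toy tokenizer is a regex stand-in and is not how sadedegel tokenizes. Its `emoji` flag is ignored on purpose, to show the kind of silent no-op the check is meant to catch.

```python
import re


def option_changes_output(make_tokenizer, text, option):
    """Return True if switching `option` on/off changes the tokens for `text`."""
    enabled = make_tokenizer(**{option: True})(text)
    disabled = make_tokenizer(**{option: False})(text)
    return enabled != disabled


def make_toy_tokenizer(hashtag=False, mention=False, emoji=False):
    """Toy stand-in (NOT sadedegel's implementation); returns a tokenize function."""
    patterns = []
    if hashtag:
        patterns.append(r"#\w+")  # keep '#bir' as one token
    if mention:
        patterns.append(r"@\w+")  # keep '@metindir' as one token
    # `emoji` is deliberately ignored here, to mimic an option with no effect.
    patterns.append(r"\w+|[^\w\s]")  # fallback: words or single symbols
    regex = re.compile("|".join(patterns))
    return regex.findall


text = 'bu #bir @metindir, 🍰'
for option in ("hashtag", "mention", "emoji"):
    print(option, option_changes_output(make_toy_tokenizer, text, option))
```

With the toy tokenizer, the check reports that `hashtag` and `mention` change the output while `emoji` does not. To run the same check against the library, the factory would presumably wrap the ICUTokenizer call from the reproduction below (for example `lambda **kw: ICUTokenizer(**kw)._tokenize`), but that wiring is an assumption and has not been tested here.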

To reproduce this, run the following:

from sadedegel.bblock.word_tokenizer import ICUTokenizer
tokenizer = ICUTokenizer(hashtag=True, mention=True, emoji=True)
text = 'bu #bir @metindir, 🍰'
print(tokenizer._tokenize(text))

The output is the same whether the options are set to True or False. The same goes for the Text2Doc example below:

from sadedegel.extension.sklearn import Text2Doc
tokenizer = Text2Doc(hashtag=True, mention=True, emoji=True)
texts = ['bu', '#bir', '@metindir', '🍰']
print(tokenizer.transform(texts))

husnusensoy commented 3 years ago

Please refer to PR #274, which shows that this is not a bug but the expected behavior.