Niger-Volta-LTI / iranlowo

Ìrànlọ́wọ́ is a utility library for analysis & (pre)processing of Yorùbá text → https://pypi.org/project/iranlowo

MIT License

17 stars, 8 forks

Tokenizer feature #17

Open. Olamyy opened this issue 4 years ago

Olamyy commented 4 years ago

I'm thinking about adding a tokenizer class to the project. The API could look like this:

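# Proposed API: the iranlowo.tokenizer module does not exist yet.
from iranlowo.tokenizer import Tokenizer

text = "some text"
tokenizer = Tokenizer(text)
word_tokens = tokenizer.word_tokenize()          # split the text into words
sentence_tokens = tokenizer.sentence_tokenize()  # split the text into sentences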
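
To make the proposal concrete, here is a minimal sketch of what such a class might look like using only the standard library. Everything in it is an assumption for illustration, not the project's implementation: the regex-based splitting rules, the NFC normalization step, and the private `_WORD_RE` and `_SENTENCE_END_RE` names are all hypothetical. Combining marks are handled explicitly because Python's `\w` does not match them, which would otherwise split a word like Ìrànlọ́wọ́ at its tone marks.

```python
import re
import unicodedata


class Tokenizer:
    """Hypothetical regex-based tokenizer mirroring the proposed API."""

    # A word is one or more letters, each optionally followed by combining
    # diacritics (U+0300-U+036F). Digit runs and single punctuation
    # characters are emitted as separate tokens.
    _WORD_RE = re.compile(r"(?:[^\W\d_][\u0300-\u036f]*)+|\d+|[^\w\s]")

    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    # This naive rule will wrongly split after abbreviations.
    _SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")

    def __init__(self, text):
        # Normalize to NFC so composed and decomposed input tokenize alike.
        self.text = unicodedata.normalize("NFC", text)

    def word_tokenize(self):
        """Return the words, numbers, and punctuation marks as a list."""
        return self._WORD_RE.findall(self.text)

    def sentence_tokenize(self):
        """Return the sentences as a list of stripped strings."""
        parts = self._SENTENCE_END_RE.split(self.text.strip())
        return [part for part in parts if part]


if __name__ == "__main__":
    tokenizer = Tokenizer("Ìrànlọ́wọ́ is a utility library. It handles Yorùbá text!")
    print(tokenizer.word_tokenize())
    print(tokenizer.sentence_tokenize())
```

A rule-based splitter like this would need language-specific refinement before it could replace an existing tokenizer, but it shows that the proposed two-method interface needs no third-party dependency.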

ruohoruotsi commented 4 years ago

I think the API looks good. It would also simplify requirements.txt, which currently depends on NLTK etc. for tokenization in ADR.