Niger-Volta-LTI / iranlowo

Ìrànlọ́wọ́ is a utility library for analysis & (pre)processing of Yorùbá text → https://pypi.org/project/iranlowo
MIT License

Tokenizer #18

Olamyy closed 4 years ago

Olamyy commented 4 years ago

Introduces a tokenizer class. Three forms of tokenization are currently supported, and a fourth (subword) is planned:

  1. Word tokenization, based on gensim.
  2. Syllable tokenization, based on the paper "Development of a Syllabicator for Yoruba Language": https://www.researchgate.net/publication/321184495_Development_of_a_Syllabicator_for_Yoruba_Language
  3. Subword tokenization (not yet implemented).
  4. Sentence tokenization (the initial implementation already available in iranlowo).
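
For readers unfamiliar with why Yorùbá needs care here, below is a minimal standard-library sketch of word and sentence tokenization. It is not the tokenizer class from this PR and does not use gensim; `word_tokenize` and `sentence_tokenize` are hypothetical names chosen for illustration. The point it shows is that some Yorùbá letters (for example ọ́) have no single precomposed code point, so even after NFC normalization a combining tone mark remains, and a plain `\w+` pattern in Python's `re` would split the word at that mark.

```python
import re
import unicodedata

# Hypothetical helpers for illustration only; not iranlowo's API.
# A word is a run of letters, each optionally followed by combining
# diacritics (U+0300-U+036F), so tone marks stay attached to their letter.
_WORD_RE = re.compile(r"(?:[^\W\d_][\u0300-\u036f]*)+")

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def word_tokenize(text):
    """Return the words in text, keeping tone marks on their base letters."""
    text = unicodedata.normalize("NFC", text)
    return _WORD_RE.findall(text)


def sentence_tokenize(text):
    """Split text into sentences at terminal punctuation."""
    return [s for s in _SENTENCE_END_RE.split(text.strip()) if s]


if __name__ == "__main__":
    print(word_tokenize("Ìrànlọ́wọ́ jẹ́ ohun èlò."))
    print(sentence_tokenize("Báwo ni? Mo wà dáadáa. O ṣé!"))
```

The sentence rule is deliberately naive (it would break on abbreviations), and syllable tokenization is left out because it depends on the rules in the paper linked above.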