ajenhl / tacl

Tool for performing basic text analysis on the CBETA corpus
GNU General Public License v3.0

Add tokenizer for the Pagel Tibetan corpus #12

Closed by ajenhl 10 years ago

ajenhl commented 10 years ago

To let TACL operate on the (extracted) Pagel Tibetan corpus documents, add a suitable tokenizer (tokens are separated by whitespace) and a way to specify that this tokenizer should be used when generating n-grams and producing reports.
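As a rough illustration of the request, here is a minimal sketch of a whitespace tokenizer that can be selected by name when generating n-grams. None of these names (`tokenize_whitespace`, `tokenize_characters`, `TOKENIZERS`, `count_ngrams`, the `"pagel"` and `"cbeta"` keys) come from TACL; they are hypothetical, and the character tokenizer is only a simplified stand-in for whatever TACL does with CBETA texts.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def tokenize_whitespace(text: str) -> List[str]:
    """Split on any run of whitespace (hypothetical Pagel tokenizer)."""
    return text.split()


def tokenize_characters(text: str) -> List[str]:
    """Treat each non-whitespace character as a token.

    Simplified stand-in for a CBETA-style tokenizer; an assumption,
    not TACL's actual behaviour.
    """
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry: name -> (tokenizer function, token joiner).
# The joiner is the string used to glue tokens back into an n-gram.
TOKENIZERS: Dict[str, Tuple[Callable[[str], List[str]], str]] = {
    "cbeta": (tokenize_characters, ""),
    "pagel": (tokenize_whitespace, " "),
}


def count_ngrams(text: str, n: int, tokenizer_name: str) -> Counter:
    """Count n-grams of size n using the tokenizer registered under a name."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokenize, joiner = TOKENIZERS[tokenizer_name]
    tokens = tokenize(text)
    # Slide a window of n tokens across the token list.
    grams = (
        joiner.join(tokens[i:i + n])
        for i in range(len(tokens) - n + 1)
    )
    return Counter(grams)
```

Selecting the tokenizer by name would let a command-line option or configuration value choose it, so the same n-gram and reporting code could serve both corpora; for example, `count_ngrams(text, 2, "pagel")` counts bigrams over whitespace-separated tokens.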