PolMine / cwbtools

Tools to create and manage CWB-indexed corpora

Build my own corpus for a specific language with part-of-speech tags #41

Open svjack opened 3 years ago

svjack commented 3 years ago

I think that if I want to use a specific tokenizer (for processing languages such as CJK) to build a corpus with part-of-speech annotation, I should implement my own tokenstream, assign it to a CorpusData object, and call the encode method to encode it. Then, with the help of the decode function in https://github.com/PolMine/polmineR, I can run CQP queries on my own corpus. (In that case I would only need to install cwbtools and polmineR, without needing the tools from http://cwb.sourceforge.net/devs.php.)
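To make the idea concrete, here is a minimal sketch of the data shape I have in mind: one row per token with a corpus position, the word form, and the tag from my own tokenizer. It is written in Python only to illustrate the structure and does not call cwbtools. The `toy_tokenizer` function, the column names `cpos`, `doc_id`, `word`, `pos`, and the sample sentences are all assumptions for illustration, not something taken from the cwbtools documentation. The second helper writes the same rows in the one-token-per-line, tab-separated layout that CWB encoding tools read.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Token = Tuple[str, str]  # (word form, part-of-speech tag)


def toy_tokenizer(text: str) -> List[Token]:
    """Hypothetical stand-in for a real CJK tokenizer/tagger.

    Expects pre-tagged input such as "word/TAG word/TAG"; a real
    tokenizer would segment and tag raw text instead.
    """
    tokens = []
    for item in text.split():
        word, sep, tag = item.rpartition("/")
        if not sep:  # no tag given: keep the word, mark the tag as unknown
            word, tag = item, "UNK"
        tokens.append((word, tag))
    return tokens


def build_tokenstream(
    docs: Iterable[Tuple[str, str]],
    tokenizer: Callable[[str], List[Token]],
) -> List[Dict[str, object]]:
    """Return one row per token with a zero-based corpus position."""
    rows = []
    cpos = 0
    for doc_id, text in docs:
        for word, pos in tokenizer(text):
            rows.append({"cpos": cpos, "doc_id": doc_id, "word": word, "pos": pos})
            cpos += 1
    return rows


def to_vrt(rows: List[Dict[str, object]]) -> str:
    """Serialize rows as verticalized text: one token per line,
    attributes separated by tabs, documents wrapped in <text> tags."""
    lines = []
    current = None
    for row in rows:
        if row["doc_id"] != current:
            if current is not None:
                lines.append("</text>")
            lines.append('<text id="%s">' % row["doc_id"])
            current = row["doc_id"]
        lines.append("%s\t%s" % (row["word"], row["pos"]))
    if current is not None:
        lines.append("</text>")
    return "\n".join(lines)


# Made-up sample documents, already tagged for the toy tokenizer.
docs = [("d1", "我/PN 喜欢/VV 语料库/NN"), ("d2", "东京/NR 很/AD 大/VA")]
rows = build_tokenstream(docs, toy_tokenizer)
vrt = to_vrt(rows)
```

If `pos` ends up encoded as a positional attribute next to `word`, I would expect a CQP token expression such as `[pos = "NN"]` to address it, but that is exactly the part I would like confirmed.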

Am I right about this?

Also, can the lexer used to parse CQP queries match the “pos” attribute that I define with my own tokenizer?