I think that if I want to use a specific tokenizer (for processing languages such as CJK) to build a corpus with part-of-speech annotation,
I should implement my own tokenstream, set it on the CorpusData object,
and call the encode method to format it.
Then, with the help of the decode function in https://github.com/PolMine/polmineR
I can run CQP queries on my own corpus.
(In that case I would only need to install cwbtools and polmineR, without needing anything from
http://cwb.sourceforge.net/devs.php)
Is this correct?
Also, can the lexer used to parse CQP queries match the “pos” values defined by my own tokenizer?