OpenPecha / Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
https://botok.readthedocs.io/
Apache License 2.0

What's the tagset used by pybo? #31

BLKSerene closed this issue 5 years ago

BLKSerene commented 5 years ago

Hi, the tagset used by pybo is not documented. It seems to me that pybo uses the UD POS tags, but its tagset is not identical to UD's.

Some additional POS tags are:
OOV (unknown words?) -> X?
OTHER (punctuation marks and symbols?) -> SYM/X?
non-word (non-Tibetan word or punctuation marks?) -> X?

Also, "punct" is lowercase; it should be mapped to PUNCT (as per the description of the UD POS tags).
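For illustration, the mapping I have in mind could be sketched roughly as below. The UD targets for OOV, OTHER and non-word are only my guesses from above (I picked SYM for OTHER, but X is equally possible), and `PYBO_TO_UD` and `to_ud` are hypothetical names, not part of pybo.

```python
# Hypothetical mapping from the pybo tags mentioned above to UD POS tags.
# The targets are guesses, not confirmed by the pybo maintainers.
PYBO_TO_UD = {
    "OOV": "X",        # unknown words (guess)
    "OTHER": "SYM",    # punctuation marks and symbols (guess; could be X)
    "non-word": "X",   # non-Tibetan words or punctuation (guess)
    "punct": "PUNCT",  # lowercase in pybo, uppercase in UD
}


def to_ud(tag: str) -> str:
    """Return the UD tag for a pybo tag; unmapped tags pass through unchanged."""
    return PYBO_TO_UD.get(tag, tag)
```

Tags that are already valid UD tags are returned unchanged, so only the pybo-specific ones need an entry in the table.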

I'm not sure whether other POS tags are used. Could you please list all possible POS tags and give a short description of each?

drupchen commented 5 years ago

The tagset is a simplification of the Tibetan in Digital Communication (TiDC) tagset. If POS tagging for Tibetan is of interest to you, I think that is the best source you can get, besides the tags implemented on the basis of traditional Tibetan grammar by Monlam in his dictionary.

Right now, POS tag support is not a feature that I put forward, because I am not satisfied with what pybo does with the tags at the moment; still, having those from TiDC is better than nothing. I know Edward Garrett is building his own tagset from TiDC while making it conform to UD, so that may be better than what pybo uses right now. By the way, changing the set is pretty straightforward: you just need to update or replace this file.

Also keep in mind that I distinguish Token#tag from Token#pos: the tag attribute comes from pybo's preprocessing. The tag is simply copied to the pos attribute in cases where no POS is available in the above-mentioned file.
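To make the fallback concrete, here is a minimal sketch of the idea, not pybo's actual code. The `Token` dataclass, the `content` field, `assign_pos` and the `pos_lookup` dictionary are stand-ins I made up for illustration; the real class and the real POS file look different.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Token:
    """Illustrative stand-in for a token; not pybo's real Token class."""
    content: str
    tag: str                   # set during preprocessing
    pos: Optional[str] = None  # filled in from the POS resource, if any


def assign_pos(token: Token, pos_lookup: Dict[str, str]) -> Token:
    """Use the POS from the lookup when present, else copy the tag."""
    token.pos = pos_lookup.get(token.content, token.tag)
    return token


# Hypothetical lookup standing in for the POS file mentioned above.
pos_lookup = {"word_a": "NOUN"}

known = assign_pos(Token("word_a", tag="word"), pos_lookup)
unknown = assign_pos(Token("word_b", tag="non-word"), pos_lookup)
```

Here `known.pos` comes from the lookup, while `unknown.pos` is just a copy of its tag.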

Hope this helps a bit.

BLKSerene commented 5 years ago

Thanks a lot.