Closed wannaphong closed 3 years ago
Could you please explain a bit more about what needs to be done here?
I need dependency parsing for the Thai language in PyThaiNLP, but I don't have a Thai treebank corpus for a dependency parser.
Test with Parallel Universal Dependencies https://github.com/wannaphongcom/thai-dep
I've just tried to make a dependency parser for Thai with pythainlp and UDPipe, and released it as spaCy-Thai. I'm not sure whether its outputs are accurate, but the word lengths seem to differ between pythainlp and UD_Thai-PUD. Umm...
As I understand, spaCy-Thai combines 3 parts into a pipeline
The word-length difference comes from the first part. We may need to use a custom word list so that the tokenization produces the same word lengths. Maybe you can use pythainlp tokenization directly instead of going through spaCy.
We have a part-of-speech model for the Parallel Universal Dependencies (PUD) treebanks: pos_tag (=pud) https://www.thainlp.org/pythainlp/docs/2.2/api/tag.html#pythainlp.tag.pos_tag
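To illustrate why the word list affects word lengths, here is a minimal sketch of greedy longest-match segmentation in plain Python. It is a toy stand-in, not pythainlp's actual tokenization algorithm, and the function name `longest_match_tokenize` and both word lists are made up for the example.

```python
def longest_match_tokenize(text, words):
    """Greedy longest-match segmentation against a word list (toy sketch)."""
    vocab = set(words)
    max_len = max((len(w) for w in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # so the loop always makes progress on unknown text.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


text = "ฝนตกที่ทะเล"
small = ["ฝน", "ตก", "ที่", "ทะเล"]   # hypothetical word list without the compound
large = small + ["ฝนตก"]             # same list plus the compound

print(longest_match_tokenize(text, small))
print(longest_match_tokenize(text, large))
```

With the compound in the list, the first token covers four characters instead of two, so the token boundaries no longer line up with a treebank that splits the two words.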
I think that would be appropriate for your job.
In th_pud-ud-orchid.conllu of spaCy-Thai I use both PUD-origin UPOS tags and pythainlp-origin ORCHID tags. As shown in train.sh, I use the hyperparameters embedding_upostag=10 for UPOS tags and embedding_xpostag=30 for ORCHID tags. I've tried some other parameters, but either PUD-only or ORCHID-only was worse.
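For readers unfamiliar with how one file can carry both tag sets: in the CoNLL-U format, each token line has ten tab-separated columns, with UPOS in the fourth and the language-specific XPOS in the fifth. The sketch below parses one such line in plain Python. The sample row and its tags are made up for illustration and are not taken from th_pud-ud-orchid.conllu; `parse_token_line` is a hypothetical helper.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    return dict(zip(FIELDS, values))


# Hypothetical row: UPOS (NOUN) and an ORCHID-style XPOS (NCMN) side by side.
row = "1\tฝน\t_\tNOUN\tNCMN\t_\t2\tnsubj\t_\tSpaceAfter=No"
token = parse_token_line(row)
print(token["form"], token["upos"], token["xpos"])
```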
Have you tried LST20 corpus (tags mapped to Universal POS tags) yet? pos_tag (lst20_ud) (PyThaiNLP 2.2.4+ Only) https://www.thainlp.org/pythainlp/docs/2.2/api/tag.html#pythainlp.tag.pos_tag
I haven't sent any new tags to spaCy.
Have you tried LST20 corpus (tags mapped to Universal POS tags) yet?
No, I have not. I first started with tag_map.py of spacy.lang.th, so I've been using ORCHID and UPOS. Do you plan to include an LST20 tag_map in an official release of spaCy in the future?
I'm working on this pull request: https://github.com/explosion/spaCy/pull/6163
@KoichiYasuoka They're phasing out tag maps in the core spacy library in v3. https://github.com/explosion/spaCy/pull/6163#issuecomment-704803365
I'm not so familiar with spaCy v3, but as I understand it, spaCy's Token.tag_ and Token.pos_ are now independent of one another. I had given up on using spaCy's train for th_pud-ud-orchid.conllu, so... well... how should I change spaCy-Thai when v3 launches...
If there's a problem with the Thai tokenizer, you can use a custom tokenizer with your own word list. Here's how to change it in spaCy.
!pip install pythainlp
import pythainlp
from pythainlp.corpus import ttc
# create custom tokenizer
min_words = [w for w,_ in ttc.word_freqs()] # can add to it
tok = pythainlp.Tokenizer(min_words)
from spacy.lang.th import Thai
nlp = Thai()
nlp.tokenizer.word_tokenize = tok.word_tokenize # change to custom
[t.text for t in nlp('ฝนตกที่ทะเล')]  # token texts, not Token objects
# ['ฝน', 'ตก', 'ที่', 'ทะเล'] because no 'ฝนตก' in min_words
I will close this issue and add spaCy-Thai to the PyThaiNLP documentation as the recommended Thai dependency parser.
Thank you @KoichiYasuoka for the dependency parser.
We still need a Thai treebank corpus for dependency parsing. If you are interested, please reply in this issue.