PyThaiNLP / pythainlp

Thai natural language processing in Python
https://pythainlp.org/
Apache License 2.0

Thai dependency parser #218

Closed by wannaphong 3 years ago

wannaphong commented 5 years ago

We need a Thai treebank corpus for a dependency parser. If you are interested, please reply in this issue.

p16i commented 5 years ago

Could you please explain a bit more about what needs to be done here?

wannaphong commented 5 years ago

I want to add dependency parsing for Thai to PyThaiNLP, but I don't have a Thai treebank corpus to train a dependency parser on.

wannaphong commented 5 years ago

Testing with Parallel Universal Dependencies: https://github.com/wannaphongcom/thai-dep

KoichiYasuoka commented 4 years ago

I've just tried to make a dependency parser for Thai with pythainlp and UDPipe, and released it as spaCy-Thai. I'm not sure whether its outputs are precise, but the word segmentations seem to differ between pythainlp and UD_Thai-PUD. Umm...
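A minimal usage sketch (assuming you install it with pip install spacy_thai and that spacy_thai.load() builds the pipeline, as in its README; the example sentence is only an illustration):

import spacy_thai

# Build the pipeline (pythainlp tokenizer plus the UDPipe tagger/parser).
nlp = spacy_thai.load()
doc = nlp('ฝนตกที่ทะเล')
for token in doc:
    # index, surface form, UPOS tag, dependency relation, and head index
    print(token.i, token.text, token.pos_, token.dep_, token.head.i)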

korakot commented 4 years ago

As I understand it, spaCy-Thai combines three parts into a pipeline: word tokenization, POS tagging, and dependency parsing.

The word-length difference comes from the first part. We may need to use a custom word list so that the tokenization produces the same word lengths as the treebank. Maybe you can use pythainlp tokenization directly instead of going through spaCy.
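For example, a minimal sketch of calling pythainlp's tokenizer directly with an extended word list (word_tokenize's custom_dict parameter and dict_trie are from pythainlp 2.x; the added compound is only an illustration):

from pythainlp.corpus import thai_words
from pythainlp.tokenize import word_tokenize
from pythainlp.util import dict_trie

# Extend the default dictionary so segmentation matches the treebank's word lengths.
custom_words = set(thai_words())
custom_words.add('ฝนตก')  # add whatever the treebank treats as a single token
trie = dict_trie(dict_source=custom_words)

print(word_tokenize('ฝนตกที่ทะเล', custom_dict=trie, engine='newmm'))
# e.g. ['ฝนตก', 'ที่', 'ทะเล'] once 'ฝนตก' is in the custom dictionary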

wannaphong commented 4 years ago

We have a part-of-speech tagging model trained on the Parallel Universal Dependencies (PUD) treebank: pos_tag(words, corpus="pud"), see https://www.thainlp.org/pythainlp/docs/2.2/api/tag.html#pythainlp.tag.pos_tag

I think that would be appropriate for your job.
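For example (a minimal sketch; the sentence is only an illustration):

from pythainlp.tag import pos_tag
from pythainlp.tokenize import word_tokenize

words = word_tokenize('ฝนตกที่ทะเล')
# Tag with the model trained on the Parallel Universal Dependencies (PUD) treebank;
# it outputs Universal POS tags (NOUN, VERB, ADP, ...).
print(pos_tag(words, corpus='pud'))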

KoichiYasuoka commented 4 years ago

In th_pud-ud-orchid.conllu of spaCy-Thai I use both the PUD-derived UPOS tags and the pythainlp-derived ORCHID tags. As shown in train.sh, I use the hyper-parameters embedding_upostag=10 for the UPOS tags and embedding_xpostag=30 for the ORCHID tags. I've tried some other settings, but both PUD-only and ORCHID-only were worse.

wannaphong commented 4 years ago

> In th_pud-ud-orchid.conllu of spaCy-Thai I use both the PUD-derived UPOS tags and the pythainlp-derived ORCHID tags. As shown in train.sh, I use the hyper-parameters embedding_upostag=10 for the UPOS tags and embedding_xpostag=30 for the ORCHID tags. I've tried some other settings, but both PUD-only and ORCHID-only were worse.

Have you tried the LST20 corpus (tags mapped to Universal POS tags) yet? pos_tag(words, corpus="lst20_ud") (PyThaiNLP 2.2.4+ only): https://www.thainlp.org/pythainlp/docs/2.2/api/tag.html#pythainlp.tag.pos_tag (see the sketch below)

I haven't sent any new tags to spaCy yet.
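A minimal sketch of the lst20_ud tagger (corpus='lst20_ud' needs PyThaiNLP 2.2.4 or newer; the sentence is only an illustration):

from pythainlp.tag import pos_tag
from pythainlp.tokenize import word_tokenize

words = word_tokenize('ฝนตกที่ทะเล')
# LST20 tags mapped to Universal POS tags (requires PyThaiNLP 2.2.4+).
print(pos_tag(words, corpus='lst20_ud'))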

KoichiYasuoka commented 4 years ago

> Have you tried the LST20 corpus (tags mapped to Universal POS tags) yet?

No, I have not. I first started from tag_map.py of spacy.lang.th, so I've been using ORCHID and UPOS. Do you have a plan to include an LST20 tag_map in an official release of spaCy in the future?

wannaphong commented 4 years ago

>> Have you tried the LST20 corpus (tags mapped to Universal POS tags) yet?
>
> No, I have not. I first started from tag_map.py of spacy.lang.th, so I've been using ORCHID and UPOS. Do you have a plan to include an LST20 tag_map in an official release of spaCy in the future?

I'm working on this pull request: https://github.com/explosion/spaCy/pull/6163

wannaphong commented 4 years ago

@KoichiYasuoka They're phasing out tag maps in the core spaCy library in v3: https://github.com/explosion/spaCy/pull/6163#issuecomment-704803365
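As I understand the v3 migration notes, the replacement for a tag map is the attribute_ruler pipe. A minimal sketch of loading a TAG-to-POS mapping there (the two ORCHID-style tags below are only illustrative, not an official Thai tag map):

import spacy

nlp = spacy.blank('th')
ruler = nlp.add_pipe('attribute_ruler')
# Load a v2-style tag map (fine-grained TAG -> coarse POS) into the AttributeRuler.
tag_map = {
    'NCMN': {'POS': 'NOUN'},  # illustrative ORCHID-style tag
    'VACT': {'POS': 'VERB'},  # illustrative ORCHID-style tag
}
ruler.load_from_tag_map(tag_map)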

KoichiYasuoka commented 4 years ago

I'm not so familiar with the "v3" of spaCy, but I understand that spaCy's Token.tag_ and Token.pos_ are now independent of one another. I had already given up on using spaCy's train command for th_pud-ud-orchid.conllu, so... well... how will I change my spaCy-Thai when "v3" launches...

korakot commented 4 years ago

If there's a problem with the Thai tokenizer, you can use a custom tokenizer with your own word list. Here's how to change it in spaCy:

!pip install pythainlp
import pythainlp
from pythainlp.corpus import ttc

# create a custom tokenizer from the TTC word list (you can add your own words to it)
min_words = [w for w, _ in ttc.word_freqs()]
tok = pythainlp.Tokenizer(min_words)

from spacy.lang.th import Thai
nlp = Thai()
nlp.tokenizer.word_tokenize = tok.word_tokenize  # swap in the custom tokenizer
list(nlp('ฝนตกที่ทะเล'))
# ['ฝน', 'ตก', 'ที่', 'ทะเล']  because 'ฝนตก' is not in min_words

wannaphong commented 3 years ago

I will close this issue and add a recommendation to use spaCy-Thai for Thai dependency parsing to the PyThaiNLP documentation.

Thank you @KoichiYasuoka for the dependency parser.

wannaphong commented 3 years ago

Done.

Notebook: https://github.com/PyThaiNLP/tutorials/blob/master/source/notebooks/Thai_Dependency_Parser.ipynb
Website: https://pythainlp.github.io/tutorials/notebooks/Thai_Dependency_Parser.html