KoichiYasuoka / spaCy-Thai

Dependency parser on Thai language
MIT License

Can you train Thai Treebanks Dataset? #1

Open wannaphong opened 3 years ago

wannaphong commented 3 years ago

I found a Thai treebank dataset. The thtb_orchidpp.txt file is a treebank built from the ORCHID corpus, but it is not in CoNLL-U format.

KoichiYasuoka commented 3 years ago

Umm... The dataset seems to use something like a phrase-structure format. For example, the first line

[S [NP [FIXN การ]] [VP [VACT ประชุม] [PP [RPRE ทาง] [NP [NCMN วิชาการ] [PUNC <space>] [NP [NCMN ครั้ง] [DONM ที่ 1]]]]]]

denotes the phrase tree as shown below.

phrase tree

I trained spaCy-Thai on dependency trees, which are very different from phrase trees...
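
To make the bracket notation above easier to inspect, here is a minimal sketch of a reader for it. It assumes every node is written as `[LABEL child ...]` and that words never contain square brackets; the names `parse_brackets`, `leaves`, and `SAMPLE` are made up for this illustration and are not part of spaCy-Thai or of the dataset's tooling.

```python
import re

# First line of the treebank, as quoted in this thread.
SAMPLE = ("[S [NP [FIXN การ]] [VP [VACT ประชุม] [PP [RPRE ทาง] "
          "[NP [NCMN วิชาการ] [PUNC <space>] [NP [NCMN ครั้ง] [DONM ที่ 1]]]]]]")


def parse_brackets(text):
    """Parse '[LABEL child ...]' into nested (label, children) tuples.

    Children are either nested tuples or plain strings (words).
    """
    # A token is '[', ']' or a run of non-bracket, non-space characters.
    tokens = re.findall(r"\[|\]|[^\[\]\s]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "[":
            raise ValueError(f"expected '[' at token {pos}")
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ']'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing input after the first tree")
    return tree


def leaves(tree):
    """Return (tag, word) pairs in surface order."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        # Preterminal: join multi-part words such as 'ที่ 1'.
        return [(label, " ".join(children))]
    out = []
    for child in children:
        if isinstance(child, str):
            out.append((label, child))
        else:
            out.extend(leaves(child))
    return out


tree = parse_brackets(SAMPLE)
for tag, word in leaves(tree):
    print(tag, word)
```

Note that the treebank keeps `ที่ 1` as one `DONM` leaf and includes a `PUNC <space>` leaf, so the leaf sequence does not line up one-to-one with the tokens a dependency parser would use.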

KoichiYasuoka commented 3 years ago
# text = การประชุมทางวิชาการ ครั้งที่ 1
1   การ _   PART    FIXN    _   0   root    _   SpaceAfter=No
2   ประชุม  _   VERB    VACT    _   1   acl _   SpaceAfter=No
3   ทาง _   ADP RPRE    _   4   case    _   SpaceAfter=No
4   วิชาการ _   NOUN    NCMN    _   2   obl _   _
5   ครั้ง   _   NOUN    NCMN    _   1   list    _   SpaceAfter=No
6   ที่ _   DET PREL    _   7   det _   _
7   1   _   NUM DCNM    _   5   nummod  _   SpaceAfter=No

On the other hand, the dependency tree is visualized as:

dependency tree
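
For readers who cannot see the image, the arcs can be read straight off the CoNLL-U rows above. The following is a minimal sketch, not the spaCy-Thai loader: `read_conllu` and `arcs` are hypothetical helper names, and the code splits on any whitespace because the rows as pasted here no longer contain tabs (real CoNLL-U is tab-separated, and a form may contain spaces).

```python
from collections import namedtuple

Token = namedtuple(
    "Token", "id form lemma upos xpos feats head deprel deps misc"
)

# The rows quoted in this thread, with whitespace as pasted.
CONLLU = """\
# text = การประชุมทางวิชาการ ครั้งที่ 1
1   การ _   PART    FIXN    _   0   root    _   SpaceAfter=No
2   ประชุม  _   VERB    VACT    _   1   acl _   SpaceAfter=No
3   ทาง _   ADP RPRE    _   4   case    _   SpaceAfter=No
4   วิชาการ _   NOUN    NCMN    _   2   obl _   _
5   ครั้ง   _   NOUN    NCMN    _   1   list    _   SpaceAfter=No
6   ที่ _   DET PREL    _   7   det _   _
7   1   _   NUM DCNM    _   5   nummod  _   SpaceAfter=No
"""


def read_conllu(text):
    """Return the word lines of one sentence as Token tuples."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split()
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tokens.append(Token(int(cols[0]), *cols[1:6], int(cols[6]), *cols[7:]))
    return tokens


def arcs(tokens):
    """Return (head form, relation, dependent form) triples."""
    form = {t.id: t.form for t in tokens}
    form[0] = "ROOT"  # head index 0 marks the root of the sentence
    return [(form[t.head], t.deprel, t.form) for t in tokens]


for head, rel, dep in arcs(read_conllu(CONLLU)):
    print(f"{head} -[{rel}]-> {dep}")
```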

KoichiYasuoka commented 3 years ago

Well, how do we convert the phrase structure and the dependency tree into one another, @wannaphong?

wannaphong commented 3 years ago

> Well, how do we convert the phrase structure and the dependency tree into one another, @wannaphong?

Sorry, I do not know because it is beyond the scope of my expertise. I think @korakot should help with this.

korakot commented 3 years ago

It's possible in theory. A constituency tree can be converted to a dependency tree with no ambiguity. For example:

VP = V + NP can be converted to V -[dobj]-> NP

But there's no package or library to do it for Thai (or for many other languages). You may need to convert them one by one.

You can search Google to find some papers and one GitHub repository on this. https://www.google.com/search?q=convert+constituency+tree+to+dependency+tree
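
As a rough illustration of the `VP = V + NP` rule, here is a minimal head-rule sketch on a toy English tree. Only the `("VP", "NP") -> dobj` entry comes from the comment above. The tables `HEAD_CHILD` and `RELATION`, the function `to_dependencies`, the `det` entry, and the `dep` fallback are all assumptions made up for this example; a real converter for the ORCHID tag set would need its own rules.

```python
# Which child label supplies the head word of each phrase (assumed rules).
HEAD_CHILD = {"VP": "V", "NP": "N"}

# Relation from the phrase head to a non-head child, keyed by
# (phrase label, child label). Only the dobj entry is from the thread.
RELATION = {("VP", "NP"): "dobj", ("NP", "DET"): "det"}


def to_dependencies(tree, arcs=None):
    """Return (head_word, arcs) for a (label, body) tree.

    body is a word (str) for a preterminal, or a list of subtrees.
    Arcs are (head_word, relation, dependent_word) triples. Words stand
    in for token positions here, so repeated words would be ambiguous.
    """
    if arcs is None:
        arcs = []
    label, body = tree
    if isinstance(body, str):
        return body, arcs  # a single word is its own head
    # Resolve every child first, collecting arcs from inside each one.
    heads = [(child[0], to_dependencies(child, arcs)[0]) for child in body]
    wanted = HEAD_CHILD.get(label)
    # Fall back to the first child when no head rule matches.
    head_index = next(
        (i for i, (lab, _) in enumerate(heads) if lab == wanted), 0
    )
    head_word = heads[head_index][1]
    for i, (lab, word) in enumerate(heads):
        if i != head_index:
            arcs.append((head_word, RELATION.get((label, lab), "dep"), word))
    return head_word, arcs


TREE = ("VP", [("V", "eat"), ("NP", [("DET", "the"), ("N", "rice")])])
head, arcs = to_dependencies(TREE)
for h, rel, d in arcs:
    print(f"{h} -[{rel}]-> {d}")
# rice -[det]-> the
# eat -[dobj]-> rice
```

The tables are where the per-language work goes: each phrase label needs a head rule, and each (phrase, child) pair needs a relation.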

KoichiYasuoka commented 3 years ago

> VP = V + NP can be converted to V -[dobj]-> NP

Oh, it looks very nice. But I'm not sure whether S = NP + VP should be converted into NP <-[nsubj]- VP or NP <-[vocative]- VP or NP -[acl]-> VP...

korakot commented 3 years ago

For S = NP + VP, we need to look inside the NP and the VP to know which [rel] it is. It's not ambiguous, though. You need to write a few if-then cases on PoS and word groups. It's a bit labor-intensive to list all cases.
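
The if-then cases could look something like the sketch below. It is only an illustration: the function `s_relation` and the flag `np_is_address_term` are hypothetical, the `FIXN` case mirrors the single example in this thread (การ as root with ประชุม as `acl`), and the vocative and default cases are assumptions rather than tested rules for Thai.

```python
def s_relation(np_head_tag, np_is_address_term=False):
    """Choose the arc for S = NP + VP.

    Returns (head_side, relation): head_side says which constituent
    supplies the head word, and relation labels the arc to the other one.
    """
    if np_head_tag == "FIXN":
        # Nominalizing prefix such as การ: the NP is the head and the
        # VP modifies it, i.e. NP -[acl]-> VP.
        return ("NP", "acl")
    if np_is_address_term:
        # Assumed case: the NP addresses someone, NP <-[vocative]- VP.
        return ("VP", "vocative")
    # Assumed default: the NP is the subject, NP <-[nsubj]- VP.
    return ("VP", "nsubj")


print(s_relation("FIXN"))                           # ('NP', 'acl')
print(s_relation("NCMN"))                           # ('VP', 'nsubj')
print(s_relation("NPRP", np_is_address_term=True))  # ('VP', 'vocative')
```

Deciding something like `np_is_address_term` is the labor-intensive part, since it depends on the words and tags inside the NP and VP.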
