[TODO]: Train Part-of-speech corpus with transformers

wannaphong commented 1 year ago

Today, PyThaiNLP uses a perceptron tagger. It still gives the best score on the Blackboard Treebank test set (https://pythainlp.github.io/Model-Cards/Part%20of%20speech/#blackboard-perceptron), but most people want to use transformers.

I think it would be good for part-of-speech tagging to offer a transformer model.

Models:

huggingface.co/airesearch/wangchanberta-base-att-spm-uncased

Docs: https://huggingface.co/learn/nlp-course/chapter7/2?fw=tf

Corpora:

Blackboard treebank https://bitbucket.org/kaamanita/blackboard-treebank/src/master/
Parallel Universal Dependencies (PUD) treebanks https://github.com/UniversalDependencies/UD_Thai-PUD
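
For anyone picking this up: the token-classification recipe in the docs linked above needs word-level POS labels to be spread over subword tokens. Below is a minimal sketch of that alignment step, assuming a tokenizer that reports which word each subword came from (the `word_ids` list, with `None` for special tokens). The function name, the `label_all_subwords` flag, and the label ids in the example are hypothetical, not PyThaiNLP code.

```python
from typing import List, Optional

# -100 is the default ignore_index of PyTorch's cross-entropy loss,
# so positions carrying it do not contribute to training.
IGNORE_INDEX = -100


def align_labels_with_tokens(
    word_labels: List[int],
    word_ids: List[Optional[int]],
    label_all_subwords: bool = False,
) -> List[int]:
    """Map one POS label id per word onto a subword token sequence.

    word_ids[i] is the index of the word that token i belongs to,
    or None for special tokens such as <s> and </s>.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: never scored.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous or label_all_subwords:
            # First subword of a word (or every subword, if requested).
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword: masked out of the loss.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical example: two words with label ids 3 and 7,
# where the first word was split into two subwords.
example = align_labels_with_tokens([3, 7], [None, 0, 0, 1, None])
print(example)
```

Labelling only the first subword keeps evaluation at word level, which makes scores easier to compare with the perceptron tagger. Labelling every subword is the other common choice.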

pavaris-pm commented 1 year ago

@wannaphong are you currently working on this? If not, I can help train the POS tagging model with transformers using the components listed here, and then ask for your review once it is finished to see what we can improve further. What do you think?

wannaphong commented 1 year ago

@MpolaarbearM is training a model on the Blackboard Treebank. You can take the Orchid corpus or UD Thai PUD.

BERT-based Blackboard Treebank model: https://huggingface.co/lunarlist/pos_thai

List: https://nlpforthai.com/tasks/part-of-speech/

pavaris-pm commented 1 year ago

Thanks! I will take the UD Thai PUD corpus and let you know when the model is finished.

pavaris-pm commented 1 year ago

@wannaphong I have finished training transformers on a Thai part-of-speech corpus. The models were trained on the UD Thai PUD corpus with Universal POS (UPOS) tags. All of them are already uploaded to the Hugging Face Hub. My trained models are as follows:

WangchanBERTa: an existing language model for Thai, and the one you listed in this issue as the model to train. The training results are reported at https://huggingface.co/Pavarissy/wangchanberta-ud-thai-pud-upos
DeBERTaV3: As of March 2023, DeBERTaV3 delivers state-of-the-art performance on NLU benchmarks compared to other models. I am putting it forward for your consideration because its multilingual version (mDeBERTaV3), which supports Thai, achieved a better score on the UD Thai PUD corpus as well. You can check its training results at https://huggingface.co/Pavarissy/mdeberta-v3-ud-thai-pud-upos
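
For context on the UPOS setup: UD treebanks are distributed in CoNLL-U format, where each token line has 10 tab-separated columns and UPOS is the fourth. The following is a minimal sketch of reading (word, UPOS) pairs from such a file. It is not the exact preprocessing used for the models above; the function name and the two-token sample sentence are made up for illustration.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (word form, UPOS tag) pairs


def read_conllu_upos(lines: Iterable[str]) -> List[Sentence]:
    """Collect (FORM, UPOS) pairs per sentence from CoNLL-U lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            continue  # multiword token range or empty node
        current.append((form, upos))
    if current:
        sentences.append(current)  # file did not end with a blank line
    return sentences


# Hypothetical sample, not taken from UD Thai PUD.
SAMPLE = (
    "# text = ฉันกิน\n"
    "1\tฉัน\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tกิน\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = read_conllu_upos(SAMPLE.splitlines())
print(sentences)
```

The word lists and tag lists extracted this way are what a token-classification fine-tuning script would take as input, after mapping each tag string to an integer id.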

P.S. Both models were trained on the corpus specified above, and any improvements can be discussed from now on. Since they are public models on the Hugging Face Hub, I can help if you want to integrate them into PyThaiNLP.

What do you think?

MpolaarbearM commented 1 year ago

@pavaris-pm Hi! I can integrate your model into the new POS tagging function, since I am currently working on mine and have nearly finished it.

pavaris-pm commented 1 year ago

@MpolaarbearM Great to hear that! However, I have trained two POS tagging models. Which one will be integrated? Do we need input from @wannaphong?

MpolaarbearM commented 1 year ago

We can wait for approval. But the integration method is the same for both, so I'll do both of them for now.

pavaris-pm commented 1 year ago

Thanks for your help. After approval, please let me know when it is integrated 👍🏻

PyThaiNLP / pythainlp

[TODO]: Train Part-of-speech corpus with transformers #838