PyThaiNLP / pythainlp

Thai Natural Language Processing in Python.
https://pythainlp.org/
Apache License 2.0
975 stars, 272 forks

[TODO]: Train Part-of-speech corpus with transformers #838

Closed: wannaphong closed this issue 11 months ago

wannaphong commented 1 year ago

Today, PyThaiNLP uses a perceptron tagger. It still gives the best score on the Blackboard Treebank test set (https://pythainlp.github.io/Model-Cards/Part%20of%20speech/#blackboard-perceptron), but most people want to use transformers.

I think it would be good if part-of-speech tagging could use a transformer model.

List Model:

Docs: https://huggingface.co/learn/nlp-course/chapter7/2?fw=tf

List corpus:

pavaris-pm commented 11 months ago

@wannaphong Are you currently working on this? If not, I can help train the POS tagging model using a transformer with the components listed here. I will then ask for your review once it is finished, to see what we can improve further. What do you think?

wannaphong commented 11 months ago

@MpolaarbearM is training a model for the Blackboard Treebank. You can take the Orchid Corpus or UD Thai PUD.

Blackboard Treebank model using BERT: https://huggingface.co/lunarlist/pos_thai

List: https://nlpforthai.com/tasks/part-of-speech/

pavaris-pm commented 11 months ago

Thanks! I will go for the UD Thai PUD corpus and let you know when the model is finished.
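Fine-tuning a transformer on a word-level POS corpus such as UD Thai PUD usually requires aligning each word's tag with the subword tokens the tokenizer produces, as in the Hugging Face course linked earlier in this thread. The sketch below shows that alignment step only. The function name, the `label_all_subwords` flag, and the example label ids are illustrative assumptions, not the code used for the models in this issue. With a fast tokenizer, the `word_ids` list would typically come from `BatchEncoding.word_ids()`; here it is written by hand.

```python
from typing import List, Optional

# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def align_labels_with_tokens(
    word_labels: List[int],
    word_ids: List[Optional[int]],
    label_all_subwords: bool = False,
) -> List[int]:
    """Expand word-level label ids to token-level label ids.

    word_ids[i] is the index of the word that token i belongs to,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned: List[int] = []
    previous_word: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: never contributes to the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:
            # First subword of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords are either masked out or repeat the label.
            aligned.append(
                word_labels[word_id] if label_all_subwords else IGNORE_INDEX
            )
        previous_word = word_id
    return aligned


# Hypothetical label ids for three words tagged PRON, VERB, NOUN.
example_labels = [0, 1, 2]
# Hand-written word_ids: the second word is split into two subwords.
example_word_ids = [None, 0, 1, 1, 2, None]
example_aligned = align_labels_with_tokens(example_labels, example_word_ids)
```

Masking the later subwords means each word is scored once during training and evaluation, which keeps the reported accuracy comparable to a word-level tagger such as the perceptron.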

pavaris-pm commented 11 months ago

@wannaphong I have finished training transformers on a Thai part-of-speech corpus. As discussed, the models were trained on the UD Thai PUD corpus with Universal POS (UPOS) tags. All models have been uploaded to the Hugging Face Hub. My trained models are as follows:

  1. WangchanBERTa: an existing language model for Thai, which you listed in this issue's model list as one you wanted trained on the corpus. The training results are reported at https://huggingface.co/Pavarissy/wangchanberta-ud-thai-pud-upos

  2. DeBERTaV3: As of March 2023, DeBERTaV3 delivers impressive state-of-the-art performance on NLU benchmarks compared to other models. I am putting it forward for your consideration because its multilingual version (mDeBERTaV3), which supports Thai, achieved a better score on the UD Thai PUD corpus as well. You can check its training results at https://huggingface.co/Pavarissy/mdeberta-v3-ud-thai-pud-upos

P.S. Both models were trained on the specified corpus, and any improvements to them can be discussed from now on. Since they are public models on the Hugging Face Hub, I can help integrate them into PyThaiNLP if you want.
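For anyone who wants to try the checkpoints before they are integrated, a minimal sketch using the Hugging Face `pipeline` API could look like the following. It assumes `transformers` and a backend such as PyTorch are installed, and that the checkpoint's config maps label ids to UPOS names. The helper names `to_token_tags` and `tag_text` are hypothetical and are not part of PyThaiNLP.

```python
from typing import Dict, List, Tuple

# Model id taken from the comment above; swap in the WangchanBERTa id to compare.
MODEL_ID = "Pavarissy/mdeberta-v3-ud-thai-pud-upos"


def to_token_tags(predictions: List[Dict]) -> List[Tuple[str, str]]:
    """Reduce raw token-classification output to (token, tag) pairs.

    Without an aggregation strategy, each prediction is a dict that
    includes the keys "word" (the subword token) and "entity" (the label).
    """
    return [(p["word"], p["entity"]) for p in predictions]


def tag_text(text: str, model_id: str = MODEL_ID) -> List[Tuple[str, str]]:
    """Tag raw text with a Hub checkpoint (downloads the model on first use)."""
    # Imported lazily so to_token_tags can be used without transformers installed.
    from transformers import pipeline

    tagger = pipeline("token-classification", model=model_id)
    return to_token_tags(tagger(text))
```

Calling `tag_text("ฉันกินข้าว")` would return one pair per subword token, not per word, so the tokens still need to be merged back into words before the result can be compared with a word-level tagger.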

What do you think?

MpolaarbearM commented 11 months ago

@pavaris-pm Hi! I can integrate your model into the new POS tagging function, since I am currently working on mine and have nearly finished it.

pavaris-pm commented 11 months ago

@MpolaarbearM Great to hear that! However, I have trained two POS tagging models. Which one will be integrated? Do we need a decision from @wannaphong?

MpolaarbearM commented 11 months ago

We can wait for approval, but the integration method is the same for both models, so I'll do both of them for now.

pavaris-pm commented 11 months ago

Thanks for your help. After approval, please let me know when it is integrated 👍🏻