PyThaiNLP / pythainlp

Thai Natural Language Processing in Python.
https://pythainlp.org/
Apache License 2.0

Update `pos_tag_transformers` function #865

Closed pavaris-pm closed 10 months ago

pavaris-pm commented 10 months ago

What does this change

Following #866, I've updated the pos_tag_transformers function: cleaned up the code, added a docstring, fixed a deprecation, and changed the output format to match the other taggers in PyThaiNLP.

What was wrong

In #857, pos_tag_transformers was added with 3 models. However, to select an engine, the caller had to specify the model's full name, and the output format still differed from the other taggers. For example:

pos_tag_transformers(words="แมวทำอะไรตอนห้าโมงเช้า", engine="bert-base-th-cased-blackboard")
# outputs
# [{'entity_group': 'NN', 'score': 0.910759, 'word': 'แมวมา', 'start': 0, 'end': 5},
#  {'entity_group': 'VV', 'score': 0.9462489, 'word': '##ทำ', 'start': 5, 'end': 7},
#  {'entity_group': 'NN', 'score': 0.8325567, 'word': '##อะไรตอนห้าโมงเช้า', 'start': 7, 'end': 24}]

The full model name is hard for a typical user to remember, and it will clutter the internal code if more transformers models trained on new corpora are added: we would end up with a long chain of if-else conditions just to select a model.
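One common way to avoid that kind of if-else chain is a lookup table keyed by short engine and corpus names. The following is only a minimal sketch of the idea: the `_MODELS` table, the `resolve_model` helper, and the model identifiers in it are hypothetical placeholders, not the names used in PyThaiNLP.

```python
# Hypothetical registry: keys and model identifiers are placeholders for
# illustration only, not the values PyThaiNLP actually uses.
_MODELS = {
    ("bert", "blackboard"): "example-org/bert-pos-blackboard",
    ("mdeberta", "pud"): "example-org/mdeberta-pos-pud",
}


def resolve_model(engine: str, corpus: str) -> str:
    """Map a short (engine, corpus) pair to a full model identifier."""
    try:
        return _MODELS[(engine, corpus)]
    except KeyError:
        # List the supported pairs so the caller can correct the call.
        supported = ", ".join(f"{e}/{c}" for e, c in sorted(_MODELS))
        raise ValueError(
            f"unsupported engine/corpus pair: {engine}/{corpus} "
            f"(supported: {supported})"
        ) from None
```

With this shape, adding a model trained on a new corpus means adding one entry to the table rather than another branch.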

How this fixes it

I've cleaned up the code so that a user selects a model with the engine and corpus parameters, the same as in the existing pos_tag and pos_tag_sents functions, and I've fixed the output format. This removes the need to remember the full model name. Here is the new version:

from pythainlp.tag import pos_tag_transformers
txt = pos_tag_transformers(sentence="แมวทำอะไรตอนห้าโมงเช้า", engine="mdeberta", corpus='pud')
# outputs
# [[('แมว', 'NOUN'), ('ทําอะไร', 'VERB'), ('ตอนห้าโมงเช้า', 'NOUN')]]
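For illustration, the change in output shape (a list of dicts becoming a list of sentences of (word, tag) tuples) could be sketched roughly as below. This is an assumption about the general approach, not the code in this PR: `to_tagged_sentences` is a hypothetical helper, stripping the `##` subword prefix is a guess at the intended cleanup, and the sample input is made up.

```python
from typing import Dict, List, Tuple


def to_tagged_sentences(entities: List[Dict]) -> List[List[Tuple[str, str]]]:
    """Convert token-classification dicts into [[(word, tag), ...]]."""
    tagged = []
    for entity in entities:
        word = entity["word"]
        # Assumption: a leading "##" is a subword marker and should be dropped.
        if word.startswith("##"):
            word = word[2:]
        # The tag is copied as-is; other keys such as score/start/end are ignored.
        tagged.append((word, entity["entity_group"]))
    # Wrap in an outer list so the result is a list of sentences.
    return [tagged]


# Made-up sample input in the old output format (only the keys used here).
raw = [
    {"entity_group": "NN", "word": "แมว"},
    {"entity_group": "VV", "word": "##กิน"},
]
result = to_tagged_sentences(raw)
```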

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

pep8speaks commented 10 months ago

Hello @pavaris-pm! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 183:19: W291 trailing whitespace
Line 186:2: E225 missing whitespace around operator
Line 193:101: E501 line too long (135 > 100 characters)
Line 194:101: E501 line too long (122 > 100 characters)
Line 196:101: E501 line too long (140 > 100 characters)
Line 226:15: E203 whitespace before ':'
Line 230:24: E203 whitespace before ':'
Line 231:19: E203 whitespace before ':'
Line 249:101: E501 line too long (107 > 100 characters)
Line 253:21: W292 no newline at end of file

Line 378:80: W291 trailing whitespace
Line 379:63: W292 no newline at end of file

sonarcloud[bot] commented 10 months ago

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No Coverage information
0.0% Duplication