PyThaiNLP / pythainlp

Thai Natural Language Processing in Python.
https://pythainlp.org/
Apache License 2.0
936 stars, 272 forks

Add Thai word list from ICU BreakIterator dictionary #879

Closed pavaris-pm closed 7 months ago

pavaris-pm commented 7 months ago

What does this change

@wannaphong @bact Following issue #877: since ICU is included in almost all web browsers, I've added the ICU dictionary to PyThaiNLP. The dictionary file is named icubrk_th.txt, and the Python file that loads the corpus is named thai_icu.py krub.

Will resolve #877

pep8speaks commented 7 months ago

Hello @pavaris-pm! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2023-12-06 10:17:06 UTC

wannaphong commented 7 months ago

Hello! Thank you for your pull request. Can you add a filter for words that start with #?

pavaris-pm commented 7 months ago

Sure. Do you mean adding a parameter that lets the user control whether the returned corpus includes words starting with #? If it is True, the returned corpus includes words starting with #; otherwise, those words are filtered out (no word starting with # remains in the corpus).

wannaphong commented 7 months ago

Yes 👍

bact commented 7 months ago

I think we can do this in get_corpus().

Maybe add the boolean parameter discard_comments to get_corpus()? The default is probably False.
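A minimal sketch of what such a flag could do, for illustration only. The helper name load_word_list and the sample lines are hypothetical; this is not PyThaiNLP's actual get_corpus() implementation, only the filtering idea with the same default of False.

```python
from typing import FrozenSet, Iterable


def load_word_list(
    lines: Iterable[str], discard_comments: bool = False
) -> FrozenSet[str]:
    """Build a word set from raw dictionary lines (hypothetical helper)."""
    words = set()
    for line in lines:
        word = line.strip()
        if not word:
            continue  # skip blank lines
        if discard_comments and word.startswith("#"):
            continue  # skip entries that start with '#'
        words.add(word)
    return frozenset(words)


# Made-up sample input: one comment line, two words, one blank line.
sample = ["# Thai word list (sample header)", "กิน", "", "นอน"]
with_comments = load_word_list(sample)
without_comments = load_word_list(sample, discard_comments=True)
```

With the default, the comment line is kept as an entry, so existing callers see no change in behavior; passing discard_comments=True drops it.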

Or we can use shlex from the Python standard library for this. shlex can ignore comment lines when it reads its input.

https://docs.python.org/3/library/shlex.html
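As a rough sketch of the shlex option, shlex.split() accepts comments=True, which makes it drop everything from a # to the end of the line. The sample text below is made up. Note that shlex also splits on whitespace and interprets quotes and backslashes, so a word list containing those characters would need extra care.

```python
import shlex

# Made-up sample: a comment line, a plain word, and a word with a trailing comment.
raw_text = "# sample header comment\nกิน\nนอน # trailing note\n"

# comments=True keeps '#' as the comment character; tokens are split on whitespace.
words = shlex.split(raw_text, comments=True)
```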

pavaris-pm commented 7 months ago

@bact @wannaphong I have added comment filtering through a new parameter named discard_comments, which defaults to False. You can review the code in the latest commit krub.

pavaris-pm commented 7 months ago

@bact @wannaphong I ran some experiments to test the discard_comments parameter and fixed some bugs in it. Now it works perfectly. Feel free to review it krub. It's done 💯

sonarcloud[bot] commented 7 months ago

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: no coverage information
Duplication: 0.0%

bact commented 7 months ago

Merged, thank you.