Add Thai word list from ICU BreakIterator dictionary

pavaris-pm commented 7 months ago

What does this change

@wannaphong @bact Following up on issue #877: since ICU is included in almost all web browsers, I've added the ICU dictionary to PyThaiNLP. The dictionary file is named icubrk_th.txt, and the Python file that loads the corpus is named thai_icu.py krub.

Will resolve #877

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

[x] Passed code styles and structures
[ ] Passed code linting checks and unit test

pep8speaks commented 7 months ago

Hello @pavaris-pm! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2023-12-06 10:17:06 UTC

wannaphong commented 7 months ago

Hello! Thank you for your pull request. Can you add a filter for words that start with #?

pavaris-pm commented 7 months ago

Sure. Do you mean adding a parameter that lets the user control whether the returned corpus includes words starting with #? If it is True, the returned corpus includes words that start with #; otherwise those words are filtered out, so no word in the corpus starts with #.

wannaphong commented 7 months ago

Yes 👍

bact commented 7 months ago

I think we can do this in get_corpus().

Maybe add the boolean parameter discard_comments to get_corpus()? The default is probably False.

Or we can use shlex from the Python standard library for this. shlex can ignore comment lines when it reads its input.

https://docs.python.org/3/library/shlex.html
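
A minimal sketch of the discard_comments idea discussed above, for illustration only. The helper name read_word_list and the sample data are hypothetical; this is not PyThaiNLP's actual get_corpus() implementation, and it assumes a "comment" is any line whose first non-whitespace character is #.

```python
from typing import Iterable, List


def read_word_list(lines: Iterable[str], discard_comments: bool = False) -> List[str]:
    """Hypothetical helper: return stripped, non-empty lines.

    When discard_comments is True, lines that start with '#' are dropped.
    """
    words = []
    for line in lines:
        word = line.strip()
        if not word:
            continue  # skip blank lines
        if discard_comments and word.startswith("#"):
            continue  # skip comment lines only when asked to
        words.append(word)
    return words


# Made-up sample data standing in for the lines of a word-list file.
sample = ["# Thai word list", "กิน", "", "นอน", "# end of list"]

kept_all = read_word_list(sample)                          # default keeps '#' lines
filtered = read_word_list(sample, discard_comments=True)   # drops '#' lines
```

Defaulting discard_comments to False keeps the existing behavior for callers that do not pass the new argument.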

pavaris-pm commented 7 months ago

@bact @wannaphong I've added comment filtering through a new parameter named discard_comments, whose default value is False. You can review the code in the latest commit krub.

pavaris-pm commented 7 months ago

@bact @wannaphong I ran some experiments to test the discard_comments parameter and fixed some bugs in it. It now works perfectly, so feel free to review it krub. It's done 💯