PyThaiNLP / pythainlp

Thai Natural Language Processing in Python.
https://pythainlp.org/
Apache License 2.0
975 stars 272 forks source link

Improve PyThaiNLP performance #685

Open wannaphong opened 2 years ago

wannaphong commented 2 years ago

PyThaiNLP wants you help us to improve the performance. You can fork this git, coding new code to improve the performance, and send your pull request to PyThaiNLP.

These are some lists for you.

BLKSerene commented 2 years ago

I propose to remove tinydb as an external dependency (as mentioned in #506). It seems to me that using a 3rd-party database library is an overkill for just querying a JSON file which is by no means large in size.

tinydb is only used twice, once in init.py for database initialization, which could be simply replaced by creating an empty JSON file, and once in core.py for querying the local database, which could be done using json in the Python standard library instead.

And it is confusing that while tinydb is used to query the local JSON database, it is not used when parsing JSON data returned from https://pythainlp.github.io/pythainlp-corpus/db.json (see core.py).

The performance improvement gained by using tinydb or other 3rd-party database libraries is, I suppose, to be negligible, but to have one more unnecessary external dependency is not a good idea, I think, for future development and maintenance of the project. Instead, JSON querying could be easily done using json in the standard library.

What is your opinion? If you agree, I would be happy to work on this and send an PR.

wannaphong commented 2 years ago

I propose to remove tinydb as an external dependency (as mentioned in #506). It seems to me that using a 3rd-party database library is an overkill for just querying a JSON file which is by no means large in size.

tinydb is only used twice, once in init.py for database initialization, which could be simply replaced by creating an empty JSON file, and once in core.py for querying the local database, which could be done using json in the Python standard library instead.

And it is confusing that while tinydb is used to query the local JSON database, it is not used when parsing JSON data returned from https://pythainlp.github.io/pythainlp-corpus/db.json (see core.py).

The performance improvement gained by using tinydb or other 3rd-party database libraries is, I suppose, to be negligible, but to have one more unnecessary external dependency is not a good idea, I think, for future development and maintenance of the project. Instead, JSON querying could be easily done using json in the standard library.

What is your opinion? If you agree, I would be happy to work on this and send an PR.

I am agree.

BLKSerene commented 2 years ago

When working on #691, I found a problem relating to the version checking behavior of the corpus downloader.

The downloader checks both the name and the version here, so later if the corpus is found in the local database and a re-download is not forced, the versions should always match.

I'm not sure what the expected behavior is here. If a corpus with the same name but different version is found in local database (and force = True is not specified), should the downloader suggest the user to use force = True or just silently re-download and rewrite the latest version of the corpus?

wannaphong commented 2 years ago

Yes, It is not force download that you can see https://github.com/PyThaiNLP/pythainlp/runs/7957249794?check_suite_focus=true#step:5:5219 and https://github.com/PyThaiNLP/pythainlp/blob/dev/tests/test_corpus.py#L86.

BLKSerene commented 2 years ago

@wannaphong I mean that since both the name and the version of the corpus is checked, the else block would never be reached, so the user will never be notified that a newer version of the requested corpus is available.

That is, in cases that the corpus is found in the local database, but the version do not match, the latest version of the corpus would be silently downloaded and rewritten. If this is indeed the expected behavior, the else block would be redundant. If the user should be notified in these cases, the checking logic should be modified.

wannaphong commented 2 years ago

OK. I'm agree.

BLKSerene commented 2 years ago

@wannaphong So what is the expected behavior? If the user should be notified to use force = True when newer versions of the corpus are available, I could work on a patch to fix it.

wannaphong commented 2 years ago

@wannaphong So what is the expected behavior? If the user should be notified to use force = True when newer versions of the corpus are available, I could work on a patch to fix it.

I'm agree. The user should get notified when newer versions are available.

BLKSerene commented 2 years ago

@wannaphong Should be fixed in #692, please review the PR.

wannaphong commented 1 year ago

I'm doing reduce import time. #719