Open wannaphong opened 2 years ago
I propose to remove tinydb
as an external dependency (as mentioned in #506). It seems to me that using a 3rd-party database library is an overkill for just querying a JSON file which is by no means large in size.
tinydb
is only used twice, once in init.py for database initialization, which could be simply replaced by creating an empty JSON file, and once in core.py for querying the local database, which could be done using json
in the Python standard library instead.
And it is confusing that while tinydb
is used to query the local JSON database, it is not used when parsing JSON data returned from https://pythainlp.github.io/pythainlp-corpus/db.json (see core.py).
The performance improvement gained by using tinydb
or other 3rd-party database libraries is, I suppose, to be negligible, but to have one more unnecessary external dependency is not a good idea, I think, for future development and maintenance of the project. Instead, JSON querying could be easily done using json
in the standard library.
What is your opinion? If you agree, I would be happy to work on this and send an PR.
I propose to remove
tinydb
as an external dependency (as mentioned in #506). It seems to me that using a 3rd-party database library is an overkill for just querying a JSON file which is by no means large in size.
tinydb
is only used twice, once in init.py for database initialization, which could be simply replaced by creating an empty JSON file, and once in core.py for querying the local database, which could be done usingjson
in the Python standard library instead.And it is confusing that while
tinydb
is used to query the local JSON database, it is not used when parsing JSON data returned from https://pythainlp.github.io/pythainlp-corpus/db.json (see core.py).The performance improvement gained by using
tinydb
or other 3rd-party database libraries is, I suppose, to be negligible, but to have one more unnecessary external dependency is not a good idea, I think, for future development and maintenance of the project. Instead, JSON querying could be easily done usingjson
in the standard library.What is your opinion? If you agree, I would be happy to work on this and send an PR.
I am agree.
When working on #691, I found a problem relating to the version checking behavior of the corpus downloader.
The downloader checks both the name and the version here, so later if the corpus is found in the local database and a re-download is not forced, the versions should always match.
I'm not sure what the expected behavior is here. If a corpus with the same name but different version is found in local database (and force = True
is not specified), should the downloader suggest the user to use force = True
or just silently re-download and rewrite the latest version of the corpus?
Yes, It is not force download that you can see https://github.com/PyThaiNLP/pythainlp/runs/7957249794?check_suite_focus=true#step:5:5219 and https://github.com/PyThaiNLP/pythainlp/blob/dev/tests/test_corpus.py#L86.
@wannaphong I mean that since both the name and the version of the corpus is checked, the else block would never be reached, so the user will never be notified that a newer version of the requested corpus is available.
That is, in cases that the corpus is found in the local database, but the version do not match, the latest version of the corpus would be silently downloaded and rewritten. If this is indeed the expected behavior, the else block would be redundant. If the user should be notified in these cases, the checking logic should be modified.
OK. I'm agree.
@wannaphong So what is the expected behavior? If the user should be notified to use force = True
when newer versions of the corpus are available, I could work on a patch to fix it.
@wannaphong So what is the expected behavior? If the user should be notified to use
force = True
when newer versions of the corpus are available, I could work on a patch to fix it.
I'm agree. The user should get notified when newer versions are available.
@wannaphong Should be fixed in #692, please review the PR.
I'm doing reduce import time. #719
PyThaiNLP wants you help us to improve the performance. You can fork this git, coding new code to improve the performance, and send your pull request to PyThaiNLP.
These are some lists for you.