tfardet opened 1 year ago
if you have ideas on how to do it, feel free to make a PR
I have some. @tfardet do you want to do it or do I assign myself?
Haven't had the time to work on it yet (and leaving for 15d). @tfardet if you have the time to give it a try before I'm back, don't hesitate :-)
@hadrilec just one more question. For testing purposes, I'm caching results by monkeypatching `.get` methods; that means this happens after the calls to `_wait_api_query_limit`. I see that `_request_insee` also handles the 429 error code (which, IMHO, should be enough to handle the bottleneck). Is there a technical reason to have both?
Do you think it would cause trouble to patch `_wait_api_query_limit` (and nullify its effect) while testing? If we adjust the caching timeout correctly, we should (mostly) rely on the caching backend (except for once every 15 days...).
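A minimal, stdlib-only sketch of that patching idea (the module layout below is an assumption for illustration, not pynsee's actual structure; in real tests pytest's `monkeypatch` fixture would do the attribute swap):

```python
import time
import types

# Toy stand-in for the pynsee internals; the real module path differs.
insee_utils = types.SimpleNamespace(
    _wait_api_query_limit=lambda query: time.sleep(2)  # the real one throttles to <30 req/min
)

def run_query(q):
    insee_utils._wait_api_query_limit(q)  # throttling happens before the HTTP call
    return f"result for {q}"

# In a test, replace the wait with a no-op so queries served from
# the cache return instantly.
insee_utils._wait_api_query_limit = lambda query: None

start = time.perf_counter()
result = run_query("GEO")
elapsed = time.perf_counter() - start  # no throttling delay anymore
```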
I think either you or @tfardet modified `_request_insee` in order to catch error 429. The main objective of `_wait_api_query_limit` is to compute the minimum waiting time required to avoid a 429 error: the function tracks when the last queries were made, then sleeps long enough for the number of queries made in the last 60 seconds to stay at or below 30. I don't know whether, when you make a new request while you have already made too many in the last minute, the API counts this new request towards the total. If it does and `_wait_api_query_limit` is disabled, the waiting time might be infinite, because `_request_insee` sleeps only 10 seconds while a longer time might be needed for the number of queries made in the last minute to drop below 30.
In a nutshell, if I understood correctly how this works, I would say that yes, we could indeed remove `_wait_api_query_limit`, but in that case either the minimum sleeping time should be computed (and `_wait_api_query_limit` comes back), or it should default to 60 seconds, which might mean a longer average waiting time for the user. All this should be properly tested to make sure the rationale described above makes sense. I would say it could be done in another PR.
EDIT: After a second thought, what is written above is not accurate, sorry. Since `_request_insee` sleeps 10 seconds after a 429 error, the number of queries made in the last minute will automatically drop below 30. Consequently, `_wait_api_query_limit` is not needed anymore. But the package should be tested without it to check that everything still works fine.
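The 429 handling described here amounts to a retry loop along these lines (the function name and retry cap are illustrative, not the actual `_request_insee` code):

```python
import time

def request_with_backoff(session, url, max_retries=6, backoff=10):
    """Retry on HTTP 429: sleeping lets older requests age out of the
    server's rolling window, making a client-side limiter redundant."""
    for _ in range(max_retries):
        response = session.get(url)
        if response.status_code != 429:
            return response
        time.sleep(backoff)  # pynsee reportedly waits 10 s before retrying
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```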
Ok, I think my solution is working (on my local machine for now). I use mockups of `requests.Session` and use `requests_cache.CachedSession` from the requests-cache module instead (by the way, it might be a good way to simplify caching in pynsee, but that's another subject). I had some trouble because `get_population` uses multiprocessing, so I also had to mock `multiprocessing.Pool`, but now this seems to work. I have some questions:
- the sqlite database generated by the caching process is going to be huge (for an object stored on GitHub). Retrospectively, I should have guessed that 🙄... Testing only `get_population` (tests\localdata\test_pynsee_localdata.py) results in a 372 MB database. I can of course use multiple databases (say, one per unit test), but that will mean a lot more work on artifacts' storage. @hadrilec do you know what kind of account InseeFrLab is using? The size limit depends on it...
- what caching timeout should we set? For now, I started with 15 days, which could be a good compromise for some databases (SIRENE for instance); we could also set different timeouts for each webservice (6 months for the COG for instance?). @hadrilec @tfardet do you have suggestions?
- I'm just starting to use CI/CD. I might need a hand to rewrite the pipelines to store artifacts 😄!
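The mocking strategy described above can be illustrated with a stdlib-only stand-in (real tests would patch in `requests_cache.CachedSession`; the two classes below are toy versions used only to show the substitution):

```python
class LiveSession:
    """Stand-in for requests.Session; counts real 'network' hits."""
    calls = 0

    def get(self, url):
        LiveSession.calls += 1
        return f"payload for {url}"

class CachedSession(LiveSession):
    """Stand-in for requests_cache.CachedSession: memoises .get()."""
    _cache = {}

    def get(self, url):
        if url not in self._cache:
            self._cache[url] = super().get(url)
        return self._cache[url]

# The code under test would normally instantiate requests.Session; in tests
# the name is monkeypatched so it picks up the cached variant instead.
Session = CachedSession

s = Session()
first = s.get("https://api.insee.fr/series/X")
second = s.get("https://api.insee.fr/series/X")  # served from cache, no network hit
```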
@linogaliana or @avouacr could you please tell us what kind of GitHub account Insee has? I understand @tgrandje's question is: how much data can we store?
The documentation of pynsee relies on jupyter notebooks that are a bit heavy; @RLesur complained about it in #30. Consequently, @tgrandje, I guess the account is quite limited in terms of storage, to be confirmed by @avouacr. By the way, @tgrandje, if you have the time and willingness, #30 could be a very nice use case for playing with CI/CD 😄
For the caching timeout, I think 15 days is ok.
@hadrilec @tgrandje We use the GitHub Free plan for each of our organizations, which is obviously pretty limited by default. But for various purposes (mainly Onyxia related storage), we buy each year a data pack of Git LFS, which means we have 50GB of LFS storage and 600GB of bandwidth per 15d. But, there is no warranty that we'll continue to buy this storage long-term.
A more sustainable solution could be to use the S3/MinIO storage of the SSP Cloud to store your caching DB, and put it in public so as to let the package/CI job access it via a public URL. I think @linogaliana uses this strategy for the cartiflette package.
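That setup could look roughly like this on the test side (the URL and file names below are hypothetical placeholders, not a real SSP Cloud bucket):

```python
import urllib.request
from pathlib import Path

# Hypothetical public location of the shared cache; not a real bucket.
CACHE_URL = "https://minio.lab.sspcloud.fr/<bucket>/pynsee-tests.sqlite"

def fetch_cache(url, dest):
    """Download the shared SQLite cache once; reuse the local copy after."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)  # plain GET, no credentials needed
    return dest
```

Since the bucket is public, the CI job only needs read access over HTTPS; credentials would come in later for pushing a refreshed cache back.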
Good idea @avouacr. I'll try to sort something out: with everything cached, I've just ended up with 1.7 GB of SQLite files. Do you think this can be deployed as-is (until I manage to work with MinIO)?
Can you also confirm that we can upload to the public MinIO without the (daily) tokens? For cartiflette, I think it's either upload from inside the SSP Cloud (or with tokens), or download from the public cache...
+1 for the S3 solution
@tgrandje It is possible to create a service account that would not expire and could be provided through GitHub Actions secrets. We are not doing that in cartiflette yet because we are still producing files manually, but as you know we will improve that later on.
@tgrandje I think it would be preferable to work with MinIO directly, as putting large files in a git project can end up being tedious to manage. Maybe you could start a PR with the current state of your work on caching, and I could help set up the communication with MinIO?
I think it would be preferable to work with MinIO directly as putting large files in a git project can end up being tedious to manage.
Yeah: @tgrandje if the "deployed as it is" in your comment meant "tracked as files in git", then I really suggest finding another way, as this will be extremely impractical to manage
What I meant was storing files through GitHub Actions' artifacts cache, not through git. That might have been practical enough and might have reduced data transfer, but I understand the cost is not trivial (as a matter of fact, I haven't found how to evaluate it on their pricing calculator).
The current branch is here. I first started to split the SQLite files between modules because I thought the 500 MB threshold displayed in GitHub Actions' documentation was per file: they can easily be re-merged (I'll do that ASAP) to simplify the exchange with MinIO.
It should also be easy to add a try/except in a `pytest_sessionstart`/`pytest_sessionfinish` pair to interact with MinIO. @avouacr I'll set up the functions and let you know where to work.
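Sketched as a conftest.py, that try/except could look like this (`download_cache` and `upload_cache` are placeholder helpers, not existing pynsee code):

```python
# conftest.py (sketch)

def download_cache():
    raise NotImplementedError  # e.g. GET the SQLite cache from the public MinIO URL

def upload_cache():
    raise NotImplementedError  # e.g. PUT it back using a service-account token

def pytest_sessionstart(session):
    try:
        download_cache()
    except Exception as err:  # offline, missing bucket, ...: fall back to live queries
        print(f"cache unavailable, tests will hit the live API: {err}")

def pytest_sessionfinish(session, exitstatus):
    try:
        upload_cache()
    except Exception as err:
        print(f"could not push the cache back: {err}")
```

The point of the broad try/except is that a missing or unreachable cache degrades to slower live-API tests instead of failing the whole session.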
Right now tests are taking a lot of time and probably using a lot of bandwidth downloading everything. It's also a problem when servers are down (like at the moment).
Suggestion:
Pros: faster tests and lower impact on servers; does not fail tests that are irrelevant to the PR.
Cons: need to check manually that relevant tests ran properly for relevant PRs.