InseeFrLab / pynsee

The pynsee package contains tools to easily search and download French data from the INSEE and IGN APIs
https://pynsee.readthedocs.io/en/latest/
MIT License

Use cached data for tests? #164

Open tfardet opened 1 year ago

tfardet commented 1 year ago

Right now tests are taking a lot of time and probably using a lot of bandwidth downloading everything. It's also a problem when servers are down (like at the moment).

Suggestion:

Pros: faster tests and a lower impact on the servers; tests irrelevant to a PR no longer fail.

Cons: need to check manually that the relevant tests ran properly for relevant PRs.

hadrilec commented 1 year ago

if you have ideas on how to do it, feel free to make a PR

tgrandje commented 1 year ago

I have some. @tfardet, do you want to do it, or should I assign myself?

tgrandje commented 1 year ago

Haven't had the time to work on it yet (and I'm leaving for 15 days). @tfardet, if you have the time to give it a try before I'm back, don't hesitate :-)

tgrandje commented 1 year ago

Ok, I think my solution is working (on a local machine for now). I mock requests.Session and substitute requests_cache.CachedSession from the requests-cache module (which might, by the way, be a good way to simplify pynsee's own caching, but that's another subject).

I had some trouble because get_population uses multiprocessing, so I also had to mock multiprocessing.Pool, but now this seems to work.

I have some questions:

  • the sqlite database generated by the caching process is going to be huge (for an object stored on GitHub). In retrospect, I should have guessed that 🙄... Testing only get_population (tests\localdata\test_pynsee_localdata.py) already results in a 372 MB database. I can of course use multiple databases (say, one per unit test), but that will mean a lot more work on artifact storage. @hadrilec, do you know what kind of account InseeFrLab is using? The size limit depends on it...
  • what caching timeout should we set? For now I started with 15 days, which could be a good compromise for some databases (SIRENE, for instance); we could also set different timeouts for each web service (6 months for the COG, for instance?). @hadrilec @tfardet, do you have suggestions?
  • I'm just starting to use CI/CD. I might need a hand rewriting the pipelines to store artifacts 😄!

tgrandje commented 1 year ago

@hadrilec just one more question. For testing purposes, I'm caching results by monkeypatching .get methods, which means the caching happens after the calls to _wait_api_query_limit. I see that _request_insee also handles the 429 error code (which IMHO should be enough to handle the bottleneck).

Is there a technical reason to have both?

Do you think it would cause trouble to patch _wait_api_query_limit (and nullify its effect) while testing? If we adjust the caching timeout correctly, we should (mostly) rely on the caching backend (except for once every 15 days...).

hadrilec commented 1 year ago

@hadrilec just one more question. For testing purposes, I'm caching results by monkeypatching .get methods, which means the caching happens after the calls to _wait_api_query_limit. I see that _request_insee also handles the 429 error code (which IMHO should be enough to handle the bottleneck).

Is there a technical reason to have both?

Do you think it would cause trouble to patch _wait_api_query_limit (and nullify its effect) while testing? If we adjust the caching timeout correctly, we should (mostly) rely on the caching backend (except for once every 15 days...).

I think either you or @tfardet modified _request_insee in order to catch error 429. The main objective of _wait_api_query_limit is to compute the minimum waiting time required to avoid a 429 error: the function tracks when the last queries were made, then sleeps long enough that at most 30 queries fall within the last 60 seconds. I don't know whether, if you make a new request while you have already made too many in the last minute, the API adds this new request to the total or not. If it does, and _wait_api_query_limit is disabled, the waiting time might be infinite, because _request_insee sleeps only 10 seconds, while a longer time might be needed for the number of queries made in the last minute to drop below 30.

In a nutshell, if I understood correctly how this works, I would say that yes, we could remove _wait_api_query_limit. But in that case, either the minimum sleeping time should be computed (and _wait_api_query_limit comes back), or it should default to 60 seconds, which might mean a longer average waiting time for the user. All this should be properly tested to make sure the rationale above makes sense. I would say it could be done in another PR.

EDIT: On second thought, what is written above is not accurate, sorry. Since _request_insee sleeps 10 seconds after each 429 error, the number of queries made in the last minute will automatically drop below 30. Consequently, _wait_api_query_limit is not needed anymore. But the package should be tested without it, to check that everything works fine without this function.
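The sliding-window behaviour attributed to _wait_api_query_limit above can be sketched as follows. The class name, structure, and injectable clock are illustrative, not pynsee's actual implementation; only the 30-queries-per-60-seconds limit comes from the thread.

```python
import time
from collections import deque

MAX_QUERIES, WINDOW = 30, 60.0  # the API limit discussed in the thread

class QueryLimiter:
    """Sketch of the sliding-window idea: sleep just long enough that
    at most MAX_QUERIES calls fall within the last WINDOW seconds."""

    def __init__(self):
        self.timestamps = deque()  # monotonic times of recent queries

    def wait(self, now=None, sleep=time.sleep):
        """Block (via `sleep`) until a new query is allowed, then record it."""
        now = time.monotonic() if now is None else now
        # drop timestamps that have left the window
        while self.timestamps and now - self.timestamps[0] >= WINDOW:
            self.timestamps.popleft()
        if len(self.timestamps) >= MAX_QUERIES:
            # sleep until the oldest recorded call exits the window
            delay = WINDOW - (now - self.timestamps[0])
            sleep(delay)
            now += delay
            self.timestamps.popleft()
        self.timestamps.append(now)
```

With an injected clock and sleep callback, the logic is easy to test: 30 calls in the first 30 seconds go through immediately, and the 31st must wait until the first one is a full minute old.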

hadrilec commented 1 year ago

Ok, I think my solution is working (on a local machine for now). I mock requests.Session and substitute requests_cache.CachedSession from the requests-cache module (which might, by the way, be a good way to simplify pynsee's own caching, but that's another subject).

I had some trouble because get_population uses multiprocessing, so I also had to mock multiprocessing.Pool, but now this seems to work.

I have some questions:

  • the sqlite database generated by the caching process is going to be huge (for an object stored on GitHub). In retrospect, I should have guessed that 🙄... Testing only get_population (tests\localdata\test_pynsee_localdata.py) already results in a 372 MB database. I can of course use multiple databases (say, one per unit test), but that will mean a lot more work on artifact storage. @hadrilec, do you know what kind of account InseeFrLab is using? The size limit depends on it...
  • what caching timeout should we set? For now I started with 15 days, which could be a good compromise for some databases (SIRENE, for instance); we could also set different timeouts for each web service (6 months for the COG, for instance?). @hadrilec @tfardet, do you have suggestions?
  • I'm just starting to use CI/CD. I might need a hand rewriting the pipelines to store artifacts 😄!

@linogaliana or @avouacr, could you please tell us what kind of GitHub account Insee has? I understand @tgrandje's question is: how much data can we store?

The documentation of pynsee relies on Jupyter notebooks that are a bit heavy; @RLesur complained about it in #30. Consequently, @tgrandje, I guess the account is quite limited in terms of storage. To be confirmed by @avouacr. By the way, @tgrandje, if you have the time and inclination, #30 could be a very nice use case for playing with CI/CD 😄

For the caching timeout, I think 15 days is ok.
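The per-web-service timeouts suggested above could be expressed as a pattern-to-lifetime mapping. The URL patterns and lifetimes below are hypothetical illustrations; note that requests-cache's CachedSession accepts a similar urls_expire_after mapping directly, so this dispatch logic would not need to be hand-written in practice.

```python
from datetime import timedelta
from fnmatch import fnmatch

# Hypothetical per-service cache lifetimes, following the thread's suggestions:
# 15 days as a general compromise, longer for slow-moving reference data.
URL_EXPIRY = {
    "*api.insee.fr/entreprises/sirene*": timedelta(days=15),
    "*metadonnees/V1/codes*": timedelta(days=180),  # e.g. COG-like reference data
    "*": timedelta(days=15),  # default fallback
}

def expiry_for(url):
    """Return the cache lifetime of the first matching URL pattern."""
    for pattern, ttl in URL_EXPIRY.items():
        if fnmatch(url, pattern):
            return ttl
```

Dict insertion order (guaranteed since Python 3.7) makes the catch-all "*" pattern act as the default.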

avouacr commented 1 year ago

@hadrilec @tgrandje We use the GitHub Free plan for each of our organizations, which is obviously pretty limited by default. But for various purposes (mainly Onyxia-related storage), we buy a Git LFS data pack each year, which means we have 50 GB of LFS storage and 600 GB of bandwidth per 15 days. But there is no guarantee that we'll continue to buy this storage long-term.

A more sustainable solution could be to use the S3/MinIO storage of the SSP Cloud to store your caching DB, and make it public so that the package/CI job can access it via a public URL. I think @linogaliana uses this strategy for the cartiflette package.
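Reading the cache DB from a public bucket URL, as proposed above, could be as simple as the following sketch (the URL is a placeholder, not an actual SSP Cloud address):

```python
import pathlib
import urllib.request

# Placeholder address: the real SSP Cloud bucket URL would differ.
CACHE_URL = "https://minio.example/pynsee-cache/pynsee_cache.sqlite"

def fetch_cache(url, dest):
    """Download the shared cache DB once; reuse the local copy afterwards."""
    dest = pathlib.Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest
```

Since the bucket is public, no credentials are needed on the read path; only uploads would require tokens or a service account.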

tgrandje commented 1 year ago

Good idea @avouacr. I'll try to set something up: with everything cached, I ended up with 1.7 GB of SQLite files. Do you think this can be deployed as is (until I manage to work with MinIO)?

Can you also confirm that you can upload to the public MinIO without the (daily) tokens? For cartiflette, I think it's either upload from inside the SSP Cloud (or with tokens), or download from the public cache...

linogaliana commented 1 year ago

+1 for the S3 solution

@tgrandje It is possible to create a service account that would not expire and could be provided via GitHub Actions secrets. We are not doing that in cartiflette yet because we are still producing files manually, but as you know we will improve that later on.

avouacr commented 1 year ago

@tgrandje I think it would be preferable to work with MinIO directly, as putting large files in a git project can end up being tedious to manage. Maybe you could start a PR with the current state of your work on caching, and I could help set up the communication with MinIO?

tfardet commented 1 year ago

I think it would be preferable to work with MinIO directly as putting large files in a git project can end up being tedious to manage.

Yeah: @tgrandje, if the "deployed as it is" in your comment meant "tracked as files in git", then I really suggest finding another way, as this will be extremely impractical to manage.

tgrandje commented 1 year ago

What I meant was storing files through GitHub Actions' artifact cache, not through git. That might have been practical enough and might have reduced data transfer, but I understand the cost is not trivial (as a matter of fact, I haven't found how to evaluate it with their calculator).

The current branch is here. I first split the SQLite files between modules because I thought the 500 MB threshold displayed in the GitHub Actions docs was per file: they can easily be re-merged (I'll do that ASAP) to simplify the exchange with MinIO.

It should also be easy to add a try/except in a pytest_sessionstart/pytest_sessionfinish pair to interact with MinIO. @avouacr, I'll set up the functions and let you know where to work.
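The hook pair mentioned above could look like this conftest.py fragment. The _download_cache/_upload_cache helpers are hypothetical, stubbed here with failing bodies to show that I/O errors are swallowed rather than aborting the test run:

```python
# conftest.py sketch: pull the shared cache DB before the test session,
# push it back afterwards, never failing the run on I/O errors.

def _download_cache():
    # hypothetical helper: would GET the sqlite DB from the public MinIO URL
    raise ConnectionError("no network in this sketch")

def _upload_cache():
    # hypothetical helper: would need write credentials for the bucket
    raise PermissionError("no credentials in this sketch")

def pytest_sessionstart(session):
    try:
        _download_cache()
    except Exception as exc:  # offline or bucket missing: run uncached
        print(f"cache download failed, running without it: {exc}")

def pytest_sessionfinish(session, exitstatus):
    try:
        _upload_cache()
    except Exception as exc:
        print(f"cache upload skipped: {exc}")
```

pytest calls these hooks automatically at the start and end of the session when they are defined in conftest.py; the broad try/except keeps CI green even when the bucket is unreachable.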

tgrandje commented 1 year ago

Ok, so @avouacr, I've just given you rights on my repo. You should be able to work on I/O with MinIO here. What do you think?