InseeFrLab / pynsee

The pynsee package contains tools to easily search and download French data from the INSEE and IGN APIs
https://pynsee.readthedocs.io/en/latest/
MIT License

Set http caching for tests #207

Open tgrandje opened 1 month ago

tgrandje commented 1 month ago

The current test results should not be taken into account at all. This PR will need to be merged first so it can be tested against the updated workflow (with requests-cache installed).

In time, we can add a pytest_sessionfinish hook to preserve the cache on an external store (most probably sspcloud's S3) between test runs.
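A rough sketch of the idea, assuming s3fs handles the upload (the cache file name, bucket and paths are purely illustrative, not the PR's actual code):

# conftest.py -- illustrative sketch only: persist the local requests-cache
# once the pytest session ends, so the next run can reuse it.
import os
import s3fs

LOCAL_CACHE = "tests/http_cache.sqlite"                      # hypothetical cache file
REMOTE_CACHE = "s3://some-bucket/pynsee/http_cache.sqlite"   # hypothetical S3 path


def pytest_sessionfinish(session, exitstatus):
    """Upload the HTTP cache to the external store after the test session."""
    if os.path.exists(LOCAL_CACHE):
        fs = s3fs.S3FileSystem()  # credentials read from the environment
        fs.put(LOCAL_CACHE, REMOTE_CACHE)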

hadrilec commented 1 month ago

It looks great. Could you please modify the test .py files accordingly? Some tests run only on specific Python versions. Thanks.

tgrandje commented 1 month ago

Oh, so that's why the 3.8 tests were almost always conclusive :-). Sorry about that, I missed it since my machine is currently running Python 3.10...

I'll remove all Python version checks; that should be fine with the refactored Python version handling in the workflow.

tgrandje commented 1 month ago

Done

tfardet commented 1 month ago

@tgrandje the tests timed out, so something might not be working as expected with caching... or it's just that running all the tests is too slow; maybe that's why the version check was there?

tgrandje commented 1 month ago

I'd say the second one. As I removed the test selection on 3.8, those runs should take longer, and lately I think 3.8 has sometimes been the only one to finish. I'd say we should try to merge this into the main branch (if you don't see anything amiss, that is) and rerun some tests. If that fails, we can revert the merge; the changes are small enough. Just to be sure, I'll recheck the tests on my machine (I don't think I've re-run them).

tgrandje commented 1 month ago

Well, tests on my machine seem to be inconclusive: there might be something wrong with my proxy (and/or with me switching back and forth between branches and juggling tempdirs).

The tests haven't all finished, and I already have failures on some tests which ran correctly on GitHub. For instance, I got a failure on the 3rd test (out of 4) in test_pynsee_download, and it's perfectly fine on GitHub...

If I have the time, I'll try to re-run some of those this evening on a personal laptop outside my office's network (i.e. without proxies). But I may be short on time... If you don't hear from me before tomorrow, feel free to do what you think is best!

tgrandje commented 1 month ago

OK, so I ran the tests outside of a corporate network, with Python 3.9.

I got 3 failing tests (which were previously silenced as they only ran on 3.8):

FAILED tests/localdata/test_pynsee_localdata.py::TestFunction::test_get_geo_list_1 - xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
FAILED tests/localdata/test_pynsee_localdata.py::TestFunction::test_get_local_data_1 - xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
FAILED tests/localdata/test_pynsee_localdata.py::TestFunction::test_get_local_data_all - xml.etree.ElementTree.ParseError: syntax error: line 1, column 0

I've seen something about the arrow/gpd issue on the last one, but I'm not certain it triggered the failure. The 2 other failing calls seem to run fine outside of pytest, so maybe there is indeed something wrong with the patching. I will continue when I'm back in 2 weeks (or if anybody has time to spare, feel free to pursue this).

tgrandje commented 1 month ago

Forget that. I ran it again (using the cache) and those 3 failed tests now pass. Maybe it had to do with a failure on the source's side, but the tests are OK.

tgrandje commented 2 weeks ago

[!NOTE] Personal note: test computation times, without cache (results for tests > 30 s), on my personal machine:

230.43s call     tests/macrodata/test_pynsee_macrodata.py::TestFunction::test_get_series_1
204.96s call     tests/geodata/test_pynsee_geodata.py::TestFunction::test_get_geodata_short
187.06s call     tests/localdata/test_pynsee_localdata.py::TestFunction::test_get_geo_list_1
169.76s call     tests/download/test_pynsee_download.py::MyTests::test_download_big_file
159.25s call     tests/sirene/test_pynsee_sirene.py::TestFunction::test_request_sirene
131.43s call     tests/macrodata/test_pynsee_macrodata.py::TestFunction::test_get_date
118.71s call     tests/localdata/test_pynsee_localdata.py::TestFunction::test_get_local_data_all
113.11s call     tests/localdata/test_pynsee_localdata.py::TestFunction::test_get_local_data_latest_error
91.86s call     tests/macrodata/test_pynsee_macrodata.py::TestFunction::test_get_dataset_metadata_1
76.47s call     tests/macrodata/test_pynsee_macrodata.py::TestFunction::test_get_column_title_1
74.69s call     tests/macrodata/test_pynsee_macrodata.py::TestFunction::test_get_dataset_1
66.01s call     tests/download/test_pynsee_download.py::MyTests::test_download_file_all
57.10s call     tests/sirene/test_pynsee_sirene.py::TestFunction::test_search_sirene
51.43s call     tests/macrodata/test_pynsee_macrodata.py::TestFunction::test_download_idbank_list_1
43.04s call     tests/macrodata/test_pynsee_macrodata.py::TestFunction::test_get_series_list_1
40.78s call     tests/localdata/test_pynsee_localdata.py::TestFunction::test_get_area_list_1
40.30s call     tests/localdata/test_pynsee_localdata.py::TestFunction::test_get_included_area
40.12s call     tests/localdata/test_pynsee_localdata.py::TestFunction::test_get_population

All tests passed in 2077.28s (0:34:37) with Python 3.9.

tgrandje commented 2 weeks ago

OK, I've gone further than I thought I would. So far:

I still have a problem with 3 tests (test_pynsee_localdata.py::test_get_geo_list_1, test_pynsee_localdata.py::test_get_local_data_1, test_pynsee_localdata.py::test_get_local_data_all). They seem to fail due to some web miscommunication, but when I run only those 3 (without the cache), they seem to pass. I'll run some further tests to see whether this is linked to the caching or not.

tgrandje commented 1 week ago

It seems there is definitely something buggy with requests-cache and those 3 tests.

I'll transfer the 2 patches on proxies/exceptions into a separate PR in case I can't resolve this bug...

tgrandje commented 1 week ago

There is something very strange going on with this bug.

As far as I understand, requests-cache uses requests under the hood to send HTTP(S) requests, and those are not altered (with the sole exception of the multipart form boundary, which is set to a constant to allow the hashing of queries). Yet when I use the requests-cache patch, I always get some strange results, mostly on those 3 tests, so that doesn't seem to be a coincidence.

But when I track the HTTP response, I see that for those failures we get perfectly valid XML results from INSEE's API instead of JSON. This is kind of funny, as the server is not meant to serve XML results (according to the API's documentation, anyway...).

For instance, for https://api.insee.fr/metadonnees/V1/geo/departements, I got:

<Departements>
    <Departement code="01" uri="http://id.insee.fr/geo/departement/84680e6f-2e99-44c9-a9ba-2e96a2ae48b7">
        <Intitule>Ain</Intitule>
        <Type>Departement</Type>
        <DateCreation>1967-12-31</DateCreation>
        <IntituleSansArticle typeArticle="5">Ain</IntituleSansArticle>
        <ChefLieu>01053</ChefLieu>
    </Departement>
    ...
    <Departement code="976" uri="http://id.insee.fr/geo/departement/7f7fe4af-ae1c-4641-80ef-5c8199aea45c">
        <Intitule>Mayotte</Intitule>
        <Type>Departement</Type>
        <DateCreation>2011-03-31</DateCreation>
        <IntituleSansArticle typeArticle="0">Mayotte</IntituleSansArticle>
        <ChefLieu>97611</ChefLieu>
    </Departement>
</Departements>

I think I've tracked every argument (proxies, stream, headers, ...) sent to the webserver and I can't see a difference between valid and invalid results. I've also run the 3 problematic(?) tests alone (with pytest and the requests-cache patching), and everything worked fine. I've started commenting out tests randomly in the file and even managed to get those strange results on another test altogether. I've also tried a 10-second delay between each test, and that had no impact.

I'm starting to think this might be a webserver failure, but I'm not sure either (why mostly those 3 tests, why only when patching, why would the delay have no impact...). If that's really the case, I see 2 ways out of here:

@hadrilec @tfardet do you see anything I'm missing?

(If anybody is ready to test this locally, please send me an email and I'll send you the S3 credentials.)

tgrandje commented 1 week ago

OK, colleagues from INSEE put me on a lead and I finally got this after much debugging. It appears requests-cache does not take the Accept header into account when hashing requests, and we still have some XML queries in the localdata module.
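For reference, a minimal sketch of the kind of fix this implies, assuming requests-cache's match_headers option (the cache name and backend below are only illustrative, not necessarily what the PR uses):

import requests_cache

# Include the Accept header in the cache key, so that JSON and XML queries to
# the same URL do not collide in the cache.
session = requests_cache.CachedSession(
    "pynsee-tests",            # illustrative cache name
    backend="sqlite",
    match_headers=["Accept"],
)

# These two requests now get distinct cache entries:
session.get("https://api.insee.fr/metadonnees/V1/geo/departements",
            headers={"Accept": "application/json"})
session.get("https://api.insee.fr/metadonnees/V1/geo/departements",
            headers={"Accept": "application/xml"})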

Regarding the (apparently) random results, that's entirely the result of me selecting different tests and/or of improper caching throughout that history...

For memory's sake:

I still have to check different backends on different machines to determine which is fastest a priori, but I hope we can soon test this directly on GitHub.

tgrandje commented 1 week ago

The best backend seems to be SQLite (I only tested SQLite vs. filesystem, and that took long enough 🙄).

Other features in this PR I've forgotten to describe in the conversation include the following:

I'm still not sure we should keep the compression of the artifacts: that will have to be tested in real conditions on GitHub. The fact is that with compression, restoring the cache is fast (~3 min), but storing it is really slow (around 19 min on the same machine).
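For context, a hedged sketch of what the compression step amounts to, assuming py7zr (the function and file names are illustrative):

import py7zr

# Pack the cache directory into a 7z archive before uploading it as an
# artifact (the slow step), and unpack it again at session start (the fast step).
def compress_cache(cache_dir: str, archive_path: str = "http_cache.7z") -> None:
    with py7zr.SevenZipFile(archive_path, "w") as archive:
        archive.writeall(cache_dir)

def restore_cache(archive_path: str, target_dir: str = ".") -> None:
    with py7zr.SevenZipFile(archive_path, "r") as archive:
        archive.extractall(path=target_dir)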

Further modifications to consider (not yet done, but for memory's sake):

In any case, for now I'd say someone can review this PR. If it's OK, we can move on to real-condition tests and iterate from there.

hadrilec commented 4 days ago

Hello, I tried to run the tests on Onyxia and got the following error. Any idea how to proceed?

trying to restore artifact from SSP Cloud
downloading ...
took 5sec
extracting archive...
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/opt/conda/lib/python3.12/site-packages/_pytest/main.py", line 281, in wrap_session
INTERNALERROR>     config.hook.pytest_sessionstart(session=session)
INTERNALERROR>   File "/opt/conda/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
INTERNALERROR>     return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/opt/conda/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/opt/conda/lib/python3.12/site-packages/pluggy/_callers.py", line 139, in _multicall
INTERNALERROR>     raise exception.with_traceback(exception.__traceback__)
INTERNALERROR>   File "/opt/conda/lib/python3.12/site-packages/pluggy/_callers.py", line 122, in _multicall
INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
INTERNALERROR>     ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/opt/conda/lib/python3.12/site-packages/_pytest/logging.py", line 784, in pytest_sessionstart
INTERNALERROR>     return (yield)
INTERNALERROR>             ^^^^^
INTERNALERROR>   File "/opt/conda/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>           ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/home/onyxia/work/pynsee/tests/conftest.py", line 175, in pytest_sessionstart
INTERNALERROR>     with py7zr.SevenZipFile(archive_path, "r") as archive:
INTERNALERROR>          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/opt/conda/lib/python3.12/site-packages/py7zr/py7zr.py", line 416, in __init__
INTERNALERROR>     raise e
INTERNALERROR>   File "/opt/conda/lib/python3.12/site-packages/py7zr/py7zr.py", line 397, in __init__
INTERNALERROR>     self._real_get_contents(password)
INTERNALERROR>   File "/opt/conda/lib/python3.12/site-packages/py7zr/py7zr.py", line 436, in _real_get_contents
INTERNALERROR>     raise Bad7zFile("not a 7z file")
INTERNALERROR> py7zr.exceptions.Bad7zFile: not a 7z file

tgrandje commented 3 days ago

Apparently the file is corrupted (I downloaded it and checked it with other software). Did you run the tests multiple times? (For now, the MinIO console is down, so we're stuck with the interface through Onyxia, which only displays the name of the file...)

On any machine (running the tests manually, of course), you can also remove the previous cache by adding the --clean-cache flag to the pytest command.
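For the record, a hypothetical sketch of how such an option can be wired into conftest.py (the actual implementation in the PR may differ, and the cache path is illustrative):

# conftest.py -- sketch of a --clean-cache option for pytest
import os


def pytest_addoption(parser):
    parser.addoption(
        "--clean-cache",
        action="store_true",
        default=False,
        help="remove the previous HTTP cache before running the tests",
    )


def pytest_sessionstart(session):
    if session.config.getoption("--clean-cache"):
        cache_file = "tests/http_cache.sqlite"  # hypothetical cache location
        if os.path.exists(cache_file):
            os.remove(cache_file)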

That being said, we should add a check on the 7z file's integrity before uploading it to sspcloud... 🤔 Unpacking is quite fast, so this should not slow the tests down too much.
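Something along these lines could do, assuming py7zr (the function name is illustrative):

import py7zr

def archive_is_valid(archive_path: str) -> bool:
    """Check a freshly written 7z archive before uploading it."""
    if not py7zr.is_7zfile(archive_path):
        return False
    try:
        with py7zr.SevenZipFile(archive_path, "r") as archive:
            # testzip() returns the name of the first corrupted member,
            # or None if every CRC checks out
            return archive.testzip() is None
    except py7zr.exceptions.Bad7zFile:
        return False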

tgrandje commented 1 day ago

@hadrilec I've just added a quick fix to handle corrupted 7z files (I've also reverted some changes which were meant to go into another PR).
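Roughly, the idea is to fall back to an empty cache instead of crashing the session when the downloaded archive cannot be read; a sketch under that assumption (not the exact fix), building on the restore sketch above:

import py7zr

def restore_archive(archive_path: str, target_dir: str = ".") -> None:
    """Extract the cached archive; start from scratch if it is corrupted."""
    try:
        with py7zr.SevenZipFile(archive_path, "r") as archive:
            archive.extractall(path=target_dir)
    except py7zr.exceptions.Bad7zFile:
        print("cached archive is corrupted, running with an empty cache")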

If that's OK with you, I'll re-run the tests on my laptop; you should then be able to test (one more time) directly on Onyxia.

tgrandje commented 19 hours ago

OK, that's done.