PolicyEngine / policyengine-api

PolicyEngine's REST API for computing policy impacts.
GNU Affero General Public License v3.0
11 stars 26 forks source link

make debug-test fails with 404 error #1954

Closed mikesmit closed 2 weeks ago

mikesmit commented 2 weeks ago

When running make debug-test on the policyengine-api package I get the following error:

==================================== ERRORS ====================================
________________________ ERROR collecting tests/python _________________________
/usr/lib64/python3.11/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1204: in _gcd_import
    ???
<frozen importlib._bootstrap>:1176: in _find_and_load
    ???
<frozen importlib._bootstrap>:1147: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:690: in _load_unlocked
    ???
../policyengine-app/.venv/lib64/python3.11/site-packages/_pytest/assertion/rewrite.py:184: in exec_module
    exec(co, module.__dict__)
tests/python/conftest.py:12: in <module>
    download_microdata()
policyengine_api/download_microdata.py:11: in download_microdata
    dataset.download()
../policyengine-app/.venv/lib64/python3.11/site-packages/policyengine_core/data/dataset.py:308: in download
    raise ValueError(
E   ValueError: Invalid response code 404 for url https://api.github.com/repos/PolicyEngine/ukda/releases/tags/1.9.0.
mikesmit commented 2 weeks ago

From slack chat: Nikhil Woodruff Today at 4:43 AM Ah right, ok: so what’s happening here is the API uses our UK microdata for UK impacts. We’re not allowed to share outside the UK org, and the GH actions and live server use a token to download it. I recently changed the API to downloading it at startup time rather than whenever the first UK impact is received

Nikhil Woodruff Today at 4:44 AM We should probably add either a try except clause or env variable to not do this when testing non-UK parts

mikesmit commented 2 weeks ago

Looking to see if I can locate "I recently changed the API to downloading it at startup time rather than whenever the first UK impact is received"...

Looks like it's related to Split simulations into chunks #1938

Specifically this bit adding download_microdata.py

mikesmit commented 2 weeks ago

Chatted with @nikhilwoodruff this morning. TLDR: we want to load the UK data on demand, but without causing a race condition.

What's going wrong and why

  1. We used to load the file when needed to disk and then into memory
  2. When we added more than one worker task this caused a race condition where more than one worker would try to get the file. The "looser" would find the file, but it would be partially downloaded causing an error
  3. The solution was to pre-download the file before running any workers BUT
  4. The UK data cannot be downloaded to any old host (I.e. my desktop) because of legal issues

SO When you try to run the tests, then it immediately attempts to load UK data even though that is not necessary and fails because you can't access it without special permission.

Solutions

Preferred solution: We're loading this into memory anyway. Instead of caching it on disk we could just load it directly into memory once per worker

Pros:

Alternative - Atomic download

Download to a randomly-named tmp file and then use a move (if you move within the same disk partition this is atomic, not a copy) to the name you want.

Then either the file is there or not. If two processes do it at the same time, they will all download it, but one will "win".

Pros:

mikesmit commented 2 weeks ago

Looking a bit more closely... I'm not 100% confident that I know all the places file_path is used. I think I'll just solve this one problem of making the file download atomic (option 2) in this one case.

mikesmit commented 2 weeks ago

put in a pull request to core here: https://github.com/PolicyEngine/policyengine-core/pull/307 that will make the download/write atomic.

This should let us just back out the preloading of the microdata which will in turn mean we don't get errors for trying to download the UK microdata.