This PR implements periodic refresh of stations and observations data.
In order to add these periodic checks, the PR introduces prefect as a dependency and implements the following prefect flows:
arpav_ppcv.prefect.flows.observations.refresh_stations - connects to the internal API and checks for existing observation stations. As mentioned in #112, this is now using the preferred API endpoint, which only shows stations with long time series
arpav_ppcv.prefect.flows.observations.refresh_monthly_measurements, arpav_ppcv.prefect.flows.observations.refresh_seasonal_measurements and arpav_ppcv.prefect.flows.observations.refresh_yearly_measurements - these connect to the internal API and discover new monthly, seasonal and yearly measurements for known stations
Flows are then configured into relevant deployments by using the prefect.serve() function.
prefect.serve() is very convenient for our use case, as it spawns a server which is able to execute flows locally. This means we can just spin up an additional docker container with the same image as the backend app and use it to perform flow execution. In the current system architecture, this is easily done by adding an additional service to the compose file.
The prefect docs refer to this method as static infrastructure. It is a simpler alternative to prefect's dynamic approach (keep the flow code in separate storage, perhaps in MinIO, create a prefect deployment, and run an additional prefect worker that downloads the flow code from that storage) and is a good match for this system. As such, this PR also introduces a new CLI command, arpav-ppcv prefect start-periodic-tasks, which spawns a dedicated prefect worker for processing the aforementioned flows.
The following new services are thus introduced to the docker compose stack:
prefect-server - This service runs the prefect server, which comprises its scheduler, API and frontend UI;
prefect-static-worker - This uses the main arpav-backend image, but is set to run the CLI command arpav-ppcv prefect start-periodic-tasks. It is effectively a prefect worker that runs the flows mentioned above on a periodic schedule
Periodic schedules
This PR configures flows with the following default schedules:
refresh_stations - runs every Monday at 01h
refresh_monthly_measurements - runs every Monday at 02h
refresh_seasonal_measurements - runs every Monday at 03h
refresh_yearly_measurements - runs every Monday at 04h
These defaults can be changed by setting environment variables.
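A minimal sketch of how such overrides could be resolved, assuming the schedules are expressed as five-field cron strings. The EXAMPLE_PREFECT_CRON__ prefix and the resulting variable names are hypothetical and are not the names used by this PR.

```python
import os
from typing import Mapping, Optional

# Default schedules from the list above, written as cron expressions
# (minute hour day-of-month month day-of-week; 1 = Monday).
DEFAULT_CRON_SCHEDULES = {
    "refresh_stations": "0 1 * * 1",
    "refresh_monthly_measurements": "0 2 * * 1",
    "refresh_seasonal_measurements": "0 3 * * 1",
    "refresh_yearly_measurements": "0 4 * * 1",
}

# HYPOTHETICAL prefix, used only for this illustration.
ENV_PREFIX = "EXAMPLE_PREFECT_CRON__"


def resolve_schedules(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Return the cron string for each flow, preferring environment overrides."""
    environ = os.environ if environ is None else environ
    resolved = {}
    for flow_name, default_cron in DEFAULT_CRON_SCHEDULES.items():
        env_name = f"{ENV_PREFIX}{flow_name.upper()}"
        cron = environ.get(env_name, default_cron).strip()
        # Cheap sanity check: a standard cron expression has five fields.
        if len(cron.split()) != 5:
            raise ValueError(f"{env_name} must be a five-field cron expression")
        resolved[flow_name] = cron
    return resolved
```

Passing a plain dict instead of os.environ makes the override logic easy to exercise without touching the process environment.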
Authentication
The self-hosted version of prefect, which is what this PR introduces, does not include authentication: both the UI and the API are open to whoever can reach them. As such, this PR adds a layer of HTTP Basic Auth, enforced at the traefik level. This ensures that the prefect components exposed to the outside world are guarded by user credentials which, coupled with traefik's TLS certs, should provide basic security.
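For scripts that need to reach the prefect API through this protection, the sketch below shows how an HTTP Basic Auth header is built and attached to a request. The base URL and the credentials are placeholders. The /api/health path is assumed to be the prefect server health endpoint and should be checked against the deployed version.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header defined by HTTP Basic Auth."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def build_health_request(
    base_url: str, username: str, password: str
) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated request to the API."""
    url = f"{base_url.rstrip('/')}/api/health"  # assumed health endpoint
    return urllib.request.Request(url, headers=basic_auth_header(username, password))


# Usage, with a hypothetical host and credentials:
# request = build_health_request("https://prefect.example.org", "user", "pass")
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Basic Auth only encodes the credentials, it does not encrypt them, which is why the TLS termination at traefik matters here.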
observations harvester CLI command
This PR also replaces the standalone CLI commands that were being used to ingest observation-related data with the new prefect flows, so the exact same code runs under all circumstances. The CLI itself did not change; it is now simply powered by the prefect flows. For example:
# this will refresh the stations
arpav-ppcv observations-harvester refresh-stations
The refresh flows leverage prefect's concurrency strategies to run tasks in parallel when possible. This is done by submitting each task with task_future = task.submit() and then collecting its output with task_future.result().