TheRacetrack / racetrack

An opinionated framework for deploying, managing, and serving application workloads
https://theracetrack.github.io/racetrack/
Apache License 2.0

404 Not Found when calling a starting job with wildcarded version #457

Closed iszulcdeepsense closed 2 months ago

iszulcdeepsense commented 2 months ago

I've deployed job A with the newest job type. A is called by another model, B, and right after A was deployed in a newer version, B threw this error: failed to call job "A 0.x" by B: Client error '404 Not Found' for url 'http://pub:7005/pub/job/A/0.x/api/v1/perform' For more information check: https://httpstatuses.com/404: {"detail":"Not Found"}. It seems that the job gets a status 200 before it is actually callable.

Interesting Lifecycle supervisor logs:

[2024-04-29 08:00:57] DEBUG Invoking hook "infrastructure_targets" of plugin kubernetes-infrastructure 1.4.0
[2024-04-29 08:00:58] WARN Job A v0.29.3 is in bad condition: Job is still initializing, cause=RuntimeError, traceback=/mnt/plugins/extracted/kubernetes-infrastructure-1.4.0/monitor.py:79, /src/lifecycle/lifecycle/monitor/health.py:124
[2024-04-29 08:00:58] DEBUG job A v0.29.3 changed status to: error
[2024-04-29 08:01:01] DEBUG Jobs synchronized, count by status: {'error': 1, 'running': 23}

It seems there was a race condition in the Lifecycle supervisor's sync loop. The sync_registry_jobs loop lists the jobs in a cluster and compares them with the state in the database. With unfortunate timing, the list_jobs function of KubernetesMonitor may discover the new job while it is still initializing and not yet available, and include it in its report with the JobStatus.ERROR.value status.
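
To make the timing problem concrete, here is a minimal, self-contained sketch of how such a race could play out. It is not Racetrack's actual code: the `Job` dataclass, the `sync_jobs` function, the `"starting"` status value, and the `guard_starting` flag are all hypothetical names made up for illustration, and the guard is only one possible mitigation, not necessarily the fix that was applied.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    # Hypothetical status values; only "error" and "running" appear in the logs above.
    STARTING = "starting"
    RUNNING = "running"
    ERROR = "error"


@dataclass
class Job:
    name: str
    version: str
    status: str


def sync_jobs(database: dict[str, Job], cluster_report: list[Job], guard_starting: bool) -> None:
    """Copy the statuses reported by a cluster monitor into the database records."""
    for reported in cluster_report:
        key = f"{reported.name} v{reported.version}"
        stored = database.get(key)
        if stored is None:
            # Job seen in the cluster but unknown to the database: record it as reported.
            database[key] = reported
            continue
        still_starting = stored.status == JobStatus.STARTING.value
        reported_error = reported.status == JobStatus.ERROR.value
        if guard_starting and still_starting and reported_error:
            # Assumed mitigation: a job that is still being deployed is not
            # downgraded by a monitor report taken in the middle of its startup.
            continue
        stored.status = reported.status


# The monitor polls while job A is still initializing and reports it as "error".
report = [Job("A", "0.29.3", JobStatus.ERROR.value)]

unguarded = {"A v0.29.3": Job("A", "0.29.3", JobStatus.STARTING.value)}
sync_jobs(unguarded, report, guard_starting=False)

guarded = {"A v0.29.3": Job("A", "0.29.3", JobStatus.STARTING.value)}
sync_jobs(guarded, report, guard_starting=True)
```

Without the guard, the sync pass overwrites the starting job's status with `error`, which matches the "changed status to: error" line in the logs. With the guard, the status is left to the deployment process until the job has finished initializing.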