aurae-runtime / aurae

Distributed systems runtime daemon written in Rust.
https://aurae.io
Apache License 2.0

Runtime Conditions #293

Open · krisnova opened this issue 1 year ago

krisnova commented 1 year ago

We will need to bake in a way for pods, cells, etc. to support generic runtime conditions that must remain true for the duration of execution.

For example, we may want an in-memory cache to only "run" as long as a configurable amount of memory remains available in the system.

These conditions will likely need to be extensible. We will want the ability to check status via various mechanisms such as remote APIs, network connectivity, local health checks, remote health checks, etc.
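
A rough sketch of what an extensible condition could look like, using the memory example above; the `RuntimeCondition` trait and the other names here are hypothetical, nothing in Aurae defines them today:

```rust
// Hypothetical sketch only: none of these types are an existing Aurae API.
trait RuntimeCondition {
    /// Human-readable name for logs and status reporting.
    fn name(&self) -> &str;
    /// Returns Ok(()) while the condition holds, Err(reason) once it breaks.
    fn check(&self) -> Result<(), String>;
}

/// The in-memory cache example above: only keep "running" while a
/// configurable amount of memory is still available on the node.
struct MinFreeMemory {
    min_free_bytes: u64,
}

impl RuntimeCondition for MinFreeMemory {
    fn name(&self) -> &str {
        "min-free-memory"
    }

    fn check(&self) -> Result<(), String> {
        let free = free_memory_bytes();
        if free >= self.min_free_bytes {
            Ok(())
        } else {
            Err(format!("only {free} bytes free, need {}", self.min_free_bytes))
        }
    }
}

// Stubbed for the sketch; a real check would read /proc/meminfo or similar.
fn free_memory_bytes() -> u64 {
    u64::MAX
}

fn main() {
    let cond = MinFreeMemory { min_free_bytes: 512 * 1024 * 1024 };
    println!("{}: {:?}", cond.name(), cond.check());
}
```

Remote APIs, network connectivity, and the other check mechanisms would just be further implementations of the same trait.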


What is the best way to define these conditions in Aurae? Do we want to implement a "reverse health check" style system that runs a proof-of-exhaustion set of checks and breaks if something fails?
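
One possible shape of such a proof-of-exhaustion pass, with entirely made-up condition names and checks:

```rust
fn database_reachable() -> Result<(), String> {
    Ok(())
}

fn min_free_memory() -> Result<(), String> {
    Err("below the configured threshold".to_string())
}

fn upstream_api_healthy() -> Result<(), String> {
    Ok(())
}

fn main() {
    // Illustrative condition set; each check returns Err once it stops holding.
    let conditions: [(&str, fn() -> Result<(), String>); 3] = [
        ("database-reachable", database_reachable),
        ("min-free-memory", min_free_memory),
        ("upstream-api-healthy", upstream_api_healthy),
    ];

    // Proof-of-exhaustion pass: the first condition that breaks stops the
    // loop, which is where the workload would be stopped and the scheduler
    // notified.
    for (name, check) in conditions {
        match check() {
            Ok(()) => println!("{name}: ok"),
            Err(reason) => {
                eprintln!("{name} broke: {reason}; stopping workload");
                break;
            }
        }
    }
}
```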

krisnova commented 1 year ago

Note: We will likely need these at the "service level" as well as at the "node level"

dmah42 commented 1 year ago

this sounds like a scheduling problem ("run the cache on nodes with >X GiB available").

am I thinking of something different?

krisnova commented 1 year ago

I was thinking more about failure modes.

"Run sidekiq as long as we can talk to the database"

I think we want to "fail quickly" in these situations, so that scheduling mechanisms can quickly try to address whatever problem is going on.
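
As an illustration of the fail-quickly idea, a condition like "we can talk to the database" could be a short-timeout TCP probe; the address, port, and timeout below are made-up values, not anything Aurae defines:

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

// Fail-fast connectivity probe: a short timeout means the failure is handed
// back to whatever does scheduling as quickly as possible.
fn database_reachable(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

fn main() {
    let db: SocketAddr = "10.0.0.5:5432".parse().expect("valid socket address");
    if !database_reachable(db, Duration::from_millis(250)) {
        // Failing quickly here is what lets the scheduler react quickly.
        eprintln!("database unreachable; stopping the dependent workload");
        std::process::exit(1);
    }
    println!("database reachable; workload keeps running");
}
```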

krisnova commented 1 year ago

Maybe a better example:

"Point all traffic at production as long as the backend is online, otherwise fail over to the replica"

I am unsure if this is a step in the "Turing-complete YAML" direction again -- this is just a thought I had.
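
A tiny sketch of what that failover example could reduce to, with made-up target names and addresses: a condition paired with a fallback target rather than only a "stop" action:

```rust
struct Target {
    name: &'static str,
    addr: &'static str,
}

fn pick_target(production_healthy: bool) -> Target {
    if production_healthy {
        Target { name: "production", addr: "10.0.0.10:443" }
    } else {
        // The condition broke, so traffic fails over to the declared replica.
        Target { name: "replica", addr: "10.0.1.10:443" }
    }
}

fn main() {
    // In practice this flag would come from a backend health check.
    let production_healthy = false;
    let target = pick_target(production_healthy);
    println!("routing traffic to {} at {}", target.name, target.addr);
}
```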

dmah42 commented 1 year ago

i think of all of these as scheduling issues. something needs to monitor the jobs that were started, and if one of them is no longer running (e.g. the service returns a failure code) then it needs to be rescheduled.

what we might need is plumbing from "job" to outer aurae health check/service discovery.
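
a minimal sketch of that monitoring/rescheduling framing, assuming a placeholder `notify_scheduler` hook (not an existing aurae call):

```rust
use std::process::{Child, Command};
use std::thread;
use std::time::Duration;

// Placeholder: stands in for whatever plumbing hands the failure to the
// outer health check / service discovery / scheduling layer.
fn notify_scheduler(job: &str, code: Option<i32>) {
    eprintln!("job {job} exited with code {code:?}; asking for a reschedule");
}

// Watch a started job; if it exits with a failure code, surface that upward.
fn monitor(job: &str, mut child: Child) {
    loop {
        match child.try_wait() {
            Ok(Some(status)) if !status.success() => {
                notify_scheduler(job, status.code());
                break;
            }
            Ok(Some(_)) => break, // clean exit, nothing to do
            Ok(None) => thread::sleep(Duration::from_secs(1)), // still running
            Err(e) => {
                eprintln!("failed to poll {job}: {e}");
                break;
            }
        }
    }
}

fn main() {
    let child = Command::new("false").spawn().expect("spawn job");
    monitor("example-job", child);
}
```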

krisnova commented 1 year ago

so think about edge networking and failures

What do we do if a "node goes away"? We should have some basic guarantees that a service won't end up running in two places just because WireGuard broke, for example.