PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0
17.58k stars 1.65k forks

Support for Service Level Agreements (SLAs) #15374

Open aaazzam opened 2 months ago

aaazzam commented 2 months ago

Describe the current behavior

Many workflows I author using Prefect represent a data contract I have with a downstream consumer. For my downstream consumers, this looks like "my data must be ready by 1PM". For me as an author, this SLA is expressed relatively and looks like "the duration of my run that starts at 12PM cannot exceed an hour".

Different flows often have different SLAs, and I need to be able to understand and take side effects against SLA violations. Today, I can query for runs whose duration or lateness exceeds a fixed threshold. To know which runs exceed their SLA, I need to hold that SLA metadata somewhere else and set up a separate system to measure and alert me.
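
To illustrate the workaround described above, here is a minimal sketch of holding SLA metadata outside Prefect and comparing it against run durations after the fact. It assumes run records have already been fetched into plain objects; `RunRecord`, `SLA_BY_FLOW`, and `find_violations` are hypothetical names for illustration, not Prefect APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical SLA table maintained outside Prefect, keyed by flow name.
SLA_BY_FLOW = {
    "my_critical_flow": timedelta(hours=1),
    "nightly_report": timedelta(minutes=30),
}


@dataclass
class RunRecord:
    """Hypothetical stand-in for a finished flow run."""
    flow_name: str
    start: datetime
    end: datetime


def find_violations(runs, sla_by_flow):
    """Return the runs whose duration exceeded the SLA for their flow."""
    violations = []
    for run in runs:
        sla = sla_by_flow.get(run.flow_name)
        # Flows without an SLA entry are never reported.
        if sla is not None and run.end - run.start > sla:
            violations.append(run)
    return violations


runs = [
    # Started at 12:00, finished at 13:05: five minutes over a one-hour SLA.
    RunRecord("my_critical_flow", datetime(2024, 9, 1, 12, 0), datetime(2024, 9, 1, 13, 5)),
    # Finished in 20 minutes, inside its 30-minute SLA.
    RunRecord("nightly_report", datetime(2024, 9, 1, 2, 0), datetime(2024, 9, 1, 2, 20)),
]
late = find_violations(runs, SLA_BY_FLOW)
```

The point of the sketch is that the threshold differs per flow, so a single fixed-threshold query is not enough and the lookup table has to live somewhere.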

Describe the proposed behavior

I'd like to be able to attach a datetime.timedelta to a flow or task and observe, view, and take side effects against SLA violations. A potential spelling is:

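from datetime import timedelta
from prefect import flow

@flow
def my_critical_flow(...):
    ...

# Proposed spelling: `sla` is not an existing deploy() parameter.
my_critical_flow.deploy(sla=timedelta(seconds=60))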

The outcome I'm looking for is the ability to find runs that violated an SLA and to trigger a side effect at the time of violation. There are many possible spellings of this. Sometimes I like to keep my SLA violation logic in code, as I do for lifecycle hooks like on_completion; otherwise I prefer a global policy / automation that lets me take a uniform action in response to an SLA violation event.
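
One possible shape for "a side effect at the time of violation" is sketched below, combining per-run hooks with a global policy. This is not how Prefect implements anything; `TrackedRun`, `on_sla_violation`, and `check_sla` are hypothetical names, and the sketch assumes something calls `check_sla` periodically while runs are in progress.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class TrackedRun:
    """Hypothetical stand-in for an in-progress run with an attached SLA."""
    name: str
    start: datetime
    sla: timedelta
    on_sla_violation: List[Callable] = field(default_factory=list)  # per-run hooks
    violated: bool = False


def check_sla(runs, now, global_handlers=()):
    """Fire hooks once for each run that has exceeded its SLA as of `now`."""
    newly_violated = []
    for run in runs:
        if not run.violated and now - run.start > run.sla:
            run.violated = True  # ensures handlers fire only once per run
            for handler in (*run.on_sla_violation, *global_handlers):
                handler(run)
            newly_violated.append(run)
    return newly_violated


alerts = []
notify = lambda run: alerts.append(run.name)  # global policy: same action for every run

run = TrackedRun("my_critical_flow", datetime(2024, 9, 1, 12, 0), timedelta(seconds=60))
check_sla([run], datetime(2024, 9, 1, 12, 0, 30), [notify])  # 30s elapsed: within SLA
check_sla([run], datetime(2024, 9, 1, 12, 2, 0), [notify])   # 120s elapsed: violation fires
check_sla([run], datetime(2024, 9, 1, 12, 3, 0), [notify])   # already flagged: no repeat
```

Per-run hooks cover the "logic in code" preference, while `global_handlers` plays the role of a uniform automation applied to every violation.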

Example Use

No response

Additional context

No response

aaazzam commented 2 months ago

On reflection, this makes much more sense at the deployment level than at the flow or task level.