Closed hydrosquall closed 6 years ago
@hydrosquall do you think it's possible to have a generalized metric, "duration of oldest dagrun for a specific status"?
For example, sometimes we have a problem where a dagrun is stuck in the "queued" state, and we'd like to alert when a dagrun has been queued for more than one hour.
Hi @elephantum -
I checked the DagRun model, and it looks like there are only 3 possible DagRun states (running, success, failure). I wanted to capture generalized duration for all 3 states, but the problem is that end_date was not always stored on the dagrun.
Queued status alerts are possible, but that felt to me like it would belong to a Duration of TaskInstance metric instead.
You're right. I was thinking about TaskInstance while your PR is about DagRun. I'm checking it locally and merging.
@hydrosquall One more question: as I see it, when we have three simultaneous DagRuns for the same dag_id, we'll have three metrics for this dag_id. Will this actually be useful?
What is your target scenario for monitoring?
Good question! I believe each one of the DagRuns will create a unique row with its own run_id, and it's completely fine for multiple run_ids to share the same dag_id.
In the scenario where I'm using this, 1 dag_id will only have 1 active run at a time. However, I don't believe multiple DagRuns happening concurrently would break the sort of alerting that I want, since I'm interested in being notified if any DagRun runs beyond a certain period of time.
Ok, I can't see any issues with this approach.
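To illustrate the point discussed above, here is a minimal sketch of why several concurrent runs under one dag_id should not interfere with "alert if any run is too long" logic. It is not the exporter's code: the `Sample` tuple shape and the `overdue_runs` and `overdue_dags` helpers are hypothetical names chosen for this example, and the durations are made-up values.

```python
from typing import List, Tuple

# Hypothetical sample shape: (dag_id, run_id, duration_seconds).
# Each concurrent DagRun contributes its own row, keyed by run_id.
Sample = Tuple[str, str, float]


def overdue_runs(samples: List[Sample], threshold_seconds: float) -> List[Tuple[str, str]]:
    """Return (dag_id, run_id) pairs whose duration exceeds the threshold."""
    return [
        (dag_id, run_id)
        for dag_id, run_id, seconds in samples
        if seconds > threshold_seconds
    ]


def overdue_dags(samples: List[Sample], threshold_seconds: float) -> List[str]:
    """Collapse overdue runs to the distinct dag_ids that need attention."""
    return sorted({dag_id for dag_id, _ in overdue_runs(samples, threshold_seconds)})


# Illustrative data only: two concurrent runs of "etl", one run of "report".
samples = [
    ("etl", "run_1", 7200.0),
    ("etl", "run_2", 600.0),
    ("report", "run_1", 30.0),
]
```

With a one-hour threshold, only the long-running `etl` run is flagged, and the second `etl` run neither hides nor duplicates the alert.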
This targets feature request #9. It reports the duration that currently running DagRuns have been running for.
This can be used when people are trying to alert based on DagRuns that have gone on longer than expected.
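As a rough sketch of the idea described here (not the PR's actual implementation), the duration of each running DagRun can be derived from its start time alone, which sidesteps the missing end_date problem mentioned in the discussion. The `DagRunRow` dataclass and `running_durations` function are hypothetical stand-ins for rows read from the Airflow metadata database; the state string "running" is assumed to match the stored value.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class DagRunRow:
    """Hypothetical, simplified view of one DagRun row."""
    dag_id: str
    run_id: str
    state: str
    start_date: Optional[datetime]


def running_durations(
    rows: Iterable[DagRunRow], now: datetime
) -> Dict[Tuple[str, str], float]:
    """Map (dag_id, run_id) to seconds elapsed for runs still in progress."""
    durations: Dict[Tuple[str, str], float] = {}
    for row in rows:
        # Only running runs are reported; finished runs are skipped because
        # their end_date may not have been stored.
        if row.state != "running" or row.start_date is None:
            continue
        durations[(row.dag_id, row.run_id)] = (now - row.start_date).total_seconds()
    return durations


# Illustrative data only.
now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    DagRunRow("etl", "r1", "running", datetime(2020, 1, 1, 10, 30, tzinfo=timezone.utc)),
    DagRunRow("etl", "r0", "success", datetime(2020, 1, 1, 8, 0, tzinfo=timezone.utc)),
]
result = running_durations(rows, now)
```

Passing `now` in explicitly keeps the calculation deterministic, which makes it easier to check against a fixed alert threshold.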