Capture tool error stats in Grafana

jennaj commented 5 years ago

@natefoo Last time we talked the first step was to load the error data into Grafana so we could all test out different ways of graphing, setting alerts, etc. Any updates about when that could happen? Or is the data in there but I missed it?

Capturing this in a ticket so we don't lose track of status

hexylena commented 5 years ago

We did this at one point. It was not very advanced, and unfortunately I did not find the data very actionable. Maybe for you all it would be nice to say "oh this destination is misbehaving" but since we only had 1-2 destinations it was not so interesting.

Edit: I am not trying to dissuade you, just wanted to let you know our experience. I hope you find ways to make this actionable

jennaj commented 5 years ago

I was thinking more about usage metrics, but if you pool enough usage, certainly server/cluster issues could be detected (Main's jobs go to many queues...).

Even something very simple would be useful but I'm a bit naive about what actual details could be parsed out. Something like this? Rather than waiting for bug reports or Qs.

tool
version
N total jobs
N success
N failed
N executing
N queued
N canceled while queued
N canceled while executing
success-rate mean, historical
success-rate mean, for time period selected
failure-rate mean, historical
failure-rate mean, for time period selected
stats around diffs of means (initially or add after once we have the other data to play around with)
most_common_failure_error, for time period selected (tricky? would require some stdout/stderr parsing to normalize, eg: XXX dataset names)
most_common_failure_job_queue_or_whatever (cluster?)
most_recent_failure_error, for time period selected (full error message, unparsed)
most_recent_failure_server/cluster_details (same as on job info page, admin view near the bottom)
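
A minimal sketch of how the count and rate fields in the list above could be derived from raw job records. The record layout, the state names (`ok`, `error`, `running`, `queued`) and the sample tools are assumptions made for illustration, not the actual Galaxy data model.

```python
from collections import Counter, defaultdict

# Hypothetical job records; field names and state values are assumptions.
JOBS = [
    {"tool": "bwa_mem", "version": "0.7.17", "state": "ok"},
    {"tool": "bwa_mem", "version": "0.7.17", "state": "ok"},
    {"tool": "bwa_mem", "version": "0.7.17", "state": "error"},
    {"tool": "bwa_mem", "version": "0.7.17", "state": "queued"},
    {"tool": "fastqc", "version": "0.11", "state": "ok"},
    {"tool": "fastqc", "version": "0.11", "state": "running"},
]


def summarize(jobs):
    """Count jobs per (tool, version) and state, then derive a success rate."""
    counts = defaultdict(Counter)
    for job in jobs:
        counts[(job["tool"], job["version"])][job["state"]] += 1

    summary = {}
    for key, states in counts.items():
        # Only finished jobs count toward the success rate.
        finished = states["ok"] + states["error"]
        summary[key] = {
            "total": sum(states.values()),
            "success": states["ok"],
            "failed": states["error"],
            "executing": states["running"],
            "queued": states["queued"],
            "success_rate": states["ok"] / finished if finished else None,
        }
    return summary
```

Queued and running jobs are kept out of the success rate here so that a long queue does not look like a drop in success.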

Then we could look for patterns in failure-rate differences over time and get an early warning for tools whose fail/success rates spike away from the historical mean, triggering an email once we know what the "normal" variances are. This goes a few steps beyond tool tests: these are actual user usage metrics, the big picture. It could indicate server problems, but also tools that may need more love: tool form help, defensive warnings for "bad/incomplete" input, or a tutorial/FAQ. I can think of lots of ways to use that kind of data to make decisions.
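
One possible way to flag an away-from-historical-mean spike is a simple standard-deviation test on a tool's failure rate. This is only a sketch: the function name, the threshold of 3 standard deviations, and the sample weekly rates are all assumptions, and the real "normal" variance would have to come from collected data.

```python
from statistics import mean, stdev


def failure_spike(history, current, threshold=3.0):
    """Return True if `current` is more than `threshold` sample standard
    deviations above the mean of the historical failure rates."""
    if len(history) < 2:
        return False  # not enough data to estimate a variance
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any increase counts as a spike.
        return current > mu
    return (current - mu) / sigma > threshold


# Hypothetical weekly failure rates for one tool.
weekly_failure_rates = [0.04, 0.05, 0.03, 0.06, 0.05]
```

With these sample values, `failure_spike(weekly_failure_rates, 0.30)` would flag the tool, while a week at `0.06` would not.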

Totally open to other ideas. The goal is just to get some basic usage metrics. If a tool is failing too much, we should find out why and try to remedy that, or at least be aware of it so we can prioritize what to work on (including "soft" changes: form help/tips, FAQs, tutorials, form element placement).

Thoughts?

hexylena commented 5 years ago

I think those are all really good questions and actionable at that!

We would need to track "jobs finishing in the past hour" (state, exit code, maybe the 'info'), run that on a telegraf cron, and push it to influx. There would occasionally be a couple of duplicate or missed data points, but probably within acceptable losses.
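
A rough sketch of the "cron job emits points for influx" step, formatting hourly counts as InfluxDB line protocol. The measurement name `tool_jobs`, the tag and field names, and both helper functions are hypothetical. Truncating the timestamp to the start of the hour is one assumed way to soften the duplicate problem, since a point written again with the same measurement, tags and timestamp replaces the earlier one rather than adding a second.

```python
def escape_tag(value):
    """Escape the characters that are special in line-protocol tag values."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def hour_bucket_ns(epoch_seconds):
    """Truncate a Unix timestamp to the start of its hour, in nanoseconds."""
    return (epoch_seconds // 3600) * 3600 * 10**9


def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point: measurement,tag=value field=1i timestamp."""
    tag_part = ",".join(f"{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
    # All fields here are integer counts, hence the trailing "i".
    field_part = ",".join(f"{k}={v}i" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    # Hypothetical count: 3 failed jobs for one tool in the hour containing t=3725.
    print(to_line("tool_jobs", {"tool": "bwa_mem", "state": "error"},
                  {"count": 3}, hour_bucket_ns(3725)))
```

A script like this could be run periodically by a collector that reads line protocol from a command's standard output.
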

> tools that have away-from-historical-mean fail/success spikes

I do not know how to do this in influx+grafana, it is very difficult to alert on every tool, but alerting on a few key tools could be done. I wonder how common it is that a specific tool suddenly starts failing? Maybe due to a new, bad revision being pushed out?

These are answerable questions though, I didn't have this clear of a picture in my head for the original issue. Additionally I did not find the time to ask all of these good questions when we originally did it + we did not have the volume of data needed to make it useful.

jennaj commented 5 years ago

> I wonder how common it is that a specific tool suddenly starts failing? Maybe due to a new, bad revision being pushed out?

Not super common, but when it does happen, it usually manifests as a flurry of bug reports, then firefighting proceeds (sometimes immediate, sometimes not). There has to be a better way of proactively tracking failures rather than relying on bug reports. Plus, bug numbers don't give the actual failure numbers -- only numbers about how many reported it, and even that isn't really captured well anywhere (could be another set of data points, parsed from submitted bugs).

Talking with @natefoo it seems the first step is to get the data into Grafana, then we can test out ways to graph/interpret the data. So not knowing exactly how we'll use it at first is OK.

hexylena commented 5 years ago

> Talking with @natefoo it seems the first step is to get the data into Grafana, then we can test out ways to graph/interpret the data

yep, exactly what I'd suggest as well, it's easy enough to start collecting the data and can just figure it out later :)

There are probably two options, @natefoo. It looks like I already added an 'influx' type backend for error reporting, which is what we used when we last tried it, but I don't think it reports which queue a job went to. Or you could write some SQL queries, helpfully add them to gxadmin, and then run them with telegraf ;)
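
For the SQL route, a query of roughly this shape would break job counts down by tool, destination and state. It is a sketch run against a toy in-memory SQLite table: the table name `job` and the columns `tool_id`, `destination_id`, `state` and `update_time` are assumptions for illustration and have not been checked against the real Galaxy schema, and this is not an existing gxadmin query.

```python
import sqlite3

# Hypothetical query: job counts per tool, destination and state since a cutoff.
QUERY = """
SELECT tool_id, destination_id, state, COUNT(*) AS n
FROM job
WHERE update_time >= :since
GROUP BY tool_id, destination_id, state
ORDER BY tool_id, destination_id, state
"""


def build_toy_db():
    """Create an in-memory table with a few made-up job rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job (tool_id TEXT, destination_id TEXT, "
        "state TEXT, update_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO job VALUES (?, ?, ?, ?)",
        [
            ("bwa_mem", "slurm_a", "ok", "2020-01-02 10:00:00"),
            ("bwa_mem", "slurm_a", "error", "2020-01-02 10:05:00"),
            ("bwa_mem", "slurm_b", "error", "2020-01-02 10:10:00"),
            ("bwa_mem", "slurm_b", "error", "2020-01-02 10:20:00"),
            ("fastqc", "slurm_a", "ok", "2020-01-01 09:00:00"),
        ],
    )
    return conn


def jobs_by_destination(conn, since):
    """Run the query; ISO-formatted timestamps compare correctly as text."""
    return conn.execute(QUERY, {"since": since}).fetchall()
```

Grouping by destination is what would let a dashboard separate "this tool is broken" from "this queue is broken".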

> Plus, bug numbers don't give the actual failure numbers -- only numbers about how many reported it and even that isn't really captured well anywhere

This was exactly the reason I tried measuring once, without any specific agenda. I was just worried that I was missing a lot of failures due to people not reporting them as much as they should.

jennaj commented 4 years ago

Grafana has some status now (tools 100% failing, one-week "blocks" of time, only testing usegalaxy.org). The gxadmin utility has more options but could be tuned.

https://stats.galaxyproject.org/d/Q3_EmS_Wk/main-stats?orgId=1

We can work from that to get the rest. Maybe create a toy database that isn't private. Run some public data/workflows. Hope to produce failures. Customize on a smaller dataset.

galaxyproject / usegalaxy-playbook

Capture tool error stats in Grafana #178