galaxyproject / usegalaxy-playbook

Ansible Playbook for usegalaxy.org
Academic Free License v3.0

Capture tool error stats in Grafana #178

Open jennaj opened 5 years ago

jennaj commented 5 years ago

@natefoo Last time we talked, the first step was to load the error data into Grafana so we could all test out different ways of graphing, setting alerts, etc. Any updates about when that could happen? Or is the data in there and I missed it?

Capturing this in a ticket so we don't lose track of status

hexylena commented 5 years ago

We did this at one point. It was not very advanced, but I did not find it to be very actionable data unfortunately. Maybe for you all it would be nice to say "oh this destination is misbehaving" but since we only had 1-2 destinations it was not so interesting.

Edit: I am not trying to dissuade you, just wanted to let you know our experience. I hope you find ways to make this actionable

jennaj commented 5 years ago

I was thinking more about usage metrics, but if you pool enough usage, server/cluster issues could certainly be detected (Main's jobs go to many queues...).

Even something very simple would be useful but I'm a bit naive about what actual details could be parsed out. Something like this? Rather than waiting for bug reports or Qs.

Then see if we can find patterns in failure-rate differences over time and get an early warning for tools that have away-from-historical-mean fail/success spikes, triggering an email once we know what the "normal" variances are. This goes a few steps beyond tool tests: these are actual user usage metrics, the big picture. It could indicate server problems, but also tools that may need more love: tool form help, defensive warnings for "bad/incomplete" input, or a tutorial/FAQ. I can think of lots of ways to use that kind of data to make decisions.
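As a rough illustration of the "away-from-historical-mean" idea, here is a minimal Python sketch. It assumes we already have per-period (ok, error) job counts for a single tool. The function names, the z-score threshold of 3, and the sample counts are all hypothetical; nothing here is provided by Galaxy or Grafana.

```python
from statistics import mean, stdev

def failure_rate(ok, error):
    """Fraction of jobs that failed; 0.0 when there were no jobs."""
    total = ok + error
    return error / total if total else 0.0

def is_spike(history, current, threshold=3.0):
    """Return True if the current failure rate is more than `threshold`
    sample standard deviations above the historical mean.

    history: list of (ok, error) counts for past periods
    current: (ok, error) counts for the period being checked
    """
    rates = [failure_rate(ok, err) for ok, err in history]
    if len(rates) < 2:
        return False  # not enough data to estimate the variance
    mu = mean(rates)
    sigma = stdev(rates)
    now = failure_rate(*current)
    if sigma == 0:
        # Perfectly flat history: any increase counts as a spike.
        return now > mu
    return (now - mu) / sigma > threshold

# Hypothetical weekly counts for one tool: (ok jobs, failed jobs)
history = [(95, 5), (90, 10), (97, 3), (92, 8)]
print(is_spike(history, (93, 7)))   # failure rate close to the historical mean
print(is_spike(history, (40, 60)))  # failure rate far above the historical mean
```

A z-score is only one possible choice. With few jobs per period the rates are noisy, so a minimum job count or a longer window would probably be needed before triggering an email.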

Totally open to other ideas. The goal is just to get some basic usage metrics. If a tool is failing too much, we should find out why and try to remedy that, or at least be aware of it so we can prioritize what to work on (including "soft" changes: form help/tips, FAQs, tutorials, form element placement).

Thoughts?

hexylena commented 5 years ago

I think those are all really good questions and actionable at that!

These are answerable questions though, I didn't have this clear of a picture in my head for the original issue. Additionally I did not find the time to ask all of these good questions when we originally did it + we did not have the volume of data needed to make it useful.

jennaj commented 5 years ago

Not super common, but when it does happen, it usually manifests as a flurry of bug reports, then firefighting proceeds (sometimes immediately, sometimes not). There has to be a better way of proactively tracking failures rather than relying on bug reports. Plus, bug numbers don't give the actual failure numbers -- only numbers about how many reported it and even that isn't really captured well anywhere (that could be another set of data points, parsed from submitted bugs).

Talking with @natefoo it seems the first step is to get the data into Grafana, then we can test out ways to graph/interpret the data. So sort of not knowing exactly how we'll use it at first is Ok.

hexylena commented 5 years ago

> Talking with @natefoo it seems the first step is to get the data into Grafana, then we can test out ways to graph/interpret the data

yep, exactly what I'd suggest as well, it's easy enough to start collecting the data and can just figure it out later :)

There are probably two options, @natefoo. One: it looks like I already added an 'influx' type backend for error reporting, which is what we used when we last tried it, but I don't think it reports which queue a job went to. Two: write some SQL queries, which you could helpfully add to gxadmin and then run with telegraf ;)
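To make the SQL route a bit more concrete, here is a hedged Python sketch of a query whose results are printed as InfluxDB line protocol, which is a format telegraf can ingest. It assumes a `job` table with `tool_id`, `destination_id` and `state` columns. The in-memory SQLite table is only a stand-in for the real database, and the measurement name `tool_job_states` is made up for illustration.

```python
import sqlite3

# Count jobs per tool, destination and state (assumed column names).
QUERY = """
    SELECT tool_id, destination_id, state, COUNT(*)
    FROM job
    GROUP BY tool_id, destination_id, state
    ORDER BY tool_id, destination_id, state
"""

def escape_tag(value):
    """Escape commas, equals signs and spaces in a line-protocol tag value."""
    text = str(value)
    for char in (",", "=", " "):
        text = text.replace(char, "\\" + char)
    return text

def job_state_lines(conn, measurement="tool_job_states"):
    """Return one line-protocol string per (tool, destination, state) group."""
    lines = []
    for tool_id, destination_id, state, count in conn.execute(QUERY):
        tags = ",".join(
            f"{key}={escape_tag(value)}"
            for key, value in (
                ("tool_id", tool_id),
                ("destination", destination_id),
                ("state", state),
            )
        )
        # The trailing "i" marks the field as an integer in line protocol.
        lines.append(f"{measurement},{tags} count={count}i")
    return lines

# Toy stand-in data; not real job records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (tool_id TEXT, destination_id TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO job VALUES (?, ?, ?)",
    [
        ("bwa", "slurm_normal", "ok"),
        ("bwa", "slurm_normal", "error"),
        ("bwa", "slurm_normal", "ok"),
        ("cat1", "local", "ok"),
    ],
)
for line in job_state_lines(conn):
    print(line)
```

Keeping the destination as a tag is what would allow graphing failures per queue, which is the piece the existing error-reporting backend seems to be missing.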

> Plus, bug numbers don't give the actual failure numbers -- only numbers about how many reported it and even that isn't really captured well anywhere

This was exactly the reason I tried measuring once, without any specific agenda. I was just worried that I was missing a lot of failures because people were not reporting them as much as they should.

jennaj commented 4 years ago

Grafana has some status now (tools 100% failing, one-week "blocks" of time, only testing usegalaxy.org). The gxadmin utility has more options but could be tuned.

https://stats.galaxyproject.org/d/Q3_EmS_Wk/main-stats?orgId=1

We can work from that to get the rest. Maybe create a toy database that isn't private. Run some public data/workflows. Hope to produce failures. Customize on a smaller dataset.
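As a starting point for experimenting on a toy dataset, the "tools 100% failing in one-week blocks" view could be approximated with a sketch like the one below. The input format of `(tool_id, state, date)` tuples, the function name, and the sample records are assumptions for illustration only, and only the `error` state is counted as a failure.

```python
from collections import defaultdict
from datetime import date

def fully_failing_tools(jobs, min_jobs=1):
    """Map (ISO year, ISO week) to the sorted tools whose jobs all failed that week.

    jobs: iterable of (tool_id, state, day) tuples, where day is a datetime.date
    min_jobs: ignore tool/week pairs with fewer jobs than this
    """
    counts = defaultdict(lambda: [0, 0])  # (week, tool) -> [total, errors]
    for tool_id, state, day in jobs:
        iso = day.isocalendar()
        key = ((iso[0], iso[1]), tool_id)
        counts[key][0] += 1
        if state == "error":
            counts[key][1] += 1

    result = defaultdict(list)
    for (week, tool_id), (total, errors) in sorted(counts.items()):
        if total >= min_jobs and errors == total:
            result[week].append(tool_id)
    return dict(result)

# Toy records; not real usage data.
toy_jobs = [
    ("bwa", "error", date(2020, 3, 2)),
    ("bwa", "error", date(2020, 3, 4)),
    ("cat1", "ok", date(2020, 3, 3)),
    ("cat1", "error", date(2020, 3, 5)),
    ("bwa", "ok", date(2020, 3, 10)),
]
print(fully_failing_tools(toy_jobs))
```

Raising `min_jobs` would filter out rarely used tools, where a single failed job would otherwise show up as 100% failing.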