fedora-infra / statscache

A daemon to build and keep fedmsg statistics
GNU Lesser General Public License v2.1

dispatch to statsd #15

Open ralphbean opened 9 years ago

ralphbean commented 9 years ago

Basically, it would be awesome if we could dispatch the output of all our wonderful plugins to other systems -- in particular, to statsd.

Currently, we can't do this as all of our plugins hardcode their interaction with sqlalchemy to store their data (read: side-effects).

Instead, we should have the plugins return their data and then have the calling code dispatch that data to different handlers/callbacks/etc., which then store the data in postgres, or send it to statsd, or memcached, or wherever.
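A minimal sketch of that split, under stated assumptions: `volume_plugin`, `dispatch`, the handler names, and the dict shape of a stat are all hypothetical and not part of statscache. The plugin only returns data; the calling code fans it out. The two handlers append to in-memory lists as stand-ins for a postgres write and a statsd send (the `name:value|c` string follows statsd's counter line format).

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stat shape for this sketch: {"name": str, "value": int}
Stat = Dict[str, object]
Handler = Callable[[Stat], None]


def volume_plugin(messages: Iterable[dict]) -> List[Stat]:
    """Hypothetical plugin: count messages per topic and RETURN the result.

    Note that it performs no storage side effects of its own.
    """
    counts: Dict[str, int] = {}
    for msg in messages:
        counts[msg["topic"]] = counts.get(msg["topic"], 0) + 1
    return [{"name": topic, "value": n} for topic, n in sorted(counts.items())]


def dispatch(stats: Iterable[Stat], handlers: List[Handler]) -> None:
    """Calling code: pass every returned stat to every registered handler."""
    for stat in stats:
        for handler in handlers:
            handler(stat)


# Stand-ins for real backends (a database table and a statsd socket).
stored_rows: List[tuple] = []
statsd_lines: List[str] = []


def store_handler(stat: Stat) -> None:
    stored_rows.append((stat["name"], stat["value"]))


def statsd_handler(stat: Stat) -> None:
    statsd_lines.append(f"{stat['name']}:{stat['value']}|c")


messages = [
    {"topic": "git.receive"},
    {"topic": "buildsys.build"},
    {"topic": "git.receive"},
]
dispatch(volume_plugin(messages), [store_handler, statsd_handler])
```

Adding a new destination (memcached, websockets, ...) would then mean writing one more handler rather than touching any plugin.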

The tricky part here will be to make some kind of common data object that can represent both the time-series information from the volume plugin and the singleton information from the releng plugins.
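One possible shape for such an object, sketched with assumptions: the class names `TimeSeriesPoint` and `Singleton`, their fields, and the `statsd_line` helper are hypothetical, not an existing statscache API. The idea is two small record types under one union, so each handler can decide which kinds it understands; here the statsd formatter skips singleton data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Union


@dataclass(frozen=True)
class TimeSeriesPoint:
    """One sample of a time-series stat (e.g. message volume)."""
    name: str
    timestamp: datetime
    value: float


@dataclass(frozen=True)
class Singleton:
    """Latest-state-only data (e.g. the most recent compose status)."""
    name: str
    state: dict


# The "common data object": either kind of stat.
Stat = Union[TimeSeriesPoint, Singleton]


def statsd_line(stat: Stat) -> Optional[str]:
    """Format a stat as a statsd gauge line, or None if it is not time-series."""
    if isinstance(stat, TimeSeriesPoint):
        return f"{stat.name}:{stat.value}|g"
    return None  # singleton data is not sent to statsd in this sketch


point = TimeSeriesPoint("volume.total", datetime(2015, 6, 1, tzinfo=timezone.utc), 42)
compose = Singleton("releng.rawhide", {"status": "finished"})
```

A postgres handler would accept both kinds and pick the storage logic per type, which keeps that know-how out of the plugins.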

Full IRC log of the conversation that brought this up:

   relrod │ threebean: does statscache fit in anywhere with the graphite stuff I've been playing with?
   relrod │ Could it use graphite as an output or something maybe?
   relrod │ or statsd
   relrod │ threebean: I guess another way of asking this is: Can plugins be outputs too? Can I write a plugin that says "any time some other plugin stores a stat, toss it to statsd too"?
threebean │ relrod: hm, good idea.
   relrod │ threebean: Just don't want to duplicate effort. Especially for stuff like dashboards (I'm really liking grafana, it's pretty :P)
threebean │ the .. way it's written right now is that each plugin is responsible for shoving its data into postgres.
threebean │ but we could rework.
threebean │ but we could rework that.
threebean │ here's an example:  https://github.com/fedora-infra/statscache_plugins/blob/develop/statscache_plugins/volume/simple.py
   relrod │ threebean: I've been getting really heavily into doing (at least my Haskell projects) like this: Write core stuff as a library; then as part of that library include some common/default
          │ handlers. Then as a "config file" type thing, import that library and write handlers (or use built in ones), then create a mapping between some key and a list of callback handlers.
   relrod │ The reason I mention that is...
   relrod │ I'm wondering if plugins could return some common data type (er, object in this case :P). Then we can have lists of callbacks that each get passed that object, every time a stat is changed
   relrod │ One of those callbacks could go to postgres, one to statsd, etc.
threebean │ yeah, that'd be the way to do it.
   relrod │ (one to fedmsg - we could create an infinite loop :P)
threebean │ superb :p
threebean │ I like it.  let's do it.
threebean │ there's some trickiness in there about an edge case we wanted statscache to handle.
threebean │ see how it can handle updating pre-existing records?  that's one.
threebean │ another oddity is that not all the stats are time-series (i'm assuming that statsd is only for time-series data)
threebean │ the releng plugins in there store only information about the latest state for the composes and stuff.
threebean │ making a common object interface between those two kinds of data (time-series and otherwise) seems like it would be difficult.
threebean │ s/difficult/easy to get wrong/  :)
   relrod │ hm, yeah fair
threebean │ otoh, having statscache stuff in statsd and then grafana would be amazing.
threebean │ we could build app-specific visualizations for custom use-cases like fedora-hubs, but then get generic yet beautiful graphs for ourselves.
   relrod │ threebean: btw, here's an example of what I meant above: https://github.com/fedora-infra/pagure-hook-receiver/blob/master/examples/pagure-test.hs -- notice how projectMapping is keyed on the
          │ pagure project name, then the callbacks can do whatever. I've only recently started doing that kind of development, but I like it a *lot*
   relrod │ It's how the xmonad window manager works too ;)
   relrod │ xmonad is a library, your config file is a Haskell program that makes use of it
threebean │ cool :)
   relrod │ and compiles into a window manager
   relrod │ threebean: but yeah. It sounds like a great usecase for graphite, I think (the time-series ones at least)
threebean │ maybe, hm..
threebean │ maybe we have all plugins return some object representation of their data.. and the calling code dispatches to the postgres store and statsd store and etc..
threebean │ but we only send time-series stuff to statsd
   relrod │ yeah
threebean │ and we have some variety of objects with the know-how to store the different types in postgres.
threebean │ (like we do now -- except that logic is scattered throughout the plugins themselves)
   relrod │ have a type with two constructors, one for time series and one for singleton, and pattern match on the constructor - oh wait, we're not in Haskell. ;)
        * │ threebean tries to file an issue for this
   relrod │ threebean: yeah I think having an object representation is probably better in the long run, because then we can write output plugins ("callbacks" in my terminology above) for things we
          │ haven't even thought of yet.
threebean │ right.  like memcached or websockets.
   relrod │ yep :)

rtnpro commented 9 years ago

:+1: This would be fun, and tricky as well, but it's definitely the way to go. However, we'll still need some kind of datastore in the plugins, so that all state is dumped to the datastore and the apps remain stateless. Then we'll be able to scale easily.

We need to give it more thought, though. I will package the current statscache for deployment first, before jumping into the rewrite.

relrod commented 9 years ago

@rtnpro In my idea above (in the log), dumping data out to the data store would be the responsibility of the callback functions. So the plugins would only ever return a data structure. Importantly, they (as @ralphbean mentioned) would never side-effect to actually store the data somewhere.

So, using the volume/by_topic plugin as an example, we'd have callback functions for each of the following:

The plugin would contain whatever the current handle function does on lines 28 to 32, and then return that result as part of the object structure it returns.

The biggest problem I see here is when a plugin depends on some kind of previously stored data to do its job: for example, a plugin that needs to reach out to the database, grab a number, and add that to some field in whatever triggered the plugin to run. In Haskell, I'd fix this by using the Reader monad, but I don't know the cleanest Python way. However, this also gives preference to one backend store, which seems counterintuitive -- it says "if you need data, get it from this hardcoded source, as opposed to this one over here", which seems bad/limiting. Maybe it's better to enforce that plugins aren't aware of the current datastore state, but can still signal that they want to modify it. So a plugin can say, for example, "I don't know what the current value of the data store is, but I know I want to increment it by 1."

But then we're reinventing statsd.
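
For illustration, the "increment it by 1" idea could look roughly like this. Everything here is an assumption made for the sketch: `Increment`, `by_topic_plugin`, `InMemoryStore`, and `statsd_counter_line` are hypothetical names, and the in-memory dict stands in for a real datastore. The plugin never reads the current value; it only describes the change, and each backend applies that change to its own state.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Increment:
    """A requested change: 'add `amount` to `name`', with no knowledge of the current value."""
    name: str
    amount: int = 1


def by_topic_plugin(message: dict) -> List[Increment]:
    """Hypothetical plugin: signal one increment for the message's topic."""
    return [Increment(name=message["topic"])]


class InMemoryStore:
    """Stand-in for a datastore backend; it owns the current state."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {}

    def apply(self, op: Increment) -> None:
        self.counters[op.name] = self.counters.get(op.name, 0) + op.amount


def statsd_counter_line(op: Increment) -> str:
    """The same operation expressed in statsd's counter line format."""
    return f"{op.name}:{op.amount}|c"


store = InMemoryStore()
sent: List[str] = []
for msg in [{"topic": "git.receive"}, {"topic": "git.receive"}]:
    for op in by_topic_plugin(msg):
        store.apply(op)                       # one backend keeps a running total
        sent.append(statsd_counter_line(op))  # another just forwards the delta
```

The forwarded lines are exactly what a statsd client would emit for a counter, which is the overlap the last remark points at.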