Yelp / firefly

Firefly is a web application aimed at powerful, flexible time series graphing for web developers.
ISC License

returning missing data from a source #34

Open filippog opened 12 years ago

filippog commented 12 years ago

Hi, thanks for firefly, very nice project!

We were experimenting with hooking up firefly to http://opentsdb.net and came across an issue. With opentsdb, when requesting multiple metrics for the same time range, the metrics might not all have a datapoint at exactly the same timestamp. In other words, for a given timestamp the json array might look like e.g. [null, value1, value2].

So in the normal case, e.g.

[{"t": 1347526826, "v": [1083930.26667,639120.0]},
 {"t": 1347526841, "v": [1674841.46667,1189554.4]}]

When adding a different metric that might not have all the points lined up at the same timestamp it'd return e.g.

[{"t": 1347530423, "v": [null,null,413759.4]},
 {"t": 1347530426, "v": [828401.133333,634153.4,null]},
 {"t": 1347530438, "v": [null,null,639049.2]}]
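To make the shape of the problem concrete, here is a minimal Python sketch of how merging per-metric series on the union of their timestamps produces those nulls. The function name `merge_series` and the `{timestamp: value}` input layout are assumptions for illustration, not part of firefly or opentsdb.

```python
def merge_series(series_list):
    """Merge per-metric {timestamp: value} dicts into rows of {"t": ..., "v": [...]}.

    Hypothetical helper: any metric without a point at a timestamp gets None,
    which serializes to null in json.
    """
    timestamps = sorted(set().union(*(s.keys() for s in series_list)))
    return [{"t": t, "v": [s.get(t) for s in series_list]} for t in timestamps]


# Two metrics from one host share a timestamp; the third metric does not.
metric_1 = {1347530426: 828401.133333}
metric_2 = {1347530426: 634153.4}
metric_3 = {1347530423: 413759.4, 1347530438: 639049.2}

rows = merge_series([metric_1, metric_2, metric_3])
```

With these inputs, `rows` has the same layout as the three-row example above.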

This gets firefly quite confused if it has a null datapoint at a given timestamp. What is the right approach here? Interpolate at the data source object/backend and return the interpolated json, or let firefly do the interpolation? firefly.renderer seems to have some support for null datapoints and interpolation, though I'm not sure whether I'm getting it right and/or whether it is functioning.

thanks!

fhats commented 12 years ago

Hi Filippo! Thanks for trying out Firefly - I hope it ends up being as useful for you as it has been for us!

I'm not very familiar with opentsdb, but if I understand what you said correctly, it sounds like you have a data source that does not always have data points for a given stat at a given timestamp. Firefly is designed to handle this case as a discontinuity in the graph (this is the null datapoint support and interpolator stuff you found). In the case of stats that have discontinuities, Firefly will simply not draw data across any contiguous segment of nulls. This means that in your last example, Firefly would draw stats 1 and 2 at their values for t=1347530426, but at the other times would leave stats blank (and if this was the only point with data, I don't think the renderer draws any lines). Is this what you were seeing?

If you expect to see data for stat 3 at e.g. t=1347530426 in your last example, you'll need to return an interpolated value instead of null in the data source. The way Firefly would do this if t=1347530426 did not exist would simply be to do linear interpolation, so that might be the best option for you.
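A rough sketch of what interpolating in the data source could look like, in Python 3. The helper name `interpolate_nulls` is hypothetical and not part of Firefly; it assumes rows are sorted by `t` with unique timestamps and that every row has the same number of values. Nulls before the first or after the last known point of a stat are left as they are, since there is nothing on one side to interpolate from.

```python
def interpolate_nulls(rows):
    """Return new rows with interior None values filled by linear interpolation.

    Assumes rows are sorted by "t", timestamps are unique, and every "v"
    list has the same length. Leading and trailing None values are kept.
    """
    if not rows:
        return []
    times = [row["t"] for row in rows]
    columns = []
    for col in range(len(rows[0]["v"])):
        values = [row["v"][col] for row in rows]
        known = [i for i, v in enumerate(values) if v is not None]
        # Walk each pair of neighbouring known points and fill the gap between them.
        for left, right in zip(known, known[1:]):
            span = times[right] - times[left]
            for i in range(left + 1, right):
                frac = (times[i] - times[left]) / span
                values[i] = values[left] + frac * (values[right] - values[left])
        columns.append(values)
    return [{"t": t, "v": [c[i] for c in columns]} for i, t in enumerate(times)]


rows = [
    {"t": 1347530423, "v": [None, None, 413759.4]},
    {"t": 1347530426, "v": [828401.133333, 634153.4, None]},
    {"t": 1347530438, "v": [None, None, 639049.2]},
]
filled = interpolate_nulls(rows)
```

Here stat 3 gets a value at t=1347530426 that lies one fifth of the way between its two neighbours, while stats 1 and 2 keep their nulls at the edges because each has only a single known point.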

Hope that helps!

filippog commented 12 years ago

Hey Fred, thanks for the quick reply!

That's correct, there might be some discontinuities in the data because, the way opentsdb works, the points might not be recorded at the same second (or minute or greater) boundary.

Here's what I'm seeing with one metric: http://imgur.com/4YVqI

with this json: http://paste.debian.net/189826/

The same metric, plotted together with the one from another host, showing http://imgur.com/cCPES http://imgur.com/khxJ5

with this json: http://paste.debian.net/189825/

I realize the data is very spotty from firefly's point of view but linear interpolation might work just fine

thanks, filippo

imbstack commented 12 years ago

Thanks for finding this!

It appears from the data in http://paste.debian.net/189825/ that each of the data sources never has 2 non-null data points in a row. I suspect this is causing an issue with our rendering of lines.
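The observation can be illustrated with a short Python sketch. This is not Firefly's renderer or d3 code; the helper names are made up, and it only assumes that a line needs at least two consecutive non-null points to be drawn.

```python
def non_null_runs(values):
    """Split a list into runs of consecutive non-None values."""
    runs, current = [], []
    for v in values:
        if v is None:
            if current:
                runs.append(current)
                current = []
        else:
            current.append(v)
    if current:
        runs.append(current)
    return runs


def drawable_segments(values):
    """Keep only runs long enough (two or more points) to form a line."""
    return [run for run in non_null_runs(values) if len(run) >= 2]


# One stat whose points always alternate with nulls, as in the pasted data.
spotty = [413759.4, None, 639049.2, None, 702110.0]
# A stat with one gap but otherwise adjacent points.
mostly_dense = [1.0, 2.0, None, 3.0, 4.0, 5.0]
```

For `spotty`, every run has length one, so `drawable_segments` returns an empty list and no line would appear; `mostly_dense` yields two drawable segments.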

I'm going to add a testing generator that mirrors this case and see if I can get it fixed today.

filippog commented 12 years ago

sweet! thanks for the quick answer.

The explanation makes sense to me, here's another example where segments are drawn instead: http://imgur.com/sGSV2 with this json: http://paste.debian.net/189836/

hope that helps!

imbstack commented 12 years ago

I've created http://bl.ocks.org/3726172, which shows that d3 itself isn't happy about rendering these sorts of lines in the fashion that we are using it. (Notice the lack of a line on that plot). If either of you want to check this out, fork it and try to make it work somehow.

@fxh32 What do you think the best way to deal with this is? The way I see it, the options are

  1. Force data sources to interpolate themselves if their data presents in this manner
  2. Find if we are using defined() or interpolate(linear) incorrectly (this seems most likely)
  3. Patch our d3 fork to handle this case

fhats commented 12 years ago

I think we should go with option 1. I'm not really surprised D3 doesn't display a scatter plot when we expect it to draw lines. You need two points to draw a line and in this case we don't associate two data points separated by a null as being part of a line; rather, they are part of two distinct series with a discontinuity between them.

I think it would be reasonable to either:

imbstack commented 12 years ago

Over Saturday I worked on option 2, and brought back a custom interpolator. It is simply the linear interpolator except if a "line" consists of a single point, it draws what is basically a point. Results

I agree we should probably also allow a data source or graph to specify what sort of interpolation is desired, but this seems to be Doing the Right Thing™ in the case of this sort of data. Simply not presenting it at all is not a good behavior.

So I say we go with

Have data sources interpolate their own values in these cases (and maybe provide tools at the data source layer to make this easy). I like this option because it lets data sources make their own decisions about what a true discontinuity is and when simple linear interpolation (or something more complex?)

@filippog Would you like to get the simple fix I've made for your data source sooner or wait for a more complete solution?

filippog commented 12 years ago

thanks for the quick turnaround on this!

I think the custom interpolator works fine as a simple fix for this case. Note that I originally opened the issue to check whether or not this was the expected/desired behaviour. If it doesn't cause any regression and it doesn't need to be turned off/on (i.e. it doesn't add any modality) I'd say go for it!

I agree with @fxh32 re: providing (basic?) interpolation functions to data sources, I think I'm going to do interpolation on the datasource side for this particular case which should be simple enough.

Slightly unrelated, are there any plans to extend data sources even further to provide "value manipulation" facilities like percentiles in addition to interpolation?

thanks!

fhats commented 12 years ago

Hi, @filippog

@bis12 and I had a conversation out-of-band and decided that making it easy for data sources to interpolate over things we might otherwise consider discontinuities is the correct approach. If you happen to write a linear interpolator for your data source, it would be awesome if you could contrib it back (maybe in util.py?). Since your otsdb data source will probably not be the only data source which has these sorts of behaviours, we are thinking of maybe implementing some sort of data source base class which can take nulls by default and interpolate linearly over them.

Regarding your value manipulation facilities: we use upstream components to perform any additional manipulation of data, but there's no reason your data source couldn't also perform these functions. You could use the list_path method on DataSource to list additional stats for each stat you are capable of reporting (i.e. if you had a stat called otsdb_stat you could list otsdb_stat_50th and otsdb_stat_mean etc...). Then in the data method you would introspect the keys being asked for and perform the computation corresponding to the key in question.
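As a rough illustration of that idea, here is a Python 3 sketch. The class below is a hypothetical stand-in and does not subclass Firefly's real DataSource; the signatures of `list_path` and `data` are simplified assumptions, and `data` returns a single aggregate per key rather than a time series, just to show the key introspection.

```python
import statistics


class DerivedStatSource:
    """Hypothetical stand-in for a data source that exposes derived stats.

    Not Firefly's DataSource: method signatures here are assumptions.
    """

    # Suffix of a derived stat name -> function that computes it.
    SUFFIXES = {"_mean": statistics.mean, "_50th": statistics.median}

    def __init__(self, raw):
        self.raw = raw  # {stat name: list of samples}

    def list_path(self, path=None):
        """List each raw stat followed by the derived stats it can report."""
        names = []
        for stat in sorted(self.raw):
            names.append(stat)
            names.extend(stat + suffix for suffix in sorted(self.SUFFIXES))
        return names

    def data(self, keys):
        """Inspect each requested key and run the matching computation."""
        out = {}
        for key in keys:
            for suffix, func in self.SUFFIXES.items():
                base = key[: -len(suffix)]
                if key.endswith(suffix) and base in self.raw:
                    out[key] = func(self.raw[base])
                    break
            else:
                # No derived suffix matched: return the raw samples.
                out[key] = list(self.raw[key])
        return out


source = DerivedStatSource({"otsdb_stat": [1.0, 2.0, 6.0]})
listed = source.list_path()
computed = source.data(["otsdb_stat_mean", "otsdb_stat_50th", "otsdb_stat"])
```

The suffix table keeps the listing and the computation in one place, so adding another derived stat only means adding one more entry.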

Hope that's helpful! Let us know how it works out for you. If you want to contribute your otsdb datasource back, we'd love to have it in our collection!