brutasse / graphite-api

Graphite-web, without the interface. Just the rendering HTTP API.
https://graphite-api.readthedocs.io
Apache License 2.0

graphite-api not returning data with multiple retentions #165

Open gburson opened 8 years ago

gburson commented 8 years ago

Hello,

I'm really scratching my head here. We've been running a grafana/graphite-api/carbon/whisper stack for a while now and it's working generally ok. However, I've noticed that if I drill into data in grafana, once I get to a certain level of detail, the chart is blank.

Here is some config. Our storage schema stores data at a 10-second interval for 7 days, then at a 1-minute interval for 2 years:

[WebProd]
priority = 90
pattern = ^Production..web._.WebServer.*
retentions = 10s:7d,1m:2y

I can verify this in the whisper files themselves, like this: -

/usr/local/src/whisper/bin/whisper-dump.py /opt/graphite/storage/whisper/Production/Live/web/web2-vm/WebServer/Customer/HPS.wsp | less

Meta data:
  aggregation method: average
  max retention: 63072000
  xFilesFactor: 0

Archive 0 info:
  offset: 40
  seconds per point: 10
  points: 60480
  retention: 604800
  size: 725760

Archive 1 info:
  offset: 725800
  seconds per point: 60
  points: 1051200
  retention: 63072000
  size: 12614400
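As a cross-check on the dump above, the offsets, retentions and sizes can be recomputed from the two retention definitions. This is a minimal sketch, assuming the whisper on-disk layout uses the struct formats `!2LfL` (metadata), `!3L` (archive info) and `!Ld` (data point); the helper name `describe` is hypothetical.

```python
import struct

# Assumed whisper struct layouts (network byte order, standard sizes).
METADATA_SIZE = struct.calcsize("!2LfL")    # aggregation, max retention, xFilesFactor, archive count
ARCHIVE_INFO_SIZE = struct.calcsize("!3L")  # offset, seconds per point, points
POINT_SIZE = struct.calcsize("!Ld")         # timestamp, value

# Taken from the retentions 10s:7d,1m:2y in the schema above.
archives = [
    {"seconds_per_point": 10, "points": 7 * 24 * 3600 // 10},
    {"seconds_per_point": 60, "points": 2 * 365 * 24 * 3600 // 60},
]


def describe(archives):
    """Recompute offset, retention and size for each archive (hypothetical helper)."""
    offset = METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives)
    result = []
    for archive in archives:
        size = archive["points"] * POINT_SIZE
        result.append({
            "offset": offset,
            "retention": archive["seconds_per_point"] * archive["points"],
            "size": size,
        })
        offset += size  # the next archive starts right after this one
    return result


for info in describe(archives):
    print(info)
```

Under those assumptions the computed values agree with the whisper-dump output, which suggests the file itself was created with the intended retentions.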

I've noticed the problem only happens when querying data older than 7 days, i.e. after it has been averaged to a 60-second interval. If I pick a three-minute window older than 7 days and look directly inside the whisper file, it all looks good: -

/usr/local/src/whisper/bin/whisper-fetch.py --from 1454230700 --until 1454230880 /opt/graphite/storage/whisper/Production/Live/web/web2-vm/WebServer/Customer/HPS.wsp

1454230740 8.000000
1454230800 8.700000
1454230860 8.233333

However, if I query through graphite-api, it returns a 10 second interval (the wrong retention period, because I'm querying older than 7 days), and all items (even the ones that match the timestamps above) are null.

http://www.dashboard.com/render?target=Production.Live.web.web2-vm.WebServer.Customer.HPS&from=1454230700&until=1454230880&format=json&maxDataPoints=1000

[{"target": "Production.Live.web.571854-web2-vm.WebServer.Customer.HPS", "datapoints": [[null, 1454230710], [null, 1454230720], [null, 1454230730], [null, 1454230740], [null, 1454230750], [null, 1454230760], [null, 1454230770], [null, 1454230780], [null, 1454230790], [null, 1454230800], [null, 1454230810], [null, 1454230820], [null, 1454230830], [null, 1454230840], [null, 1454230850], [null, 1454230860], [null, 1454230870], [null, 1454230880]]}]
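One way to confirm which retention a render response came from is to look at the spacing between its timestamps and count the nulls. The following is a minimal sketch; the `summarize` helper is hypothetical and the sample string is a shortened stand-in for the response above, with a placeholder target name.

```python
import json

# Shortened stand-in for the JSON returned by /render (placeholder target name).
response = (
    '[{"target": "example.metric", "datapoints": '
    '[[null, 1454230710], [null, 1454230720], [null, 1454230730]]}]'
)


def summarize(series):
    """Report the distinct timestamp steps and the number of null values."""
    points = series["datapoints"]
    timestamps = [ts for _, ts in points]
    steps = sorted({b - a for a, b in zip(timestamps, timestamps[1:])})
    nulls = sum(1 for value, _ in points if value is None)
    return {"steps": steps, "points": len(points), "nulls": nulls}


for series in json.loads(response):
    print(series["target"], summarize(series))
```

A step of 10 on data older than 7 days would indicate that the 10-second archive was read instead of the 60-second one.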

If I go for a wider time span, I start to get data back, but some are null and some are populated. What am I doing wrong?!

Thanks, Glen.

lukyanov commented 8 years ago

I can confirm the issue. While whisper itself returns data as expected, according to the configured retentions, graphite-api only works correctly within the interval of the first retention. Besides the "zooming" described above, it also seems to affect functions like timeShift().

gburson commented 8 years ago

Yes likewise, I've seen other issues now, and I think this is a general bug with graphite-api. I'll see if there is a way of raising a bug with the project.

gburson commented 8 years ago

My team have finally found the cause of this and fixed it in the source, so you can now zoom in on old data. It was a bug in one copy of the whisper code we have: -

/usr/share/python/graphite/lib/python2.7/site-packages/graphite_api/_vendor/whisper.py

The call to read the data from the file had:

diff = untilTime - fromTime
for archive in header['archives']:
    if archive['retention'] >= diff:
        break

this should be

diff = now - fromTime
for archive in header['archives']:
    if archive['retention'] >= diff:
        break

The other copies of whisper.py on the server are all OK. Interestingly, the incorrect one is a later version; the bug seems to have been introduced as a ‘fix’ here https://github.com/graphite-project/whisper/commit/ccd0c89204f2266fa2fc20bad7e49739568086fa , but with no explanation as to why the change was made.

If anyone could shed any light that would be cool!
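The effect of the two `diff` calculations quoted above can be reproduced in isolation. This is a minimal sketch rather than the actual whisper code: `select_archive` and its `use_now` flag are hypothetical, the archive list mirrors the dump earlier in the thread, and `now` is an assumed value 30 days after the start of the query.

```python
# Archives as reported by whisper-dump earlier in the thread.
ARCHIVES = [
    {"secondsPerPoint": 10, "retention": 604800},
    {"secondsPerPoint": 60, "retention": 63072000},
]


def select_archive(archives, from_time, until_time, now, use_now=True):
    """Pick the first archive whose retention covers `diff` (hypothetical helper).

    use_now=True  -> diff = now - from_time        (age of the oldest requested point)
    use_now=False -> diff = until_time - from_time (width of the requested window)
    """
    diff = (now - from_time) if use_now else (until_time - from_time)
    for archive in archives:
        if archive["retention"] >= diff:
            return archive
    return archives[-1]  # fall back to the coarsest archive


from_time, until_time = 1454230700, 1454230880
now = from_time + 30 * 24 * 3600  # assumption: the query is made 30 days later

by_window = select_archive(ARCHIVES, from_time, until_time, now, use_now=False)
by_age = select_archive(ARCHIVES, from_time, until_time, now, use_now=True)
print("window-based:", by_window["secondsPerPoint"], "s per point")
print("age-based:   ", by_age["secondsPerPoint"], "s per point")
```

With the window-based calculation a three-minute query always fits the 7-day archive, so the 10-second archive is read even though it no longer holds data that old, which matches the all-null response seen through graphite-api.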

brutasse commented 8 years ago

Here's a summary of the attempted "fix": https://github.com/graphite-project/whisper/pull/139

I did port it, then reverted it, but there had been no release since the revert.

I have just pushed 1.1.3, which should fix the regression. Let me know how it works for you.

lukyanov commented 8 years ago

@brutasse Could you also update your docker image?