etsy / statsd-jvm-profiler

Simple JVM Profiler Using StatsD and Other Metrics Backends
MIT License
330 stars 93 forks source link

graphite_dump.py isn't liking current stats format #43

Closed Ikariusrb closed 8 years ago

Ikariusrb commented 8 years ago

Running curl to see what's been dumped into graphite shows the following: curl "http://127.0.0.1/metrics/expand?query=stats.gauges.statsd-jvm-profiler.cpu.trace.*&leavesOnly=1"

{"results": ["stats.gauges.statsd-jvm-profiler.cpu.trace.1", "stats.gauges.statsd-jvm-profiler.cpu.trace.72", "stats.gauges.statsd-jvm-profiler.cpu.trace.73", "stats.gauges.statsd-jvm-profiler.cpu.trace.74", "stats.gauges.statsd-jvm-profiler.cpu.trace.75", "stats.gauges.statsd-jvm-profiler.cpu.trace.77", "stats.gauges.statsd-jvm-profiler.cpu.trace.org-apache-commons-daemon-support-DaemonLoader-load-200"]}

Running graphite_dump.py against the same server bails out like this: python2.7 ./graphite_dump.py -o 127.0.0.1 -s 12:40_20160316 -e 12:41_20160316 -p stats.gauges.statsd-jvm-profiler.cpu.trace

Traceback (most recent call last): File "./graphite_dump.py", line 83, in results = get_tree(host, args.prefix, args.start, args.end) File "./graphite_dump.py", line 49, in get_tree (min, max) = get_bounds(host, prefix) File "./graphite_dump.py", line 34, in get_bounds bounds = [int(bound.replace(prefix + '.', '')) for bound in json_results['results']] ValueError: invalid literal for int() with base 10: 'org-apache-commons-daemon-support-DaemonLoader-load-200'

Looking at the code, it's expecting the stats to be prefix + '.' and the rest a number, where the value it's choking on has a "-" for the separator.

ajsquared commented 8 years ago

Thanks!

If you're interested in working on a fix that's always appreciated. Otherwise would you be willing to test out a fix if I stick that on a branch?

Ikariusrb commented 8 years ago

I'm not sure enough of what things are supposed to look like to work on a fix myself, but I'm definitely willing to test out a fix from a branch.

ajsquared commented 8 years ago

Cool, thanks again. I'll update this when there's something available.

Ikariusrb commented 8 years ago

Would graphite_dump be fixed simply by teaching get_bounds to ignore results with a "-" separator, or is there more to the fix than that? If that's all there is to a fix, I can probably implement it myself and submit a pull request- otherwise I don't know what other things may need to happen.

ajsquared commented 8 years ago

I took a quick look and it's actually more complex then that. It used to be the case the profiler emitted a deeply nested tree of metrics, with periods instead of dashes in keys like org-apache-commons-daemon-support-DaemonLoader-load. Now that's now not the case and graphite_dump.py is making a bad assumption about how to get the dummy metrics that represent the bounds. I'll think a bit more on how best to fix this.

Ikariusrb commented 8 years ago

The CPU trace tree still looks deeply nested in my graphite, from the small amount of stats I've collected. Unfortunately, I'm not really certain what it should look like, to spot anything wrong. I'm installing influx to try and get some profiles running and flame graphs in the short term, but I've no plans to tear down the graphite install, so I should be easily able to drop back to that to test your changes.

ajsquared commented 8 years ago

Okay, sounds good. In any case we've found InfluxDB to work a lot better for visualizing this data than Graphite.

ajsquared commented 8 years ago

I've pushed a workaround to the graphite_dump_fix branch.

Ikariusrb commented 8 years ago

Didn't like https://github.com/etsy/statsd-jvm-profiler/blob/graphite_dump_fix/visualization/graphite_dump.py#L33 - dumps out w/

return (min(bounds), max(bounds))
ValueError: min() arg is an empty sequence

I replaced Line 33 w the non-idiomatic but functional:

    for bound in json_results['results']:
      x = bound.replace(prefix + '.', '')
      if x.isdigit():
        bounds.append(int(x))

After that, it got substantially further, but then dumped out @

get_max_metric
    return max([point[0] for point in json_results[0]['datapoints']])
IndexError: list index out of range

I'm just starting to troubleshoot that.

ajsquared commented 8 years ago

Ah yep, your change at line 33 is a better check. I should have been stripping off the prefix before checking isdigit.

Regarding the next error, my first guess would be that there aren't any datapoints for some metric in the time range specified.

Ikariusrb commented 8 years ago

Well, I found a start/end range that eliminated the issues in get_max_metric. I suspect it should probably get a check to see if it successfully fetched any values, and return an indicator which is then checked for before attempting to extend the leaves.

After that, I ended up with None for some of the values in the loop in format_output. I'm not sure why some keys have "None" for the values, but it should be possible to check for None as the value and skip that result.

Thoughts on these suggested adjustments?

ajsquared commented 8 years ago

If those changes are working out for you, that sounds good to me.