etsy / statsd-jvm-profiler

Simple JVM Profiler Using StatsD and Other Metrics Backends
MIT License

Profilers Exit on Exceptions #28

Closed jfenc91 closed 8 years ago

jfenc91 commented 8 years ago

Occasionally I get errors like this from my InfluxDB metrics writer:

```
java.lang.RuntimeException: timeout
    at org.influxdb.impl.InfluxDBErrorHandler.handleError(InfluxDBErrorHandler.java:19)
    at retrofit.RestAdapter$RestHandler.invoke(RestAdapter.java:242)
    at org.influxdb.impl.$Proxy0.writePoints(Unknown Source)
    at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:156)
    at com.etsy.statsd.profiler.reporter.InfluxDBReporter.recordGaugeValues(InfluxDBReporter.java:66)
```

While this is unfortunate, and probably means I need to batch more writes, record less frequently, or scale up InfluxDB, it would be nice if the profilers kept running anyway. This could be done by adding some exception handling here:

https://github.com/etsy/statsd-jvm-profiler/blob/master/src/main/java/com/etsy/statsd/profiler/worker/ProfilerWorkerThread.java#L19
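
For illustration, a minimal sketch of what that handling could look like, assuming the worker is a `Runnable` whose `run()` invokes the profiler once per scheduled tick. The real class lives in the linked file; the `profileAction` and `lastError` names here are hypothetical stand-ins:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not the project's actual ProfilerWorkerThread.
public class ProfilerWorkerThread implements Runnable {
    private final Runnable profileAction; // stands in for the profiler's profile() call
    private final AtomicReference<Exception> lastError = new AtomicReference<>();

    public ProfilerWorkerThread(Runnable profileAction) {
        this.profileAction = profileAction;
    }

    @Override
    public void run() {
        try {
            profileAction.run();
        } catch (Exception e) {
            // Record the failure instead of letting it propagate: an uncaught
            // exception cancels a task scheduled with scheduleAtFixedRate,
            // which is why the profiler currently stops.
            lastError.set(e);
        }
    }

    // Expose the most recent failure so it can be surfaced elsewhere.
    public Exception getLastError() {
        return lastError.get();
    }
}
```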

Also, nice work guys. This is a super awesome project!

ajsquared commented 8 years ago

I have run into this before, and it always seemed to be related to trying to send too many metrics at once. We can definitely add some exception handling, though it would mean losing the profiling data from any write that fails.

Just to confirm: are you using at least version 0.8.3? We had sporadic problems with this error that https://github.com/etsy/statsd-jvm-profiler/commit/e839a2c9beec831fc49fcb5a39535d46f094c9bf fixed entirely.

jfenc91 commented 8 years ago

I was using the latest version, at commit 2dfd5171eb89deee9e0b704aecf9114612995177. It seems to happen with both the memory and CPU profilers. I am also using this with an Elasticsearch cluster, so I need it to keep running much longer than a map/reduce task would.

I guess this comes down to the ideal failure behavior. I would rather have a continual best-effort approach to these metrics, but maybe that is not for everyone, especially while CPU profiling. For now, I have forked the project and added an HTTP endpoint to track errors (sketched below). I just wanted to post something here in case other people run into the same issue.
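
As an illustration of that approach, here is a minimal sketch of such an error-tracking endpoint using the JDK's built-in `com.sun.net.httpserver` server; the class name, port, and `/errors` path are illustrative assumptions, not what the fork actually uses:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of an error-tracking endpoint.
public class ProfilerErrorEndpoint {
    private static final AtomicLong errorCount = new AtomicLong();

    // Called from a catch block whenever a metrics write fails.
    public static void recordError() {
        errorCount.incrementAndGet();
    }

    // Serve the current error count as JSON, e.g. GET http://host:8080/errors
    public static void start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/errors", exchange -> {
            byte[] body = ("{\"errors\": " + errorCount.get() + "}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```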

ajsquared commented 8 years ago

Ah, that makes sense. We can definitely support what you're looking for. Thinking about it some more, that behavior is a sensible default, and if someone prefers to have the profiler fail when metrics can't be written to InfluxDB, they can enable that with an option (something like the sketch below).
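
For instance, a minimal sketch of that opt-in fail-fast flag, assuming a hypothetical `failOnReporterError` option; none of these names come from the project's actual configuration:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the opt-in behavior discussed above.
public class GuardedProfileTask implements Runnable {
    private final Runnable task;               // the underlying profiling work
    private final boolean failOnReporterError; // hypothetical option name
    private final AtomicLong errorCount = new AtomicLong();

    public GuardedProfileTask(Runnable task, boolean failOnReporterError) {
        this.task = task;
        this.failOnReporterError = failOnReporterError;
    }

    @Override
    public void run() {
        try {
            task.run();
        } catch (RuntimeException e) {
            if (failOnReporterError) {
                throw e; // opt-in: preserve the current fail-fast behavior
            }
            // Default: record the failure and keep profiling, best effort.
            errorCount.incrementAndGet();
        }
    }
}
```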

Would you be interested in contributing back that endpoint?

jfenc91 commented 8 years ago

Sure! I'll need to clean it up a bit, but I'll send a pull request your way in the next few days. Thanks!

ajsquared commented 8 years ago

Sorry for the delay; I should be able to look at adding better exception handling shortly.

jfenc91 commented 8 years ago

Sorry for letting you down here. I haven't found the time to get back to this yet, but I will open a pull request based on what I am currently running.