Closed nguillaumin closed 10 years ago
Is this issue still occurring? I haven't heard any other reports of this and have been unable to reproduce.
Hi, sorry for the delay.
I'm running 3.0.1 with this issue, do you have any reason to believe it would be fixed by a change in 3.0.2?
I'll try to upgrade and let you know.
I upgraded to 3.0.2 but unfortunately that's still happening.
I agree it's difficult to reproduce, that might happen only under stress as I can't reproduce it either on my local development environment.
One thing I noticed by increasing logging is that there are some connection problems to my Graphite server from time to time:
2014-04-01 15:42:48,285 [metrics-graphite-reporter-thread-1] WARN graphite.GraphiteReporter - Unable to report to Graphite
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at java.net.Socket.connect(Socket.java:528)
at java.net.Socket.<init>(Socket.java:425)
at java.net.Socket.<init>(Socket.java:241)
at javax.net.DefaultSocketFactory.createSocket(SocketFactory.java:277)
at com.codahale.metrics.graphite.Graphite.connect(Graphite.java:71)
at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:153)
at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:116)
at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:87)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Maybe there's an error handling issue here that would corrupt subsequent data submissions?
Looking at GraphiteReporter.report()
, it might be possible that graphite.connect();
blocks while the connection is established. If that takes more than the reporting interval (1mn in my case) then another thread would call report()
again, possibly stepping on the toes of the previous one?
The ScheduledReporter (superclass of the GraphiteReporter) uses a single-threaded scheduled executor, making that impossible. The only way that could be possible is if you're reusing the Graphite instance.
@nguillaumin You might want to check the data as they arrive to the server with f.ex. graphite-server$ sudo ngrep -d any port 2003
(provided Carbon runs at 2003; add more filtering to limit it further, f.ex. sudo ngrep -d any <substring expected in the packet> port 2003
).
Thanks for the suggestion, that's a good idea, however I think I already achieved the same effect by looking at the carbon-cache creates.log
that shows what Graphite is receiving, and that seem to be corrupted (see the initial description of the bug report)
I tried it anyway with tcpdump
and it seem to confirm that. See for example this batch of metrics that were submitted, they're mostly fine but one of them is corrupted:
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-Java-Encodings._default.total-matching.p98 2.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-Java-Encodings._default.total-matching.p99 2.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-Java-Encodings._default.total-matching.p999 2.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.count 1 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.max 31 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.mean 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.min 31 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.stddev 0.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p50 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p75 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p95 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p98 31.00 1^@^@ ^@Free4
auto-w0ÅOþ^Cøÿÿ^P^F>^E<80>úÿÿernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p99 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p999 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.total-matching.count 1 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.total-matching.max 31 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.total-matching.mean 31.00 1399380474
A couple of other examples:
auto-win2012-64bit.modernui.Public.collection.Se2-MultiServer-auto-win2012-64bit._default.queries.m5_rate 0.00 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2-MultiServer-auto-win2012-64bit._default.queries.m15_rate 0.00 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2-MultiServer-auto-win2012-64bit._default.queries.mean_rate 0.00 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2^@^@Ð^@FreerverMeta0ÈOþ^Cøÿÿ^P¢'^F<80>úÿÿbit._default.queries.count 2 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2-MultiServerMeta-auto-win2012-64bit._default.queries.m1_rate 0.00 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2-MultiServerMeta-auto-win2012-64bit._default.queries.m5_rate 0.00 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2-MultiServerMeta-auto-win2012-64bit._default.queries.m15_rate 0.00 1399385094
auto-centos5-64bit.modernui.Admin.collection.Se2-MediatorPushInstantUpdate-auto-centos5-64bit._default.queries.m5_rate 0.00 1399385204
auto-centos5-64bit.modernui.Admin.collection.Se2-MediatorPushInstantUpdate-auto-centos5-64bit._default.queries.m15_rate 0.00 1399385204
auto-centos5-64bit.moderntìhS¦V^L^@<8a>^\^@^@<8a>^\^@^@^@%dNsä^@^L)^K<94>²^H^@E^@^\|Z @^@@^F ÅÀ¨^Q#À¨^Q:Üù^GÓâÑ(n^Cß<91>¦<80>^P^@.À^\^@^@^A^A^H
.@ïiJ[cui.Admin.collection.Se2-MediatorPushInstantUpdate-auto-centos5-64bit._default.queries.mean_rate 0.00 1399385204
auto-centos5-64bit.modernui.Admin.collection.Se2-Meta-Hidden-ManageCollectionOptions._default.queries.count 2 1399385204
auto-centos5-64bit.modernui.Admin.collection.Se2-Meta-Hidden-ManageCollectionOptions._default.queries.m1_rate 0.00 1399385204
auto-centos5-64bit.modernui.Admin.collection.Se2-Meta-Hidden-ManageCollectionOptions._default.queries.m5_rate 0.00 1399385204
auto-centos5-64bit.modernui.Admin.collection.Se2-Meta-Hidden-ManageCollectionOptions._default.queries.m15_rate 0.00 1399385204
I'm seeing that on our Graphite test server. Since we've deployed it on production and everything works fine, so it must be something with the test environment. I rebuilt the server on a different hardware and it was fine for a couple of weeks but now I'm seeing corruption again.
In any case it doesn't seem to be related to the Metrics library, probably has more to do with the Graphite setup, so I'm happy for this one to be closed.
I have had the same issue, and the cause was that I used the same socket for multiple graphite reporters, and thus different reported metrics got mixed up, the resolution was to use different socket per reporter
Hi,
Since I upgraded from 2.2.0 to 3.0.1 I see corrupted data in on my Graphite storage folder:
Those folders are supposed to be host names like
auto-win2003_company_com
, and then in these hostnames there are various metrics. As you can see some folders are correct, but some other are completely corrupted. I tried to clean up the data, but such corrupted folders keep re-appearing.I did not observe this behavior at all with 2.2.0, and I'm not aware of any change on the Graphite server that could explain it.
The code I'm using to initialize the reporter is:
I wasn't specifying a charset initially, I tried with US-ASCII after I read #441, but it didn't help.
Looking at the
creates.log
on my Graphite server I can see some of those corrupted names coming in:That seems to suggest that the problem is coming from the metric sent, not from Graphite itself. It seem to affect different metrics, of different types.
I'm not sure what to do next to try to diagnose this issue?
Thanks,
Nico