dropwizard / metrics

:chart_with_upwards_trend: Capturing JVM- and application-level metrics. So you know what's going on.
https://metrics.dropwizard.io
Apache License 2.0
7.84k stars 1.81k forks source link

Corrupted data with the GraphiteReporter #478

Closed nguillaumin closed 10 years ago

nguillaumin commented 11 years ago

Hi,

Since I upgraded from 2.2.0 to 3.0.1 I see corrupted data in on my Graphite storage folder:

[root@graphite whisper]# ls
0???????0???????it                        auto-w0???????0???????ernui
08i?????08i?????it                        auto-w0hQ)????0hQ)????ernui
0????????C??????it                        auto-w0?????????v?????ernui
0?)i?????3??????it                        auto-wi0???????0???????rnui
0??v?????c??????it                        auto-wi08i?????08i?????rnui
a0???????0???????t                        auto-wi0????????`S?????rnui
a?2i??????2i?????t                        auto-wi0??v?????cR?????rnui
aauto-win2012-64bit                       auto-wiauto-win2012-64bit
au0???????0???????                        auto-win2003
au08i?????08i?????                        auto-win2003-1000
au?`5??????S??????                        auto-win2003-900
au?_Q)?????_Q)????                        auto-win2008-1140
aut08i??????S??????modernui               auto-win2008-1240
aut0hQ)????0hQ)????modernui               auto-win2008-64bit
aut0hQ)?????P??????modernui               auto-win2008-mss
auto-0???????0???????dernui               auto-win2012-64auto-win2012-64bit
auto-0hQ)????0hQ)????dernui               auto-win2012-64bit
auto0hQ)?????ç?????odernui                auto-win20ries
auto-0?)i?????3??????dernui               auto-wi??Z?????????????rnui
auto-centos5-64bit_company_com  carbon
auto-centos5-830_company_com    manual-centos5_company_com
auto-centos5-900_company_com    pQueriesReport
auto-centos5_company_com        search-internal_company_com
auto-centos6-1000_company_com   stats
auto-centos6-1140_company_com   thinkpad-x230
auto-centos6-1240_company_com   vmstats
auto-centos6-mss_company_com

Those folders are supposed to be host names like auto-win2003_company_com, and then in these hostnames there are various metrics. As you can see some folders are correct, but some other are completely corrupted. I tried to clean up the data, but such corrupted folders keep re-appearing.

I did not observe this behavior at all with 2.2.0, and I'm not aware of any change on the Graphite server that could explain it.

The code I'm using to initialize the reporter is:

Graphite graphite = new Graphite(
    new InetSocketAddress(graphiteHost, graphitePort),
    SocketFactory.getDefault(), Charset.forName("US-ASCII"));
graphiteReporter = GraphiteReporter.forRegistry(registry)
    .prefixedWith(hostName+"."+MODERNUI_PREFIX)
    .convertRatesTo(TimeUnit.SECONDS)
    .convertDurationsTo(TimeUnit.MILLISECONDS)
    .build(graphite);

graphiteReporter.start(1, TimeUnit.MINUTES);

I wasn't specifying a charset initially, I tried with US-ASCII after I read #441, but it didn't help.

Looking at the creates.log on my Graphite server I can see some of those corrupted names coming in:

10/10/2013 03:52:28 :: new metric autoВOU^B<F8><FF><FF>p<F8><DF>^E<80><FA><FF><FF>odernui.collection.Se2-RunMediatorCommand._default.padre-elapsed-time.mean matched schema default_1min_for_1day
10/10/2013 03:52:28 :: new metric autoВOU^B<F8><FF><FF>p<F8><DF>^E<80><FA><FF><FF>odernui.collection.Se2-RunMediatorCommand._default.padre-elapsed-time.mean matched aggregation schema default_average
10/10/2013 03:52:28 :: creating database file /opt/graphite/storage/whisper/autoВOU^B<F8><FF><FF>p<F8><DF>^E<80><FA><FF><FF>odernui/collection/Se2-RunMediatorCommand/_default/padre-elapsed-time/mean.wsp (archive=[(60, 720), (300, 1152), (1800, 672), (10800, 720)] xff=0.5 agg=average)
...
11/10/2013 06:43:01 :: new metric au0<98>OU^B<F8><FF><FF>0<98>OU^B<F8><FF><FF>.modernui.collection.documentation._default.total-matching.mean matched schema default_1min_for_1day
11/10/2013 06:43:01 :: new metric au0<98>OU^B<F8><FF><FF>0<98>OU^B<F8><FF><FF>.modernui.collection.documentation._default.total-matching.mean matched aggregation schema default_average
11/10/2013 06:43:01 :: creating database file /opt/graphite/storage/whisper/au0<98>OU^B<F8><FF><FF>0<98>OU^B<F8><FF><FF>/modernui/collection/documentation/_default/total-matching/mean.wsp (archive=[(60, 720), (300, 1152), (1800, 672), (10800, 720)] xff=0.5 agg=average)

That seems to suggest that the problem is coming from the metric sent, not from Graphite itself. It seem to affect different metrics, of different types.

I'm not sure what to do next to try to diagnose this issue?

Thanks,

Nico

ryantenney commented 10 years ago

Is this issue still occurring? I haven't heard any other reports of this and have been unable to reproduce.

nguillaumin commented 10 years ago

Hi, sorry for the delay.

I'm running 3.0.1 with this issue, do you have any reason to believe it would be fixed by a change in 3.0.2?

I'll try to upgrade and let you know.

nguillaumin commented 10 years ago

I upgraded to 3.0.2 but unfortunately that's still happening.

I agree it's difficult to reproduce, that might happen only under stress as I can't reproduce it either on my local development environment.

nguillaumin commented 10 years ago

One thing I noticed by increasing logging is that there are some connection problems to my Graphite server from time to time:

2014-04-01 15:42:48,285 [metrics-graphite-reporter-thread-1] WARN  graphite.GraphiteReporter  - Unable to report to Graphite
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at java.net.Socket.connect(Socket.java:528)
        at java.net.Socket.<init>(Socket.java:425)
        at java.net.Socket.<init>(Socket.java:241)
        at javax.net.DefaultSocketFactory.createSocket(SocketFactory.java:277)
        at com.codahale.metrics.graphite.Graphite.connect(Graphite.java:71)
        at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:153)
        at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:116)
        at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:87)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)

Maybe there's an error handling issue here that would corrupt subsequent data submissions?

Looking at GraphiteReporter.report(), it might be possible that graphite.connect(); blocks while the connection is established. If that takes more than the reporting interval (1mn in my case) then another thread would call report() again, possibly stepping on the toes of the previous one?

ryantenney commented 10 years ago

The ScheduledReporter (superclass of the GraphiteReporter) uses a single-threaded scheduled executor, making that impossible. The only way that could be possible is if you're reusing the Graphite instance.

holyjak commented 10 years ago

@nguillaumin You might want to check the data as they arrive to the server with f.ex. graphite-server$ sudo ngrep -d any port 2003 (provided Carbon runs at 2003; add more filtering to limit it further, f.ex. sudo ngrep -d any <substring expected in the packet> port 2003).

nguillaumin commented 10 years ago

Thanks for the suggestion, that's a good idea, however I think I already achieved the same effect by looking at the carbon-cache creates.log that shows what Graphite is receiving, and that seem to be corrupted (see the initial description of the bug report)

I tried it anyway with tcpdump and it seem to confirm that. See for example this batch of metrics that were submitted, they're mostly fine but one of them is corrupted:

auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-Java-Encodings._default.total-matching.p98 2.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-Java-Encodings._default.total-matching.p99 2.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-Java-Encodings._default.total-matching.p999 2.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.count 1 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.max 31 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.mean 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.min 31 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.stddev 0.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p50 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p75 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p95 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p98 31.00 1^@^@ ^@Free4
auto-w0ÅOþ^Cøÿÿ^P^F>^E<80>úÿÿernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p99 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.padre-elapsed-time.p999 31.00 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.total-matching.count 1 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.total-matching.max 31 1399380474
auto-win2012-64bit.modernui.Public.collection.Se2-PublicUI-QueryCompletion._default.total-matching.mean 31.00 1399380474

A couple of other examples:

auto-win2012-64bit.modernui.Public.collection.Se2-MultiServer-auto-win2012-64bit._default.queries.m5_rate 0.00 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2-MultiServer-auto-win2012-64bit._default.queries.m15_rate 0.00 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2-MultiServer-auto-win2012-64bit._default.queries.mean_rate 0.00 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2^@^@Ð^@FreerverMeta0ÈOþ^Cøÿÿ^P¢'^F<80>úÿÿbit._default.queries.count 2 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2-MultiServerMeta-auto-win2012-64bit._default.queries.m1_rate 0.00 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2-MultiServerMeta-auto-win2012-64bit._default.queries.m5_rate 0.00 1399385094
auto-win2012-64bit.modernui.Public.collection.Se2-MultiServerMeta-auto-win2012-64bit._default.queries.m15_rate 0.00 1399385094
auto-centos5-64bit.modernui.Admin.collection.Se2-MediatorPushInstantUpdate-auto-centos5-64bit._default.queries.m5_rate 0.00 1399385204
auto-centos5-64bit.modernui.Admin.collection.Se2-MediatorPushInstantUpdate-auto-centos5-64bit._default.queries.m15_rate 0.00 1399385204
auto-centos5-64bit.moderntìhS¦V^L^@<8a>^\^@^@<8a>^\^@^@^@%dNsä^@^L)^K<94>²^H^@E^@^\|Z @^@@^F ÅÀ¨^Q#À¨^Q:Üù^GÓâÑ(n^Cß<91>¦<80>^P^@.À^\^@^@^A^A^H
.@ïiJ[­cui.Admin.collection.Se2-MediatorPushInstantUpdate-auto-centos5-64bit._default.queries.mean_rate 0.00 1399385204
auto-centos5-64bit.modernui.Admin.collection.Se2-Meta-Hidden-ManageCollectionOptions._default.queries.count 2 1399385204
auto-centos5-64bit.modernui.Admin.collection.Se2-Meta-Hidden-ManageCollectionOptions._default.queries.m1_rate 0.00 1399385204
auto-centos5-64bit.modernui.Admin.collection.Se2-Meta-Hidden-ManageCollectionOptions._default.queries.m5_rate 0.00 1399385204
auto-centos5-64bit.modernui.Admin.collection.Se2-Meta-Hidden-ManageCollectionOptions._default.queries.m15_rate 0.00 1399385204
nguillaumin commented 10 years ago

I'm seeing that on our Graphite test server. Since we've deployed it on production and everything works fine, so it must be something with the test environment. I rebuilt the server on a different hardware and it was fine for a couple of weeks but now I'm seeing corruption again.

In any case it doesn't seem to be related to the Metrics library, probably has more to do with the Graphite setup, so I'm happy for this one to be closed.

enaggar commented 5 years ago

I have had the same issue, and the cause was that I used the same socket for multiple graphite reporters, and thus different reported metrics got mixed up, the resolution was to use different socket per reporter