OpenTSDB / opentsdb

A scalable, distributed Time Series Database.
http://opentsdb.net
GNU Lesser General Public License v2.1

Busy-wait bug in v1.1 client input threads #120

Closed Rubberman closed 11 years ago

Rubberman commented 12 years ago

Testing OpenTSDB v1.1 on a Cloudera CDH4 cluster in our performance lab (running HBase 0.92.1), we were pushing data at high rates (5,000-10,000 points/sec per thread). Normally each thread runs at 8-17% of a CPU core (depending on the rate), but when there is a momentary delay from HBase (due to data flushing or minor compaction activity), or a lack of HBase RPC threads, the TSDB client thread jumps to 100% of a CPU core and the rate at which it accepts data drops to about 1/5 of the desired speed. It never recovers until the client pushing the data disconnects from TSDB.

I removed netcat from the stream and implemented a method in my load-testing tool to detect this behavior and then disconnect and reconnect the socket to TSDB. That does allow the data rates to be maintained. The testing tool makes sure it only sends data at the specified rate, sleeping between messages as necessary. When the average time to build and send a message (using blocking writes to TSDB) exceeds the specified time for more than 100 sequential messages, it assumes TSDB is in the busy-wait state, and only then does it reset the socket. It seems to be very effective. Running 4 threads (2 each to 2 TSDB instances running on separate servers) at 5,000 points per second each overnight, we had a total of only 8 such events, and the system was still stable at the end of the test. Each thread handled over 265M data points in 14 3/4 hours.
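Below is a minimal, hypothetical Java sketch of the workaround described above. The class name, per-message timing, and threshold handling are assumptions (the original tool averages the build-and-send time rather than timing each write individually, and its actual code is not shown in this thread); it only illustrates the idea of resetting the socket after more than 100 consecutive over-budget sends.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical sketch: push telnet-style "put" lines to a TSD over a blocking
 * socket, and if sends stay over the per-message time budget for more than 100
 * consecutive messages, assume the TSD input thread is busy-waiting and reset
 * the connection.
 */
public class TsdbSender {
    private static final int SLOW_MESSAGE_THRESHOLD = 100;

    private final String host;
    private final int port;
    private final long budgetNanosPerMessage;  // e.g. 200_000 ns for 5,000 points/sec
    private Socket socket;
    private OutputStream out;
    private int consecutiveSlowMessages = 0;

    public TsdbSender(String host, int port, long budgetNanosPerMessage) throws IOException {
        this.host = host;
        this.port = port;
        this.budgetNanosPerMessage = budgetNanosPerMessage;
        connect();
    }

    private void connect() throws IOException {
        socket = new Socket();
        socket.connect(new InetSocketAddress(host, port));
        out = socket.getOutputStream();
    }

    /** Sends one "put" line and applies the congestion heuristic. */
    public void send(String putLine) throws IOException {
        long start = System.nanoTime();
        out.write((putLine + "\n").getBytes(StandardCharsets.UTF_8));
        out.flush();
        long elapsed = System.nanoTime() - start;

        if (elapsed > budgetNanosPerMessage) {
            if (++consecutiveSlowMessages > SLOW_MESSAGE_THRESHOLD) {
                // Assume the TSD is stuck in its busy-wait state: reset the socket.
                socket.close();
                connect();
                consecutiveSlowMessages = 0;
            }
        } else {
            consecutiveSlowMessages = 0;  // Back under budget; reset the counter.
        }
    }
}
```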

We used to have a similar problem with our Jetty-based application servers under load: a socket would get into a half-closed state that the software could not detect. Jetty would continue to think the socket had data, but it would only read random gunk that it passed back to our data handler, endlessly. We were able to write code to detect that and then close the offending socket. I don't know whether this is related to this bug, but I wanted to mention it in case it helps isolate the problem. If necessary, we will fix it ourselves before we go into production. For testing purposes, I am implementing my workaround in our data collectors/filters so we can start using OpenTSDB in our performance and other testing labs, as well as begin our data center deployment planning.

tsuna commented 11 years ago

When you see the problem, can you run jstack -l <pid-of-tsd> 2-3 times in a row and share the output? This could help pinpoint where the busy-wait loop is.

Rubberman commented 11 years ago

I'll try that. I can reproduce it at will. I can change one collector to not use the condition-detection code (a one-parameter change in its config file) and then blast a day's worth of sar data at it... :-) I should be able to do that tomorrow. I'll send you the output of jstack.

-Bill

Rubberman commented 11 years ago

Anything else I can do to help you drill into this?

-BB

tsuna commented 11 years ago

Can you also share which JVM version and OS you're running on?

Rubberman commented 11 years ago

The JVM is the one installed by the Cloudera manager/installer. The OS is CentOS 6.3, running the latest kernel. This happens identically on every system and JVM version we have tried, including RHEL 5.x and 6.x systems.

-Bill

tsuna commented 11 years ago

I just committed tsuna/opentsdb@9687ae9613c70849d669e9284694ee2abb2b2138 in my repo. Can you please tell me whether you're able to reproduce the problem with that branch?

Do you have clients that are connecting over localhost? If not over localhost, what's your MTU size?

Using the jpmp script (http://blog.tsunanet.net/2010/08/jpmp-javas-poor-mans-profiler.html), can you run jpmp <pid-of-TSD> 20 0.1 and share the output? It would help us see whether your CPU-bound threads are often in DelimiterBasedFrameDecoder.indexOf(). I've identified a performance problem with the implementation of DelimiterBasedFrameDecoder, which I fixed in my branch; as the commit says, I'm seeing 10x to 20x better performance now.
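For context, here is a self-contained illustration of one way a delimiter-based frame decoder can degrade badly. This is only a sketch of the general problem class, not the actual Netty, asynchbase, or patched OpenTSDB code, and the class and method names are invented: if the decoder re-scans the whole accumulation buffer for the delimiter every time a new chunk arrives, the total work grows quadratically with the amount of buffered data, whereas tracking how far it has already scanned keeps it linear.

```java
/**
 * Illustration only (not the real decoder code): compare re-scanning the whole
 * accumulated buffer for a delimiter on every incoming chunk with resuming the
 * search from where the previous scan stopped.
 */
public class DelimiterScanDemo {

    /** Naive: search the entire buffer each time a chunk arrives (O(n) per chunk, O(n^2) overall). */
    static int scanFromStart(CharSequence buffer, char delimiter) {
        for (int i = 0; i < buffer.length(); i++) {
            if (buffer.charAt(i) == delimiter) {
                return i;
            }
        }
        return -1;
    }

    /** Better: resume the search at the offset already examined (each byte is looked at once). */
    static int scanFrom(CharSequence buffer, char delimiter, int start) {
        for (int i = start; i < buffer.length(); i++) {
            if (buffer.charAt(i) == delimiter) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simulate a frame that arrives in 10,000 small chunks before the newline shows up.
        StringBuilder buffer = new StringBuilder();
        long naiveBytesExamined = 0;
        long trackedBytesExamined = 0;
        int scannedUpTo = 0;

        for (int chunk = 0; chunk < 10_000; chunk++) {
            buffer.append("0123456789");  // 10 bytes, still no delimiter.

            scanFromStart(buffer, '\n');
            naiveBytesExamined += buffer.length();                   // Re-examines everything.

            scanFrom(buffer, '\n', scannedUpTo);
            trackedBytesExamined += buffer.length() - scannedUpTo;   // Only the new bytes.
            scannedUpTo = buffer.length();
        }

        System.out.println("naive bytes examined:   " + naiveBytesExamined);
        System.out.println("tracked bytes examined: " + trackedBytesExamined);
    }
}
```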

Rubberman commented 11 years ago

I will as soon as I can - tonight probably. Been working on other problems today and have a meeting soon. Will let you know.

-Bill

spahl commented 11 years ago

I had the same problem. I tried this patch and it's all gone now.

Thanks!

tsuna commented 11 years ago

I filed an issue with Netty (netty/netty#759) and provided a patch. Hopefully this improvement can make it upstream.

Rubberman commented 11 years ago

Excellent. When I get unburied from my current project, this will be top on my list to test. Thanks!

KalpanaChinnathambi commented 11 years ago

Good. Let me know how it goes and how you tested it.

tsuna commented 11 years ago

This issue is considered fixed. If you think you're seeing something similar, please make sure you're using the latest version of asynchbase (v1.4.1 was released yesterday) and file a new issue.

Rubberman commented 11 years ago

Tsuna,

I'm testing it now. It is working (I updated asynchbase to 1.4.1 as recommended) with the latest Cloudera Hadoop release; I think HBase is still at 0.92.1 (I'm using SNAPPY compression). I will be banging on it pretty hard tomorrow to see whether it handles high loads without choking any longer (the Netty fix). I'll let you know. If that works, it will let us really simplify our code, which right now has to deal with the issue by detecting it as "data congestion", dropping the connection, and then reconnecting to TSDB. A kludge, but it lets us push several billion data points per day into a moderate-size Hadoop cluster.

Thanks for your efforts!

-Bill Boyle