drbrain / net-http-persistent

Thread-safe persistent connections with Net::HTTP
http://seattlerb.rubyforge.org/net-http-persistent

Help with "too many connection resets" error #19

Closed softwaregravy closed 12 years ago

softwaregravy commented 12 years ago

My error is getting raised here. https://github.com/drbrain/net-http-persistent/blob/master/lib/net/http/persistent.rb#L821

Full error: "{too many connection resets (due to Timeout::Error - Timeout::Error) after 748740 requests on 69326240, last used 60.068056358 seconds ago\n/app/vendor/bundle/ruby/1.9.1/gems/net-http-persistent-2.5.2/lib/net/http/persistent.rb:821:i

Just trying to understand what I'm doing wrong. I'm probably mis-using the library somehow.

Do I need to be periodically closing connections? Won't they just re-open on an error like this?

Mechanize seems to have a similar issue. It's a long thread; I'm trying to see if there's overlap. https://github.com/tenderlove/mechanize/issues/123

softwaregravy commented 12 years ago

I'm digging deeper into this, and the same jobs take longer on subsequent runs, so maybe I need to be closing connections?

2012-03-19T01:57:17+00:00 app[worker.1]: Published 2767 records in 33.88697671890259 seconds .... 2012-03-19T02:06:01+00:00 app[worker.1]: Published 2767 records in 142.7950940132141 seconds

softwaregravy commented 12 years ago

Found the #shutdown method, which I wasn't using. Adding it to the end of my runs has given me the same 'first run' performance. I did not find the bottleneck. For reference, my jobs run 10 minutes apart. The server I'm connecting to is Tomcat 7, which has connection expiration configured to never.
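
For reference, the pattern that fixed it for me looks roughly like this (the endpoint and record handling are just placeholders for what my job actually does):

require 'net/http/persistent'

uri  = URI.parse 'http://example.com/publish'    # placeholder endpoint
http = Net::HTTP::Persistent.new

records.each do |record|                          # `records` stands in for one batch
  post = Net::HTTP::Post.new uri.path
  post.set_form_data 'data' => record.to_s
  http.request uri, post                          # reuses the pooled connection
end

http.shutdown                                     # close pooled connections at the end of the run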

softwaregravy commented 12 years ago

Closing in favor of https://github.com/drbrain/net-http-persistent/pull/20

softwaregravy commented 12 years ago

PS, thanks for the library!

drbrain commented 12 years ago

Typically the "too many connection resets" error results when the server has shut down the connection and you then try to POST: HTTP does not provide a notification that the connection has terminated, and HTTP does not allow a POST request to be retried by default.

Since you're receiving a Timeout::Error instead, it seems that the server has stopped responding but left the connection open, or perhaps after 748,740 requests you wore the poor server out. (I did not notice the number of requests in the other tickets.)

When you receive other errors like this, are they after similarly large numbers of requests?

Does adding a manual rescue and retry allow your process to continue?
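
By a manual rescue and retry I mean something like this rough sketch (the retry limit is arbitrary):

retries = 0
begin
  response = http.request uri
rescue Net::HTTP::Persistent::Error
  retries += 1
  retry if retries < 3   # give up after a few attempts
  raise
end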

How many HTTP requests do your worker processes typically make?

softwaregravy commented 12 years ago

It's a batch process that runs every 15 minutes or so. It makes tens of thousands of requests.

It was not stopping the process; it was just getting very, very slow.

drbrain commented 12 years ago

Does your worker CPU consumption increase as more requests are made?

If this is the case, it's possible that mechanize, net-http-persistent, or your code is holding on to some state for too long.

If the CPU time stays the same or becomes reduced, you're probably exposing a bug in the server and #shutdown is the best workaround at present.

PS: I had no idea anyone was using net-http-persistent (or mechanize) this heavily! I'm a little glad to hear of a failure at such scale!

softwaregravy commented 12 years ago

Not sure. It's running on Heroku, so I don't monitor CPU, just throughput. My apps have always been governed by the database.

PS good work on building something that scales so well. Can't wait to try Pipelining.

drbrain commented 12 years ago

Does heroku ps show anything useful? (I don't know much about heroku, so there may not be enough information there)

Can you compare logs between the client and server with heroku logs?

… if you're actually killing the server on one connection, I think pipelining is just going to kill it faster ☺

drbrain commented 12 years ago

Also, if you're using mechanize, you should probably disable history with agent.max_history = 0. The latest release of mechanize only stores the latest 50 pages, but earlier versions store unlimited pages which will consume lots of memory.
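
For example, right after creating the agent:

require 'mechanize'

agent = Mechanize.new
agent.max_history = 0   # keep no visited pages in memory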

softwaregravy commented 12 years ago

Nothing useful re: CPU or anything else comes off Heroku. It's a PaaS, so you're not supposed to be able to look under the hood.

The server won't die under this load; it's doing hundreds of requests per second. The bottleneck is definitely on the Ruby side (or the network somehow). The volume from this thread is not significant compared to the load on the server.

Not using Mechanize; it's all API calls.

drbrain commented 12 years ago

I ran 2,000,000 requests using this script through Apache on localhost, retrieving a static page. Apache was configured to keep the persistent connection open forever:


require 'pp'
require 'net/http/persistent'

uri = URI.parse 'http://localhost/'

http = Net::HTTP::Persistent.new

pp GC.stat

1_000_000.times do
  http.request uri
end

pp GC.stat

1_000_000.times do
  http.request uri
end

pp GC.stat

The GC stats were (before, mid, after):

{:count=>11,
 :heap_used=>77,
 :heap_length=>77,
 :heap_increment=>0,
 :heap_live_num=>21434,
 :heap_free_num=>9821,
 :heap_final_num=>22}

{:count=>57088,
 :heap_used=>77,
 :heap_length=>77,
 :heap_increment=>0,
 :heap_live_num=>20742,
 :heap_free_num=>10617,
 :heap_final_num=>0}

{:count=>116494,
 :heap_used=>77,
 :heap_length=>77,
 :heap_increment=>0,
 :heap_live_num=>20741,
 :heap_free_num=>10618,
 :heap_final_num=>0}

So there were 2318 extra GC runs in the second million requests, but no growth in the number of objects Ruby was holding on to, and no growth in memory as reported by Activity Monitor.

I also ran with this patch to net/http/persistent.rb:

diff --git a/lib/net/http/persistent.rb b/lib/net/http/persistent.rb
index a154f98..c9f7e3e 100644
--- a/lib/net/http/persistent.rb
+++ b/lib/net/http/persistent.rb
@@ -789,7 +789,8 @@ class Net::HTTP::Persistent
     connection_id = connection.object_id

     begin
-      Thread.current[@request_key][connection_id] += 1
+      num = Thread.current[@request_key][connection_id] += 1
+      warn "#{num}\t#{connection_id}\t#{Time.now.to_i}" if num % 1_000 == 0
       response = connection.request req, &block

       if connection_close?(req) or

I then ran linear regression on the data and the linear fit line was y = 0.0009x + 1.332E+9 with R-squared of 1.0127, so it seems that your slowdown is either above Net::HTTP::Persistent or on the server side, unless you can think of some better tests for me to run.

(On my laptop, 2,000,000 requests took 1,729 seconds, or 1,157 requests/second. ab -n 100000 -k http://localhost reports 4923 requests/second, a ~4x speedup)

drbrain commented 12 years ago

Also, a scatter plot of the data (2,000 points) appeared linear in Numbers.

softwaregravy commented 12 years ago

Off the top of my head, I'd ask how it behaves with errors sprinkled into the responses? Just 500s 1% of the time or something. Shouldn't change anything theoretically, I know.

I'm connecting to my servers through ElasticLoadBalancer, not directly to Tomcat, so there might be some funkiness going on there. I can't imagine there could be a problem with Tomcat. We're using persistent connections with other services without problems, but none of the other clients are Ruby.

Anecdotally, I'm now doing my calls in runs of 3,000 requests at a time. Since adding the shutdown to the end of my runs, we've seen no latency increase over time -- it's running quite stably now. Adding the call to shutdown was conclusively the fix for our slow-down symptom; I wish I had a better way to help diagnose the underlying problem.

It is also possible that the problem has something to do with the Heroku platform. It's a shared, virtualized environment, so it's possible I was hitting on an edge case around their network controllers or maybe they have some proprietary GC settings.

softwaregravy commented 12 years ago

Also, we had pauses in our runs of maybe 5-10 minutes. Maybe something with reconnecting? When we first started, we were trying to do almost 100K requests per run, and then a few minutes later, in the same thread, the next 100K.

drbrain commented 12 years ago

500s won't make any difference in the behavior since they're not connection-level errors.

Could you run a test with my script and patch above through ElasticLoadBalancer to both a static page and Tomcat, both from heroku and from another host?

If you're using the latest release then, by default, a connection idle for more than 5 seconds will be shut down and recreated, so a few minutes between batch jobs shouldn't allow a connection to stay alive that long.

Also, with the default idle timeout, I'm not sure how you're getting up to 750K requests if you're running batches of 100K with a few minutes in between. I checked, and a connection unused longer than the idle timeout is reset, and the reset method clears the request count, so something isn't adding up here.
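
For reference, the idle timeout is just an accessor on the object if you want to experiment with it (5 seconds is the default):

http = Net::HTTP::Persistent.new
http.idle_timeout = 5   # seconds a connection may sit unused before it is recreated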

I'm glad you have a workaround for this mysterious behavior, but I'm at my limit of guessing (or testing) what is happening in your environment. If you don't wish to invest more effort into debugging this I understand.

mislav commented 12 years ago

We've just been bitten by this error. It doesn't happen in our development or staging (probably due to low-volume API requests) but happens in production.

Our environment:

Because of the number of different subdomains and Passenger workers, I'm guessing that our connection pool grows quite large.

Our NH::Persistent settings are default (idle timeout of 5 seconds).

These are the errors we're getting:

(Net::HTTP::Persistent::Error) "too many connection resets (due to execution expired - Timeout::Error) after 0 requests on 68646300, last used 1351321723.97081 seconds ago"

(Net::HTTP::Persistent::Error) "too many connection resets (due to execution expired - Timeout::Error) after 0 requests on 68335920, last used 1351321724.16949 seconds ago"

(Net::HTTP::Persistent::Error) "too many connection resets (due to execution expired - Timeout::Error) after 0 requests on 70265940, last used 1351311248.40204 seconds ago"

(Net::HTTP::Persistent::Error) "too many connection resets (due to end of file reached - EOFError) after 6 requests on 70259960, last used 4.979216 seconds ago"

The "zillion seconds ago" message looks like a bug in NH::Persistent, because it is the number of seconds since EPOCH.

Can it be that we saturated the API service with open connections?