softwaregravy closed this issue 12 years ago
I'm digging deeper into this, and the same jobs take longer on subsequent runs, so maybe I need to be closing connections?
2012-03-19T01:57:17+00:00 app[worker.1]: Published 2767 records in 33.88697671890259 seconds
...
2012-03-19T02:06:01+00:00 app[worker.1]: Published 2767 records in 142.7950940132141 seconds
Found the #shutdown method, which I wasn't using. Adding that to the end of my runs has given me back the same 'first run' performance. I did not find the bottleneck. For reference, my jobs run 10 minutes apart. The server I'm connecting to is Tomcat 7, which has connection expiration set to never.
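For anyone hitting the same symptom, each run now looks roughly like this (a minimal sketch; the endpoint, records, and form field are hypothetical stand-ins for my job's code):

require 'net/http/persistent'

uri  = URI.parse 'http://example.com/api/records'
http = Net::HTTP::Persistent.new

records = ['first', 'second', 'third']   # placeholder payloads

records.each do |record|
  post = Net::HTTP::Post.new uri.path
  post.set_form_data 'record' => record
  http.request uri, post                 # reuses the persistent connection
end

http.shutdown                            # close the pooled connections at the end of the run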
Closing in favor of https://github.com/drbrain/net-http-persistent/pull/20
PS, thanks for the library!
Typically the "too many connection resets" error results when the server has shut down the connection then you try to POST as HTTP does not provide a notification that the connection has terminated, and HTTP does not allow a POST request to be retried by default.
Since you're receiving a Timeout::Error instead, it seems that the server has stopped responding, but has left the connection open, or perhaps after 748,740 requests you wore the poor server out. (I did not notice the amount of requests in the other tickets.)
When you receive other errors like this are they after equally large amounts of requests?
Does adding a manual rescue and retry allow your process to continue?
How many HTTP requests do your worker processes typically make?
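By a manual rescue and retry I mean something like this minimal sketch (the retry cap and rescued error classes are placeholders; adjust them to whatever fits your job):

require 'net/http/persistent'

uri  = URI.parse 'http://example.com/'
http = Net::HTTP::Persistent.new

retries = 0
begin
  response = http.request uri
rescue Net::HTTP::Persistent::Error, Timeout::Error => e
  retries += 1
  raise if retries > 3                   # give up after a few attempts
  warn "request failed (#{e.message}), retrying"
  retry
end
puts response.code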
It's a batch process that runs every 15 minutes or so, and it makes tens of thousands of requests.
It was not stopping the process; it was just getting very, very slow.
Does your worker CPU consumption increase as more requests are made?
If this is the case, it's possible mechanize, net-http-persistent, or your code are holding on to some state for too long.
If the CPU time stays the same or becomes reduced, you're probably exposing a bug in the server and #shutdown is the best workaround at present.
PS: I had no idea anyone was using net-http-persistent (or mechanize) this heavily! I'm a little glad to hear of a failure at such scale!
Not sure. It's running on Heroku, so I don't monitor CPU, just throughput. My apps have always been governed by the database.
PS: good work on building something that scales so well. Can't wait to try pipelining.
Does heroku ps show anything useful? (I don't know much about Heroku, so there may not be enough information there.)
Can you compare logs between client and server with heroku logs?
… if you're actually killing the server on one connection, I think pipelining is just going to kill it faster ☺
Also, if you're using mechanize, you should probably disable history with agent.max_history = 0. The latest release of mechanize only stores the latest 50 pages, but earlier versions store unlimited pages, which will consume lots of memory.
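That setting looks roughly like this (a minimal sketch, assuming a recent mechanize; the URL is a placeholder):

require 'mechanize'

agent = Mechanize.new
agent.max_history = 0          # keep no visited pages in memory

page = agent.get 'http://example.com/'
puts page.title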
Nothing useful re: CPU or anything else comes out of Heroku. They're a PaaS, so you're not supposed to be able to look under the hood.
The server won't die under this load; it's doing hundreds of requests per second. The bottleneck is definitely on the Ruby side (or somewhere in the networking). The traffic from this thread is not a significant fraction of the load on the server.
Not using mechanize; these are all API calls.
I ran 2,000,000 requests using the script below through Apache on localhost, retrieving a static page. Apache was configured to keep the persistent connection open forever:
require 'pp'
require 'net/http/persistent'

uri = URI.parse 'http://localhost/'
http = Net::HTTP::Persistent.new

pp GC.stat          # before any requests

1_000_000.times do
  http.request uri
end

pp GC.stat          # after the first million requests

1_000_000.times do
  http.request uri
end

pp GC.stat          # after the second million requests
The GC stats were (before, mid, after):
{:count=>11,
:heap_used=>77,
:heap_length=>77,
:heap_increment=>0,
:heap_live_num=>21434,
:heap_free_num=>9821,
:heap_final_num=>22}
{:count=>57088,
:heap_used=>77,
:heap_length=>77,
:heap_increment=>0,
:heap_live_num=>20742,
:heap_free_num=>10617,
:heap_final_num=>0}
{:count=>116494,
:heap_used=>77,
:heap_length=>77,
:heap_increment=>0,
:heap_live_num=>20741,
:heap_free_num=>10618,
:heap_final_num=>0}
So there were 2318 extra GC runs in the second million requests, but no growth in the number of objects Ruby was holding on to, and no growth in memory as reported by Activity Monitor.
I also ran with this patch to net/http/persistent.rb:
diff --git a/lib/net/http/persistent.rb b/lib/net/http/persistent.rb
index a154f98..c9f7e3e 100644
--- a/lib/net/http/persistent.rb
+++ b/lib/net/http/persistent.rb
@@ -789,7 +789,8 @@ class Net::HTTP::Persistent
     connection_id = connection.object_id
 
     begin
-      Thread.current[@request_key][connection_id] += 1
+      num = Thread.current[@request_key][connection_id] += 1
+      warn "#{num}\t#{connection_id}\t#{Time.now.to_i}" if num % 1_000 == 0
       response = connection.request req, &block
 
       if connection_close?(req) or
I then ran linear regression on the data; the linear fit line was y = 0.0009x + 1.332E+9 with an R-squared of 1.0127, so it seems that your slowdown is either above Net::HTTP::Persistent or on the server side, unless you can think of some better tests for me to run.
(On my laptop, 2,000,000 requests took 1,729 seconds, or 1,157 requests/second. ab -n 100000 -k http://localhost reports 4,923 requests/second, a ~4x speedup.)
Also, a scatter plot of the line (2000 data points) appeared linear in Numbers.
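For reference, the fit can be reproduced from the warn output with something like this (a minimal sketch; 'requests.log' is a placeholder for wherever the tab-separated count, connection id, and timestamp lines were captured):

# least-squares fit of timestamp (y) against request count (x)
xs = []
ys = []

File.foreach('requests.log') do |line|
  count, _connection_id, timestamp = line.split("\t")
  xs << count.to_f
  ys << timestamp.to_f
end

n      = xs.size
mean_x = xs.inject(:+) / n
mean_y = ys.inject(:+) / n

sxy = xs.zip(ys).inject(0.0) { |sum, (x, y)| sum + (x - mean_x) * (y - mean_y) }
sxx = xs.inject(0.0) { |sum, x| sum + (x - mean_x) ** 2 }

slope     = sxy / sxx
intercept = mean_y - slope * mean_x

puts "y = #{slope}x + #{intercept}"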
Off the top of my head, I'd ask how it behaves with errors sprinkled into the responses, say 500s 1% of the time or something. It shouldn't change anything theoretically, I know.
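Something like this tiny local endpoint is what I have in mind (a minimal sketch, assuming WEBrick is available; the port and responses are arbitrary):

require 'webrick'

# returns 500 for roughly 1% of requests, otherwise a static body
server = WEBrick::HTTPServer.new Port: 8000
server.mount_proc '/' do |_req, res|
  if rand(100).zero?
    res.status = 500
    res.body   = 'sprinkled error'
  else
    res.body   = 'ok'
  end
end

trap('INT') { server.shutdown }
server.start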
I'm connecting to my servers through ElasticLoadBalancer, not directly to Tomcat, so there might be some funkiness going on there. I can't imagine there could be a problem with Tomcat. We're using persistent connections with other services without problems, but none of the other clients are Ruby.
Anecdotally, I'm now doing my calls in runs of 3,000 requests at a time. Since adding the shutdown to the end of my runs, we've seen no latency increase over time; it's running quite stably now. Adding the call to shutdown was conclusively the fix for our slowdown symptom. I wish I had a better way to help diagnose the problem.
It is also possible that the problem has something to do with the Heroku platform. It's a shared, virtualized environment, so it's possible I was hitting an edge case around their network controllers, or maybe they have some proprietary GC settings.
Also, we had pauses in our runs of maybe 5-10 minutes. Maybe something with reconnecting? When we first started, we were trying to do almost 100K requests per run, and then a few minutes later, in the same thread, we'd try the next 100K.
500s won't make any difference in the behavior since they're not connection-level errors.
Could you run a test with my script and patch above through ElasticLoadBalancer to both a static page and Tomcat, both from heroku and from another host?
If you're using the latest release, then by default a connection that has been idle for more than 5 seconds is shut down and recreated, so a few minutes between batch jobs shouldn't allow a connection to stay live that long.
Also, with the default idle timeout, I'm not sure how you're getting up to 750K requests on one connection if you're running batches of 100K with a few minutes in between. I checked, and a connection unused for longer than the idle timeout is reset, and the reset method clears the request count, so something isn't adding up here.
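For reference, that setting can be inspected or changed like this (a minimal sketch, assuming a release that exposes idle_timeout):

require 'net/http/persistent'

http = Net::HTTP::Persistent.new
puts http.idle_timeout   # 5 by default; a connection idle longer than this is recreated
http.idle_timeout = 30   # keep a connection usable across longer gaps between requests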
I'm glad you have a workaround for this mysterious behavior, but I'm at my limit of guessing (or testing) what is happening in your environment. If you don't wish to invest more effort into debugging this I understand.
We've just been bitten by this error. It doesn't happen in our development or staging (probably due to low-volume API requests) but happens in production.
Our environment:
Because of the number of different subdomains and Passenger workers, I'm guessing that our connection pool grows quite large.
Our NH::Persistent settings are default (idle timeout of 5 seconds).
These are the errors we're getting:
(Net::HTTP::Persistent::Error) "too many connection resets (due to execution expired - Timeout::Error) after 0 requests on 68646300, last used 1351321723.97081 seconds ago"
(Net::HTTP::Persistent::Error) "too many connection resets (due to execution expired - Timeout::Error) after 0 requests on 68335920, last used 1351321724.16949 seconds ago"
(Net::HTTP::Persistent::Error) "too many connection resets (due to execution expired - Timeout::Error) after 0 requests on 70265940, last used 1351311248.40204 seconds ago"
(Net::HTTP::Persistent::Error) "too many connection resets (due to end of file reached - EOFError) after 6 requests on 70259960, last used 4.979216 seconds ago"
The "zillion seconds ago" message looks like a bug in NH::Persistent, because it is the number of seconds since EPOCH.
Can it be that we saturated the API service with open connections?
My error is getting raised here. https://github.com/drbrain/net-http-persistent/blob/master/lib/net/http/persistent.rb#L821
Full error: "{too many connection resets (due to Timeout::Error - Timeout::Error) after 748740 requests on 69326240, last used 60.068056358 seconds ago\n/app/vendor/bundle/ruby/1.9.1/gems/net-http-persistent-2.5.2/lib/net/http/persistent.rb:821:i
Just trying to understand what I'm doing wrong. I'm probably mis-using the library somehow.
Do I need to be periodically closing connections? Won't they just re-open on an error like this?
Mechanize seems to have a similar issue. It's a long thread; I'm trying to see if there's any overlap. https://github.com/tenderlove/mechanize/issues/123