Closed wjordan closed 6 years ago
I don't see an obvious care where we are adding keys to the hash during iteration. The @pool
hash is protected by a mutex and is only enumerated or indexed behind that mutex. The stack trace is terminating inside the hash's block that yields for missing keys.
I can rewrite the code to avoid using the default block for assigning the list. This should be a simple change that I can ship with the next release so you can test the change.
This change should go out with our next release. Once this is live, can you update and let me know if this resolves the issue you've been experiencing?
Thanks for looking into this! I will update as soon as a new release is cut.
Unfortunately since the issue is very intermittent (most recently we saw ~2 weeks between occurrences on our production traffic- we logged 692 errors on 11/4, then 139 errors on 11/17), It will be difficult to let you know if it definitively resolves the issue. However I will be sure to update here if the error ever occurs again after updating.
Going to go ahead and close this for now, but feel free to reopen if you see this again after upgrading to the latest version.
Hi there! I'm a colleague of @wjordan. This issue has recurred for us using aws-sdk-core
2.9.14. Here's a recent stacktrace:
RuntimeError: can't add a new key into hash during iteration
101 # No error raised? Good, check the session into the pool.
102 @pool_mutex.synchronize do
103 @pool[endpoint] = [] unless @pool.key?(endpoint)
104 @pool[endpoint] << session
105 end
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/net_http/connection_pool.rb:103 :in `block in session_for`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/net_http/connection_pool.rb:102 :in `synchronize`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/net_http/connection_pool.rb:102 :in `session_for`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/net_http/handler.rb:119 :in `session`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/net_http/handler.rb:71 :in `transmit`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/net_http/handler.rb:45 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/plugins/content_length.rb:12 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_request_signer.rb:88 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_request_signer.rb:23 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/xml/error_handler.rb:8 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/helpful_socket_errors.rb:10 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_request_signer.rb:65 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_redirects.rb:15 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/retry_errors.rb:88 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_dualstack.rb:32 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_accelerate.rb:49 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_md5s.rb:31 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_expect_100_continue.rb:21 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_bucket_name_restrictions.rb:12 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_bucket_dns.rb:31 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/rest/handler.rb:7 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/user_agent.rb:12 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/plugins/endpoint.rb:41 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/param_validator.rb:21 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/plugins/raise_response_errors.rb:14 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_sse_cpk.rb:19 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_dualstack.rb:24 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/s3_accelerate.rb:34 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:20 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/idempotency_token.rb:18 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/aws-sdk-core/plugins/param_converter.rb:20 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/plugins/response_target.rb:21 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/request.rb:70 :in `send_request`
[GEM_ROOT]/gems/aws-sdk-core-2.9.14/lib/seahorse/client/base.rb:207 :in `block (2 levels) in define_operation_methods`
[...Application code stacktrace...]
Let's reopen and take a look...
Is there any additional information we can provide to help investigate the error? Thanks!
It's just interesting because Trevor's original comment about the mutexes seems to still apply. If you have some sort of scenario that can reliably recreate this (if possible) that may help.
Additionally, how frequently are you seeing this issue currently?
It occurs periodically in bursts. For example, we had this happen not at all for the 2 weeks leading up to June 13. Then it happened 252 times on June 13 and 119 times on June 14. Then not at all for a few days, then 511 times on June 19.
When this does happen, it appears to be in a concentrated burst at a rate of about 1-2 faults per minute. They have all been on the same front end server within a burst as far as I can tell, so something about state of the gem or the server, rather than some externality such as network. (This is running on an EC2 instance within AWS VPC, btw.)
Even with the intermittency, this happens enough that this is presently our top operational issue in terms of faults/week.
This is an issue where I'd encourage an upgrade to V3 of the SDK. It is generally better with concurrency, and while I can't guarantee it fixes this issue, it should make it a lot easier to narrow down with the more detailed stack traces.
Closing soon if this remains inactive, I'd love to know if V3 doesn't resolve this.
We haven't upgraded to V3 yet, but I will follow up here when we do complete an upgrade. This is still affecting our production application- the last burst was 1,232 instances of this error on Apr 29, running version 2.10.79.
Hi there! This issue just occurred again for us using aws-sdk-core
3.22.1, aws-sdk-firehose
1.4.0. We saw a burst of 600 instances of this error in a 24-hour window this weekend, September 15-16.
We're currently investigating available gem upgrades to see if they help:
Here's a recent backtrace:
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/seahorse/client/net_http/connection_pool.rb:107 :in `block in session_for`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/seahorse/client/net_http/connection_pool.rb:106 :in `synchronize`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/seahorse/client/net_http/connection_pool.rb:106 :in `session_for`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/seahorse/client/net_http/handler.rb:121 :in `session`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/seahorse/client/net_http/handler.rb:73 :in `transmit`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/seahorse/client/net_http/handler.rb:47 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/seahorse/client/plugins/content_length.rb:12 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/aws-sdk-core/json/error_handler.rb:8 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/aws-sdk-core/plugins/signature_v4.rb:66 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/aws-sdk-core/plugins/helpful_socket_errors.rb:10 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/aws-sdk-core/plugins/retry_errors.rb:138 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/aws-sdk-core/json/handler.rb:11 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/aws-sdk-core/plugins/user_agent.rb:13 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/seahorse/client/plugins/endpoint.rb:45 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/aws-sdk-core/plugins/param_validator.rb:24 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/seahorse/client/plugins/raise_response_errors.rb:14 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:20 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/aws-sdk-core/plugins/idempotency_token.rb:17 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/aws-sdk-core/plugins/param_converter.rb:24 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/aws-sdk-core/plugins/response_paging.rb:10 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/seahorse/client/plugins/response_target.rb:23 :in `call`
[GEM_ROOT]/gems/aws-sdk-core-3.22.1/lib/seahorse/client/request.rb:70 :in `send_request`
[GEM_ROOT]/gems/aws-sdk-firehose-1.4.0/lib/aws-sdk-firehose/client.rb:1040 :in `put_record`
[Application repo]/lib/cdo/firehose.rb:52 :in `put_record`
Firehose showing up in the backtrace is new, and we're investigating to see if there's something we're doing wrong in our application layer that might cause this, or any infrastructure-level events that might explain it. In the meantime we figured a backtrace from the v3 sdk might be helpful. Thank you!
Let's reopen and take a look then. One other thing that may be relevant, can you tell us a bit more about how you're doing concurrency in your application? It may help us to attempt to reproduce your issue.
@islemaster
Do you have any additional details that would help us reproduce the issue?
Closing this issue. Happy to re-open if there are additional details to reproduce.
We had the same issue, in aws-sdk-core (3.104.4)
We used passenger as web server with smart spawning (which is default). We suspected this issue is due to copy-on-write used in forking process. You can refer this link on how passenger does this. You can try using direct spawning if you have similar setup. In our case we made changes in our code, cleared all connection pool in fork hook and these errors stopped completely
Hi, I just started seeing this issue with version aws-sdk-core (3.180.3)
while ramping up concurrency with puma.
This is the stack trace that we got:
at block in session_for(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/net_http/connection_pool.rb:111)
at synchronize(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/net_http/connection_pool.rb:110)
at session_for(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/net_http/connection_pool.rb:110)
at session(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/net_http/handler.rb:128)
at transmit(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/net_http/handler.rb:76)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/net_http/handler.rb:50)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/plugins/content_length.rb:24)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/plugins/request_callback.rb:87)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-dynamodb-1.93.1/lib/aws-sdk-dynamodb/plugins/crc32_validation.rb:32)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/json/error_handler.rb:10)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/sign.rb:49)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/transfer_encoding.rb:26)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/helpful_socket_errors.rb:12)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/retry_errors.rb:360)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/user_agent.rb:37)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/http_checksum.rb:19)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/endpoint_pattern.rb:30)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/checksum_algorithm.rb:136)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/request_compression.rb:94)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/json/handler.rb:13)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/recursion_detection.rb:18)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-dynamodb-1.93.1/lib/aws-sdk-dynamodb/plugins/endpoints.rb:41)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/endpoint_discovery.rb:84)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/plugins/endpoint.rb:47)
at block in call(/usr/local/bundle/ruby/3.0.0/gems/ddtrace-0.54.2/lib/ddtrace/contrib/aws/instrumentation.rb:21)
at trace(/usr/local/bundle/ruby/3.0.0/gems/ddtrace-0.54.2/lib/ddtrace/tracer.rb:283)
at call(/usr/local/bundle/ruby/3.0.0/gems/ddtrace-0.54.2/lib/ddtrace/contrib/aws/instrumentation.rb:20)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/param_validator.rb:26)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/logging.rb:41)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/plugins/raise_response_errors.rb:16)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-dynamodb-1.93.1/lib/aws-sdk-dynamodb/plugins/simple_attributes.rb:119)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/checksum_algorithm.rb:111)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:16)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/idempotency_token.rb:19)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/param_converter.rb:26)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/plugins/request_callback.rb:71)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/aws-sdk-core/plugins/response_paging.rb:12)
at call(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/plugins/response_target.rb:24)
at send_request(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-core-3.180.3/lib/seahorse/client/request.rb:72)
at describe_table(/usr/local/bundle/ruby/3.0.0/gems/aws-sdk-dynamodb-1.93.1/lib/aws-sdk-dynamodb/client.rb:3008)
Are you still seeing this issue? Are there any ideas on how to fix it?
We had encountered the same issue.
As a result of our investigation, I succeeded in writing a small reproduction code.
require 'aws-sdk-s3'
require 'aws-sdk-dynamodb'
Aws::S3::Client.new.list_buckets
puts Seahorse::Client::NetHttp::ConnectionPool.pools.first.instance_variable_get(:@pool)
# {"https://s3.ap-northeast-1.amazonaws.com"=>[#<Net::HTTP s3.ap-northeast-1.amazonaws.com:443 open=true>]}
Thread.new do
pool = Seahorse::Client::NetHttp::ConnectionPool.pools.first
loop { pool.size } # iterate Seahorse::Client::NetHttp::ConnectionPool @pool forever
end
sleep 0.1 # wait a miniute for the thread to start
fork do
Aws::DynamoDB::Client.new.list_tables
# => can't add a new key into hash during iteration (RuntimeError)
# error occurs when trying `@pool["https://dynamodb.ap-northeast-1.amazonaws.com"] = []`
end
This program raises RuntimeError: can't add a new key into hash during iteration
!
This may be due to a race condition between the parent and child process when using fork(2) system call.
How to fix? Please add a line the begging of the forked process for clearing the seahorse connection pools. Like following:
fork do
Aws.empty_connection_pools!
...
end
Thanks for adding a reproduction for this. I understand this is some long standing bug. I was able to reproduce, but then fix, when I change the size method to not iterate (just make a call to @pool.values.flatten.size
) However I assume this would still happen because there are other methods that iterate the pool with each_pair
, such as empty!
and _clean
. My understanding is, since fork
makes a copy of the mutex, synchronize does not help here. Fork may be copying the state during the iteration.
If it helps, I can push a fix to that size method, but I can't guarantee you will stop seeing this?
On our production application we are receiving intermittent stack traces on S3 requests- the errors arrive in bursts of a couple hundred over an hour or two, then nothing for a few days, then another burst (6 over the last 30 days). I suspect this might be some sort of concurrency issue related to network-error retries somewhere deep in the
seahorse
stack, but could use some help figuring out what exactly is going on and how to fix.Using
aws-sdk-core
2.6.1.Relevant part of the stack trace (
net_http
):block in initialize
yield
block in session_for
synchronize
session_for
session
transmit
call
Full
aws-sdk
stack trace with all handlers:Reproducing the issue with debug logging would be challenging (since the issue only occurs intermittently in production), so I'm hoping to track this down by inspecting the codebase directly.