Closed yorickpeterse closed 7 years ago
@trevorrowe At least on JRuby 1.7 autoload is not thread-safe as far as I'm aware. I believe it wasn't until JRuby 9000 that it was made thread-safe, but @headius can probably fill you in on that. By the looks of it autoload isn't thread-safe on Rubinius either; I'll do some digging to see if there's anything we can do on our end to fix that.
As for the signature errors, the only fix I understand that we've tried that resolves the issue is to put a global mutex around the OpenSSL digest methods. This seems like it shouldn't be necessary. Thoughts?
In comment https://github.com/aws/aws-sdk-ruby/issues/455#issuecomment-64717739 I discussed this and set up a standalone script that couldn't reproduce anything close to the problem discussed in this issue. The code we use for OpenSSL is pulled directly from MRI commit https://github.com/ruby/ruby/commit/5a58165520d5a429ab69f8d6d952a8ff645452bc. I still have to look into updating to the current version, I vaguely remember having problems last time I tried.
Either way, in V1 the problem was that the client/signature signing instances were shared between threads without any synchronisation in place. Is such a pattern still the case with V2?
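For reference, the global-mutex workaround mentioned above could look roughly like the sketch below. The names DIGEST_LOCK and locked_sha256_hexdigest are hypothetical and not part of the AWS SDK; the sketch only shows the idea of serialising digest calls, not how the SDK would actually wire it in.

```ruby
require 'openssl'

# Hypothetical global lock; every digest computation goes through it.
DIGEST_LOCK = Mutex.new

# Hypothetical helper: computes a SHA-256 hex digest while holding the lock,
# so only one thread at a time touches the OpenSSL digest code.
def locked_sha256_hexdigest(data)
  DIGEST_LOCK.synchronize do
    OpenSSL::Digest::SHA256.new.update(data).hexdigest
  end
end

# Compute the same digest from several threads and collect the results.
results = 4.times.map do
  Thread.new { locked_sha256_hexdigest('payload') }
end.map(&:value)

puts results.uniq.length
```

Serialising every digest call this way trades throughput for safety, which is why it reads as a workaround rather than a fix.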
What is not thread-safe about autoload in JRuby 1.7?
Those errors do not necessarily indicate that JRuby's autoload is unsafe. Constants can end up missing during any concurrent require logic, regardless of whether autoload is involved, if code attempts to access classes while they're still booting. This can happen if, for example, defined? is used to determine the existence of a class. This is because class definition is not atomic.
Of course it could also indicate a thread-safety problem in JRuby's autoload, but it's definitely not the first place I'd look.
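The point about class definition not being atomic can be illustrated without any threads. In the sketch below (Widget and SNAPSHOT are made-up names for illustration), the constant is already visible to defined? while the class body is still running, but the method defined further down does not exist yet. Another thread checking defined?(Widget) during that window would see a half-built class.

```ruby
# Hypothetical names; this only demonstrates the ordering inside a class body.
SNAPSHOT = {}

class Widget
  # The constant exists as soon as the class body starts executing...
  SNAPSHOT[:constant_defined] = !defined?(::Widget).nil?
  # ...but methods further down have not been defined yet.
  SNAPSHOT[:method_defined_early] = method_defined?(:run)

  def run
    :ok
  end
end

# Once the body has finished, the method is there.
SNAPSHOT[:method_defined_late] = Widget.method_defined?(:run)

puts SNAPSHOT.inspect
```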
Ok, looking at the code, there's an access of Base there, and Base is defined as an autoload...so there's a better chance this is a bug in autoload in JRuby. I'll see if I can come up with a case that reproduces it.
At least this script https://gist.github.com/YorickPeterse/2efb97451fd27c34aec7 fails on Rubinius with constant missing errors, haven't managed to get it to fail on JRuby yet.
In https://github.com/rubinius/rubinius/commit/b57399f8f44c9afbe5b2785abd85943a8f46ddd3 I took care of the autoloading problems (as far as I can tell). I finally managed to get my hands on a signature error using the scripts discussed in https://gist.github.com/YorickPeterse/2efb97451fd27c34aec7:
An exception occurred running repro.rb:
(Aws::SQS::Errors::SignatureDoesNotMatch)
Backtrace:
Seahorse::Client::Plugins::RaiseResponseErrors::Handler#call at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/seahorse/client/plugins/raise_response_errors.rb:15
Seahorse::Client::Plugins::ParamConversion::Handler#call at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/seahorse/client/plugins/param_conversion.rb:22
Aws::Plugins::ResponsePaging::Handler#call at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/aws-sdk-core/plugins/response_paging.rb:10
Seahorse::Client::Plugins::ResponseTarget::Handler#call at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/seahorse/client/plugins/response_target.rb:18
Seahorse::Client::Request#send_request at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/seahorse/client/request.rb:70
{ } in Aws::SQS::Client#define_operation_methods at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/seahorse/client/base.rb:216
{ } in Object#__script__ at repro.rb:11
Proc#call at kernel/bootstrap/proc.rb:20
Thread#__run__ at kernel/bootstrap/thread.rb:356
This indicates that the script does eventually reproduce the problem given enough time and luck.
In a production application the autoloading problems are indeed resolved, sadly the signature errors remain (when using V2). I'll be looking into this today to see if I can solve this problem for good.
@YorickPeterse In production, are you using static credentials, or are you using refreshing credentials, such as an IAM instance profile, assume role credentials, STS session credentials, etc?
@trevorrowe I've tested this both on an EC2 instance (which uses IAM credentials) and locally (which uses static credentials). I'm currently looking into rubysl-openssl as I suspect it has some problems that might trigger this particular problem. Running my repro script mentioned above with -Xcapi.lock (preventing concurrent C API calls) seems to prevent any errors from occurring, hence my suspicion.
We recently updated the OpenSSL code of Rubinius to match the code of the latest MRI version (rubysl-openssl is basically a 1:1 copy of MRI's OpenSSL code). Having looked at the code of V2 of the AWS SDK I couldn't find a point where mutable state was shared between threads, leading me to believe it might be OpenSSL that's broken.
To confirm/deny this I set up a new standalone script to try and trigger the problem:
require 'openssl'

Thread.abort_on_exception = true

input = 'the cake is possibly a lie'
digest = OpenSSL::Digest::SHA256.new
digest.update(input)
expected = digest.hexdigest

puts 'Starting threads...'

threads = 10.times.map do
  Thread.new do
    loop do
      digest = OpenSSL::Digest::SHA256.new
      digest.update(input)
      got = digest.hexdigest

      if got != expected
        raise "Expected digest #{got.inspect} to equal #{expected.inspect}"
      end
    end
  end
end

threads.each(&:join)
On MRI this will run fine for all eternity. On Rubinius on the other hand this pretty much instantly crashes with the following output:
Starting threads...
An exception occurred running /tmp/openssl_thread.rb:
Expected digest "70b9d40465992fa71a861e050481fb1a4bed5c8aa127272e7bf5838a3b0ab240" to equal "83103cff21a7a50e45eb90c29ca9d24204d95fe7369565ec1d5339750da386ab" (RuntimeError)
Backtrace:
{ } in Object#__script__ at /tmp/openssl_thread.rb:23
Kernel(Object)#loop at kernel/common/kernel.rb:511
{ } in Object#__script__ at /tmp/openssl_thread.rb:16
Proc#call at kernel/bootstrap/proc.rb:20
Thread#__run__ at kernel/bootstrap/thread.rb:356
Interestingly enough the digest appears to always be "70b9d40465992fa71a861e050481fb1a4bed5c8aa127272e7bf5838a3b0ab240". However, when running Rubinius with a CAPI lock (using -Xcapi.lock) the produced digest is "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", though the VM sometimes segfaults in the OpenSSL code.
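As a side note, the second value is the SHA-256 digest of empty input. That would be consistent with hexdigest being called on a digest object that never received the update call, although on its own it does not say why that happened. A quick check, independent of Rubinius:

```ruby
require 'openssl'

# A fresh digest object that has not been fed any data.
empty_digest = OpenSSL::Digest::SHA256.new.hexdigest

puts empty_digest
```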
To cut a long story short, it seems the OpenSSL extension is beyond broken in multi-threaded environments or Rubinius does some weird things to it. I'll continue digging until I'm sure what's to blame. If it turns out to be unrelated to the AWS SDK itself I'll close this issue.
Per "rideliner" from the Rubinius Gitter channel, the above snippet is flawed. The assignment to digest in the Thread.new block does not create a new local variable; it overwrites the outer variable, which all threads share, leading to the race condition. Somehow I expected locals to at least be thread-local (even in this case). I shall get myself a dunce cap.
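A minimal sketch of the script without the shared variable is shown below. It assumes the only change needed is to use a name (thread_digest, chosen here for illustration) that is first assigned inside the block, so each thread gets its own variable. The loop is bounded so the script terminates; the iteration count is arbitrary.

```ruby
require 'openssl'

input = 'the cake is possibly a lie'
expected = OpenSSL::Digest::SHA256.new.update(input).hexdigest

# Each thread returns the number of mismatches it observed.
mismatches = 10.times.map do
  Thread.new do
    count = 0

    1_000.times do
      # thread_digest is first assigned inside the block, so it is local to
      # this block invocation and not shared with the other threads.
      thread_digest = OpenSSL::Digest::SHA256.new
      thread_digest.update(input)

      count += 1 if thread_digest.hexdigest != expected
    end

    count
  end
end.map(&:value)

total_mismatches = mismatches.sum

puts "Mismatches: #{total_mismatches}"
```

Ruby also allows declaring block-local variables explicitly (for example `Thread.new do |; digest|`), which shadows an outer variable of the same name instead of reassigning it.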
Either way, the actual signature problem still persists. I'll continue my investigation, but at least it seems that both rubysl-digest and rubysl-openssl are not the cause for these problems.
@trevorrowe I literally just got this error on MRI 2.2 as well (though this is using V1), which seems to rule out Rubinius itself being the problem. The error/backtrace is as follows (as taken from Rollbar):
AWS::SQS::Errors::SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.
The Canonical String for this request should have been
'POST
/345153512707/qr_generator
content-length:175
content-type:application/x-www-form-urlencoded; charset=utf-8
host:sqs.eu-west-1.amazonaws.com
user-agent:aws-sdk-ruby/1.63.0 ruby/2.2.2 x86_64-linux
x-amz-content-sha256:47236ab18c30db436023205b5b09bd472e527b24ade58bc3acc21ed4a0d6ad4c
x-amz-date:20150716T181615Z
x-amz-security-token:[REMOVED]
content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-security-token
[REMOVED]'
The String-to-Sign should have been
'AWS4-HMAC-SHA256
20150716T181615Z
20150716/eu-west-1/sqs/aws4_request
8b1031f744460308b9295139a5b1c3992e3bf83e66b70b699fd3479fe2d25c19'
Backtrace:
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/core/client.rb" line 375 in return_or_raise
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/core/client.rb" line 476 in client_request
File "(eval)" line 3 in receive_message
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/sqs/queue.rb" line 201 in receive_message
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/sqs/queue.rb" line 303 in block in poll
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/sqs/queue.rb" line 301 in loop
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/sqs/queue.rb" line 301 in poll
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/oni-3.1.0/lib/oni/daemons/sqs.rb" line 34 in receive
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/oni-3.1.0/lib/oni/daemon.rb" line 180 in run_thread
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/oni-3.1.0/lib/oni/daemon.rb" line 164 in block in spawn_thread
For the first time ever I also experienced this issue on JRuby 1.7:
AWS::SQS::Errors::SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.
The Canonical String for this request should have been
'POST
/345153512707/opener-opinion-detector-basic
content-length:192
content-type:application/x-www-form-urlencoded; charset=utf-8
host:sqs.eu-west-1.amazonaws.com
user-agent:aws-sdk-ruby/1.64.0 jruby/1.9.3 java
x-amz-content-sha256:90638f76a52ee75c5210c44a615ba9e75278e3e54da88972f50cb680501308d6
x-amz-date:20150820T230324Z
x-amz-security-token:[REMOVED]
content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-security-token
[REMOVED]'
The String-to-Sign should have been
'AWS4-HMAC-SHA256
20150820T230324Z
20150820/eu-west-1/sqs/aws4_request
25907c2c04c518ad21ab5a3c66f1a31fa3eaacc87c62cc6aecf740cb44710c14'
Backtrace:
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.64.0/lib/aws/core/client.rb" line 375 in return_or_raise
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.64.0/lib/aws/core/client.rb" line 476 in client_request
File "(eval)" line 3 in receive_message
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.64.0/lib/aws/sqs/queue.rb" line 201 in receive_message
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.64.0/lib/aws/sqs/queue.rb" line 303 in poll
File "org/jruby/RubyKernel.java" line 1501 in loop
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.64.0/lib/aws/sqs/queue.rb" line 301 in poll
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/oni-3.1.1/lib/oni/daemons/sqs.rb" line 34 in receive
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/oni-3.1.1/lib/oni/daemon.rb" line 180 in run_thread
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/oni-3.1.1/lib/oni/daemon.rb" line 164 in spawn_thread
Ruby info:
jruby 1.7.16.1 (1.9.3p392) 2014-10-28 4e93f31 on OpenJDK 64-Bit Server VM 1.7.0_85-mockbuild_2015_07_20_19_47-b00 +jit [linux-amd64]
This issue has been inactive for over a year, and V1 is end of life. Happy to continue to look at this, given all the effort in place so far, if we get more information. For now, closing. Again, feel free to reopen if more information comes in.
Since switching a daemon over to Rubinius we've been getting quite a few of the following errors:
The corresponding stack trace:
The root frame of the error is our own code which creates an instance of AWS::SQS and then tries to retrieve a queue by its name. This happens in a threaded environment (10 threads to be exact) under Rubinius. MRI does not (so far) seem to be affected.
It's not exactly clear to me what's causing it. The individual AWS::SQS and related instances are not shared between threads, nor are any of our own AWS related operations.
As a temporary fix I disabled checksum validation for SQS operations but I'd rather not keep that disabled in the long run.
Setup info:
AWS.config
)