AWS::SQS::Errors::SignatureDoesNotMatch errors on Rubinius

yorickpeterse commented 10 years ago

Since switching a daemon over to Rubinius we've been getting quite a few of the following errors:

AWS::SQS::Errors::SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been
'POST
/

content-length:99
content-type:application/x-www-form-urlencoded; charset=utf-8
host:sqs.eu-west-1.amazonaws.com
user-agent:aws-sdk-ruby/1.32.0 rbx/2.1.0 x86_64-linux-gnu
x-amz-content-sha256:5e204b2282fa47a4ecee2d1d985468c5a342980c025cad72584efcc69eb2f938
x-amz-date:20140128T112115Z

content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date
5e204b2282fa47a4ecee2d1d985468c5a342980c025cad72584efcc69eb2f938'

The String-to-Sign should have been
'AWS4-HMAC-SHA256
20140128T112115Z
20140128/eu-west-1/sqs/aws4_request
b511d5ea84ff04bd3445ec8f15fdf21b458a41d7b452a4057018dd66a2e3cfab'
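
For context on the error output above: under AWS Signature Version 4 the String-to-Sign consists of the algorithm name, the request timestamp, the credential scope, and the hex SHA-256 digest of the canonical request. The signature is then derived through a chain of HMAC-SHA256 operations keyed on the secret key. The following is a minimal sketch of that derivation, not the SDK's own code; the method names `string_to_sign` and `sign` and the secret key value are hypothetical.

```ruby
require 'openssl'

# Builds the SigV4 String-to-Sign for a canonical request.
# `timestamp` is the x-amz-date value, e.g. "20140128T112115Z".
def string_to_sign(canonical_request, timestamp:, region:, service:)
  date  = timestamp[0, 8]
  scope = [date, region, service, 'aws4_request'].join('/')

  [
    'AWS4-HMAC-SHA256',
    timestamp,
    scope,
    OpenSSL::Digest::SHA256.hexdigest(canonical_request)
  ].join("\n")
end

# Derives the signing key via the HMAC chain and signs the String-to-Sign.
def sign(secret_key, to_sign, date:, region:, service:)
  hmac = lambda do |key, data|
    OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha256'), key, data)
  end

  k_date    = hmac.call("AWS4#{secret_key}", date)
  k_region  = hmac.call(k_date, region)
  k_service = hmac.call(k_region, service)
  k_signing = hmac.call(k_service, 'aws4_request')

  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('sha256'), k_signing, to_sign)
end

# Hypothetical inputs, for illustration only.
to_sign = string_to_sign(
  "POST\n/\n",
  timestamp: '20140128T112115Z',
  region: 'eu-west-1',
  service: 'sqs'
)

puts to_sign
puts sign('EXAMPLE-SECRET', to_sign, date: '20140128', region: 'eu-west-1', service: 'sqs')
```

If any input to this chain (body digest, headers, timestamp, or credentials) changes between building the canonical request and sending it, the service computes a different signature and rejects the request with SignatureDoesNotMatch.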

The corresponding stack trace:

File "/var/www/review_persister/deploy-2014-01-28_11_20_30/vendor/bundle/rbx/2.1/gems/aws-sdk-1.32.0/lib/aws/core/client.rb" line 368 in return_or_raise
File "/var/www/review_persister/deploy-2014-01-28_11_20_30/vendor/bundle/rbx/2.1/gems/aws-sdk-1.32.0/lib/aws/core/client.rb" line 469 in client_request
File "(eval)" line 3 in get_queue_url
File "/var/www/review_persister/deploy-2014-01-28_11_20_30/vendor/bundle/rbx/2.1/gems/aws-sdk-1.32.0/lib/aws/sqs/queue_collection.rb" line 161 in url_for
File "/var/www/review_persister/deploy-2014-01-28_11_20_30/vendor/bundle/rbx/2.1/gems/aws-sdk-1.32.0/lib/aws/sqs/queue_collection.rb" line 139 in named

The root frame of the error is our own code which creates an instance of AWS::SQS and then tries to retrieve a queue by its name. This happens in a threaded environment (10 threads to be exact) under Rubinius. MRI does not (so far) seem to be affected.

It's not exactly clear to me what's causing it. The individual AWS::SQS and related instances are not shared between threads, nor are any of our own AWS related operations.

As a temporary fix I disabled checksum validation for SQS operations but I'd rather not keep that disabled in the long run.

Setup info:

Ruby: rubinius 2.2.3 (2.1.0 4792e746 2013-12-29 JI) [x86_64-linux-gnu]
AWS SDK version: 1.32.0
Region: eu-west-1 (properly configured using AWS.config)

yorickpeterse commented 9 years ago

@trevorrowe As far as I'm aware, autoload is not thread-safe on JRuby 1.7 at least. I believe it wasn't made thread-safe until JRuby 9000, but @headius can probably fill you in on that. By the looks of it autoload isn't thread-safe on Rubinius either; I'll do some digging to see if there's anything we can do on our end to fix that.

As for the signature errors, the only fix I understand that we've tried that resolves the issue is to put a global mutex around the OpenSSL digest methods. This seems like it shouldn't be necessary. Thoughts?
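
A minimal sketch of what "a global mutex around the OpenSSL digest methods" could look like is shown below. This is an assumption about the shape of such a workaround, not the code that was actually tried; the module name `SynchronizedDigest` and its method are hypothetical.

```ruby
require 'openssl'

# Hypothetical wrapper that serializes all digest computations
# through a single process-wide lock.
module SynchronizedDigest
  LOCK = Mutex.new

  def self.sha256_hex(data)
    LOCK.synchronize { OpenSSL::Digest::SHA256.hexdigest(data) }
  end
end

# Ten threads computing the same digest through the lock.
results = Array.new(10) do
  Thread.new { SynchronizedDigest.sha256_hex('the cake is possibly a lie') }
end.map(&:value)

puts results.uniq.length
```

The trade-off is that every digest computation in the process is serialized, which is why it reads as a workaround rather than a fix.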

In comment https://github.com/aws/aws-sdk-ruby/issues/455#issuecomment-64717739 I discussed this and set up a standalone script that couldn't reproduce anything close to the problem discussed in this issue. The code we use for OpenSSL is pulled directly from MRI commit https://github.com/ruby/ruby/commit/5a58165520d5a429ab69f8d6d952a8ff645452bc. I still have to look into updating to the current version; I vaguely remember having problems the last time I tried.

Either way, in V1 the problem was that the client/signature signing instances were shared between threads without any synchronisation in place. Is such a pattern still the case with V2?

headius commented 9 years ago

What is not thread-safe about autoload in JRuby 1.7?

yorickpeterse commented 9 years ago

@headius see https://github.com/aws/aws-sdk-ruby/issues/455#issuecomment-71468196.

headius commented 9 years ago

Those errors do not necessarily indicate that JRuby's autoload is unsafe. Constants can end up missing during any concurrent require logic, regardless of whether autoload is involved, if code attempts to access classes while they're still booting. This can happen if, for example, defined? is used to determine the existence of a class. This is because class definition is not atomic.

Of course it could also indicate a thread-safety problem in JRuby's autoload, but it's definitely not the first place I'd look.
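
The point about class definition not being atomic can be illustrated with a small single-threaded sketch. The names `OBSERVATIONS` and `HalfLoaded` are made up for the example. The constant becomes visible as soon as the class body starts executing, before any of its methods exist, so another thread checking `defined?` at that moment would see a class it cannot yet use.

```ruby
OBSERVATIONS = []

class HalfLoaded
  # At this point the constant is already defined, but #call is not.
  OBSERVATIONS << [defined?(HalfLoaded), HalfLoaded.method_defined?(:call)]

  def call
    :ok
  end
end

# After the class body has finished, the method is available.
OBSERVATIONS << [defined?(HalfLoaded), HalfLoaded.method_defined?(:call)]

OBSERVATIONS.each { |seen, has_call| puts "#{seen} #{has_call}" }
```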

headius commented 9 years ago

Ok, looking at the code, there's an access of Base there, and Base is defined as an autoload...so there's a better chance this is a bug in autoload in JRuby. I'll see if I can come up with a case that reproduces it.

yorickpeterse commented 9 years ago

At least this script https://gist.github.com/YorickPeterse/2efb97451fd27c34aec7 fails on Rubinius with constant missing errors; I haven't managed to get it to fail on JRuby yet.

yorickpeterse commented 9 years ago

In https://github.com/rubinius/rubinius/commit/b57399f8f44c9afbe5b2785abd85943a8f46ddd3 I took care of the autoloading problems (as far as I can tell). I finally managed to get my hands on a signature error using the scripts discussed in https://gist.github.com/YorickPeterse/2efb97451fd27c34aec7:

An exception occurred running repro.rb:

 (Aws::SQS::Errors::SignatureDoesNotMatch)

Backtrace:

  Seahorse::Client::Plugins::RaiseResponseErrors::Handler#call at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/seahorse/client/plugins/raise_response_errors.rb:15
      Seahorse::Client::Plugins::ParamConversion::Handler#call at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/seahorse/client/plugins/param_conversion.rb:22
                    Aws::Plugins::ResponsePaging::Handler#call at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/aws-sdk-core/plugins/response_paging.rb:10
       Seahorse::Client::Plugins::ResponseTarget::Handler#call at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/seahorse/client/plugins/response_target.rb:18
                        Seahorse::Client::Request#send_request at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/seahorse/client/request.rb:70
              { } in Aws::SQS::Client#define_operation_methods at /home/yorickpeterse/.gem/rbx/2.1.0/gems/aws-sdk-core-2.0.40/lib/seahorse/client/base.rb:216
                                      { } in Object#__script__ at repro.rb:11
                                                     Proc#call at kernel/bootstrap/proc.rb:20
                                                Thread#__run__ at kernel/bootstrap/thread.rb:356

This indicates that the script does eventually reproduce the problem given enough time and luck.

yorickpeterse commented 9 years ago

In a production application the autoloading problems are indeed resolved. Sadly, the signature errors remain (when using V2). I'll be looking into this today to see if I can solve this problem for good.

trevorrowe commented 9 years ago

@YorickPeterse In production, are you using static credentials, or are you using refreshing credentials, such as an IAM instance profile, assume role credentials, STS session credentials, etc?

yorickpeterse commented 9 years ago

@trevorrowe I've tested this both on an EC2 instance (which uses IAM credentials) and locally (which uses static credentials). I'm currently looking into rubysl-openssl as I suspect it has some problems that might trigger this particular problem. Running my repro script mentioned above with -Xcapi.lock (preventing concurrent C API calls) seems to prevent any errors from occurring, hence my suspicion.

yorickpeterse commented 9 years ago

We recently updated the OpenSSL code of Rubinius to match the code of the latest MRI version (rubysl-openssl is basically a 1:1 copy of MRI's OpenSSL code). Having looked at the code of V2 of the AWS SDK I couldn't find a point where mutable state was shared between threads, leading me to believe it might be OpenSSL that's broken.

To confirm/deny this I set up a new standalone script to try and trigger the problem:

require 'openssl'

Thread.abort_on_exception = true

input  = 'the cake is possibly a lie'
digest = OpenSSL::Digest::SHA256.new

digest.update(input)

expected = digest.hexdigest

puts 'Starting threads...'

threads = 10.times.map do
  Thread.new do
    loop do
      digest = OpenSSL::Digest::SHA256.new
      digest.update(input)

      got = digest.hexdigest

      if got != expected
        raise "Expected digest #{got.inspect} to equal #{expected.inspect}"
      end
    end
  end
end

threads.each(&:join)

On MRI this will run fine for all eternity. On Rubinius, on the other hand, this crashes almost instantly with the following output:

Starting threads...
An exception occurred running /tmp/openssl_thread.rb:

Expected digest "70b9d40465992fa71a861e050481fb1a4bed5c8aa127272e7bf5838a3b0ab240" to equal "83103cff21a7a50e45eb90c29ca9d24204d95fe7369565ec1d5339750da386ab" (RuntimeError)

Backtrace:

  { } in Object#__script__ at /tmp/openssl_thread.rb:23
       Kernel(Object)#loop at kernel/common/kernel.rb:511
  { } in Object#__script__ at /tmp/openssl_thread.rb:16
                 Proc#call at kernel/bootstrap/proc.rb:20
            Thread#__run__ at kernel/bootstrap/thread.rb:356

Interestingly enough, the digest appears to always be "70b9d40465992fa71a861e050481fb1a4bed5c8aa127272e7bf5838a3b0ab240". However, when running Rubinius with a CAPI lock (using -Xcapi.lock) the produced digest is "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" (the SHA-256 digest of the empty string), though the VM sometimes segfaults in the OpenSSL code.

To cut a long story short, it seems the OpenSSL extension is beyond broken in multi-threaded environments or Rubinius does some weird things to it. I'll continue digging until I'm sure what's to blame. If it turns out to be unrelated to the AWS SDK itself I'll close this issue.

yorickpeterse commented 9 years ago

Per "rideliner" from the Rubinius Gitter channel, the above snippet is flawed. The assignment to digest inside the Thread.new block assigns to the outer digest variable rather than creating a new local, so all threads share a single variable, leading to the race condition. Somehow I expected locals to at least be thread-local (even in this case). I shall get myself a dunce cap.
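
A corrected version of the check might look like the sketch below. It is an adaptation of the script above, not something posted in the thread: the method name `count_digest_mismatches` and the bounded iteration count are additions so that it terminates. The key change is that the per-thread digest uses a name (`local_digest`) that does not exist in the enclosing scope, so each block invocation gets its own variable.

```ruby
require 'openssl'

# Returns the total number of digest mismatches seen across all threads.
def count_digest_mismatches(input, threads: 10, iterations: 1_000)
  expected = OpenSSL::Digest::SHA256.hexdigest(input)

  workers = Array.new(threads) do
    Thread.new do
      mismatches = 0

      iterations.times do
        # First assigned inside the block, so this variable is not shared
        # with other threads.
        local_digest = OpenSSL::Digest::SHA256.new
        local_digest.update(input)

        mismatches += 1 if local_digest.hexdigest != expected
      end

      mismatches
    end
  end

  # Thread#value joins the thread and returns the block's result.
  workers.map(&:value).inject(0, :+)
end

puts count_digest_mismatches('the cake is possibly a lie')
```

An alternative with the same effect is to declare a block-local variable explicitly, as in `loop do |;digest| ... end`, which shadows the outer variable.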

Either way, the actual signature problem still persists. I'll continue my investigation, but at least it seems that neither rubysl-digest nor rubysl-openssl is the cause of these problems.

yorickpeterse commented 9 years ago

@trevorrowe I literally just got this error on MRI 2.2 as well (though this is using V1), which seems to rule out Rubinius itself being the problem. The error/backtrace is as follows (as taken from Rollbar):

AWS::SQS::Errors::SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been
'POST
/345153512707/qr_generator

content-length:175
content-type:application/x-www-form-urlencoded; charset=utf-8
host:sqs.eu-west-1.amazonaws.com
user-agent:aws-sdk-ruby/1.63.0 ruby/2.2.2 x86_64-linux
x-amz-content-sha256:47236ab18c30db436023205b5b09bd472e527b24ade58bc3acc21ed4a0d6ad4c
x-amz-date:20150716T181615Z
x-amz-security-token:[REMOVED]

content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-security-token
[REMOVED]'

The String-to-Sign should have been
'AWS4-HMAC-SHA256
20150716T181615Z
20150716/eu-west-1/sqs/aws4_request
8b1031f744460308b9295139a5b1c3992e3bf83e66b70b699fd3479fe2d25c19'

Backtrace:

File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/core/client.rb" line 375 in return_or_raise
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/core/client.rb" line 476 in client_request
File "(eval)" line 3 in receive_message
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/sqs/queue.rb" line 201 in receive_message
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/sqs/queue.rb" line 303 in block in poll
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/sqs/queue.rb" line 301 in loop
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/aws-sdk-v1-1.63.0/lib/aws/sqs/queue.rb" line 301 in poll
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/oni-3.1.0/lib/oni/daemons/sqs.rb" line 34 in receive
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/oni-3.1.0/lib/oni/daemon.rb" line 180 in run_thread
File "/var/www/barqr/deploy-2015-06-17_13_58_37/vendor/bundle/ruby/2.2.0/gems/oni-3.1.0/lib/oni/daemon.rb" line 164 in block in spawn_thread

yorickpeterse commented 9 years ago

For the first time ever I also experienced this issue on JRuby 1.7:

AWS::SQS::Errors::SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been
'POST
/345153512707/opener-opinion-detector-basic

content-length:192
content-type:application/x-www-form-urlencoded; charset=utf-8
host:sqs.eu-west-1.amazonaws.com
user-agent:aws-sdk-ruby/1.64.0 jruby/1.9.3 java
x-amz-content-sha256:90638f76a52ee75c5210c44a615ba9e75278e3e54da88972f50cb680501308d6
x-amz-date:20150820T230324Z
x-amz-security-token:[REMOVED]

content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-security-token
[REMOVED]'

The String-to-Sign should have been
'AWS4-HMAC-SHA256
20150820T230324Z
20150820/eu-west-1/sqs/aws4_request
25907c2c04c518ad21ab5a3c66f1a31fa3eaacc87c62cc6aecf740cb44710c14'

Backtrace:

File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.64.0/lib/aws/core/client.rb" line 375 in return_or_raise
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.64.0/lib/aws/core/client.rb" line 476 in client_request
File "(eval)" line 3 in receive_message
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.64.0/lib/aws/sqs/queue.rb" line 201 in receive_message
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.64.0/lib/aws/sqs/queue.rb" line 303 in poll
File "org/jruby/RubyKernel.java" line 1501 in loop
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.64.0/lib/aws/sqs/queue.rb" line 301 in poll
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/oni-3.1.1/lib/oni/daemons/sqs.rb" line 34 in receive
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/oni-3.1.1/lib/oni/daemon.rb" line 180 in run_thread
File "/usr/local/rvm/gems/jruby-1.7.16.1/gems/oni-3.1.1/lib/oni/daemon.rb" line 164 in spawn_thread

Ruby info:

jruby 1.7.16.1 (1.9.3p392) 2014-10-28 4e93f31 on OpenJDK 64-Bit Server VM 1.7.0_85-mockbuild_2015_07_20_19_47-b00 +jit [linux-amd64]

awood45 commented 7 years ago

This issue has been inactive for over a year, and V1 is end of life. Happy to continue to look at this, given all the effort in place so far, if we get more information. For now, closing. Again, feel free to reopen if more information comes in.
