aws / aws-sdk-ruby

The official AWS SDK for Ruby.
https://aws.amazon.com/sdk-for-ruby/
Apache License 2.0
3.56k stars 1.22k forks source link

AWS::SQS::Errors::SignatureDoesNotMatch errors on Rubinius #455

Closed yorickpeterse closed 7 years ago

yorickpeterse commented 10 years ago

Since switching a daemon over to Rubinius we've been getting quite a few of the following errors:

AWS::SQS::Errors::SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been
'POST
/

content-length:99
content-type:application/x-www-form-urlencoded; charset=utf-8
host:sqs.eu-west-1.amazonaws.com
user-agent:aws-sdk-ruby/1.32.0 rbx/2.1.0 x86_64-linux-gnu
x-amz-content-sha256:5e204b2282fa47a4ecee2d1d985468c5a342980c025cad72584efcc69eb2f938
x-amz-date:20140128T112115Z

content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date
5e204b2282fa47a4ecee2d1d985468c5a342980c025cad72584efcc69eb2f938'

The String-to-Sign should have been
'AWS4-HMAC-SHA256
20140128T112115Z
20140128/eu-west-1/sqs/aws4_request
b511d5ea84ff04bd3445ec8f15fdf21b458a41d7b452a4057018dd66a2e3cfab'

The corresponding stack trace:

File "/var/www/review_persister/deploy-2014-01-28_11_20_30/vendor/bundle/rbx/2.1/gems/aws-sdk-1.32.0/lib/aws/core/client.rb" line 368 in return_or_raise
File "/var/www/review_persister/deploy-2014-01-28_11_20_30/vendor/bundle/rbx/2.1/gems/aws-sdk-1.32.0/lib/aws/core/client.rb" line 469 in client_request
File "(eval)" line 3 in get_queue_url
File "/var/www/review_persister/deploy-2014-01-28_11_20_30/vendor/bundle/rbx/2.1/gems/aws-sdk-1.32.0/lib/aws/sqs/queue_collection.rb" line 161 in url_for
File "/var/www/review_persister/deploy-2014-01-28_11_20_30/vendor/bundle/rbx/2.1/gems/aws-sdk-1.32.0/lib/aws/sqs/queue_collection.rb" line 139 in named

The root frame of the error is our own code which creates an instance of AWS::SQS and then tries to retrieve a queue by its name. This happens in a threaded environment (10 threads to be exact) under Rubinius. MRI does not (so far) seem to be affected.

It's not exactly clear to me what's causing it. The individual AWS::SQS and related instances are not shared between threads, nor are any of our own AWS related operations.

As a temporary fix I disabled checksum validation for SQS operations but I'd rather not keep that disabled in the long run.

Setup info:

yorickpeterse commented 10 years ago

To clarify, we've been running the same code base under MRI for a good 2 months now without any of these errors popping up. Last Friday I switched things over to Rubinius and they've since started to occur. It could be a bug in Rubinius but the stack trace doesn't show any clear signs of this sadly.

yorickpeterse commented 10 years ago

And it seems disabling checksum validation doesn't do anything, the error just re-occurred.

yorickpeterse commented 10 years ago

Extra info: the error seems to occur when a thread encounters an error, kicks off the error handling process and is then re-started (using a retry). I'll see if I can set up a way to reproduce this.

yorickpeterse commented 10 years ago

Managed to reproduce it using the following script on Rubinius:

require 'aws'
require 'thread'

AWS.config(
  :access_key_id     => '...SNIP...',
  :secret_access_key => '...SNIP...',
  :sqs_region        => 'eu-west-1',
  :sqs_endpoint      => 'sqs.eu-west-1.amazonaws.com'
)

AWS.eager_autoload!(AWS::SQS)
AWS.eager_autoload!(AWS::Core)

def do_work
  sqs   = AWS::SQS.new
  queue = sqs.queues.named('test')

  queue.poll do |message|
    sleep 2 # Fake some work
    raise 'Fake error to showcase restarting of threads'
  end
end

threads = []
mutex   = Mutex.new

10.times do
  threads << Thread.new do
    begin
      do_work
    rescue => error
      # Don't mess up console output
      mutex.synchronize { puts "Error: #{error.message}" }

      sleep 2
      retry
    end
  end
end

threads.each(&:join)
yorickpeterse commented 10 years ago

It's worth noting that the behaviour is rather random. In some runs I can't reproduce it, some other runs it happens basically right away.

trevorrowe commented 10 years ago

Its curious that it is generating a signature error. Thank you for all of the details. I'll see what I can do about reproducing on my end.

yorickpeterse commented 10 years ago

Any news on this? I'll be performing some extra tests today on the latest version of Rbx but I believe the issue still persists. Not yet sure yet if it's an issue with the AWS SDK or Rbx though.

yorickpeterse commented 10 years ago

Extra note: this issue still persists on Rbx 2.2.6. Not sure yet if it's an Rbx or aws-sdk issue.

yorickpeterse commented 10 years ago

Any news on this? I'd love to somehow fix this in whatever way we can in Rbx. However, I'm not familiar with the inner workings of the aws-sdk and have never seen anything like this outside of it.

trevorrowe commented 10 years ago

Sorry for the slow response. I spent quite a bit of time reproducing and trying to resolve the issue from my end. Here are some of my findings:

The authorization header is contains a signature that is computed roughly like so:

OpenSSL::HMAC.hexdigest( OpenSSL::Digest.new('sha256'), derived_key, string_to_sign)

I have verified I am working with the correct string to sign. That leaves the possible sources of discrepancy to:

The derived key is is calculated like so:

def derive_key(datetime)
  k_secret = credentials.secret_access_key
  k_date = hmac("AWS4" + k_secret, datetime[0,8])
  k_region = hmac(k_date, region)
  k_service = hmac(k_region, service_name)
  k_credentials = hmac(k_service, 'aws4_request')
end

Given the intermittent failure modes, this leeds me to strongly suspect a thread-safety issue. The v4 signer that computes the signature does not share state between calls to sign (a new digest is always created), except the read only credentials.

Is it possible that the OpenSSL::HMAC or OpenSSL::Digest modules are not thread safe? I'm open to suggestions.

yorickpeterse commented 10 years ago

I'm not aware of any thread-safety issues with OpenSSL::Digest. The code used by Rubinius for this is the same code as used in MRI with some changes to make it compatible with Rbx. Assuming this were thread-unsafe it should also affect MRI, though the GIL probably prevents it from occuring. Having said that, it could be that indeed OpenSSL is somehow not thread-safe.

Is there a way this can be tested without the whole aws-sdk stack? Since you mentioned that the various signatures match it could be that something else other than OpenSSL is interfering along the line.

JoshMcKin commented 10 years ago

Hate to barge in with some possibly not very useful information. In the early development of EmAws I saw this error when I hadn't patched the Mutex with the Em.synchrony's fiber safe mutex. I 90% sure there use to be a mutex around fetching request credentials or a least a portion like the session_token but cannot be sure. I didn't create an issue in EmAws at the time because it was before it was ever released.

This was so long ago it might not be related but thought I'd mention it.

EmAws Mutex patch: https://github.com/JoshMcKin/em_aws/blob/master/lib/em_aws/patches.rb

trevorrowe commented 10 years ago

@YorickPeterse Yes, we should be able to test and isolate this further. I simply ran out of time yesterday.

Some more background on the error message:

The two strings the service echos back in the error message are the:

The canonical request is a string compiled from the request uri, headers and a full sha256 checksum of the http request body.

After generating the canonical request, you generate a "string to sign" which is comprised of a fixed string prefix, the current time, the credential scope (date/region/service name/fixed suffix) and finally a hexdigest of the canonical request string.

In my local test script (I will try to share this later), I am comparing the canonical request and string to sign that I generate with those returned by the service. They match perfectly in every error. Given the canonical request contains a sha256 checksum of the body, and the service checksum matches ours, I am positive the request is not getting corrupted over the wire.

This points to a bad signature. The final signature is created by computing a hex digest of the string to sign using a derived key (the derived key is your secret access key signed recursively with each portion of the credential scope).

The service validates the signature by computing and comparing signatures. In the case of failure, it echos back the computed fingerprints (canonical request and string to sign). It does not echo back the computed signature, but that is what fails to match. This is why I feel like the error is happening while generating the signature.

@JoshMcKin Not unhelpful, but I don't think that is the cause. In my test example, I am creating a single sqs client (with a single credentials object). That object is used to get my queue by name outside the concurrent sections of code. This should guarantee my static credentials are ready to read/reuse. You are correct, the non-static credential providers use a mutex to allow them to update.

I won't much have time to look at this today, I'm heading out of town this afternoon. I'll try to pick this back up on Monday though.

trevorrowe commented 10 years ago

@YorickPeterse I have not been able to reproduce this error locally single switching over to OpenSSL::Digest. Can you confirm?

yorickpeterse commented 10 years ago

@trevorrowe I haven't tried this yet with using OpenSSL::Digest instead of plain Digest. I'll give this a try tomorrow.

yorickpeterse commented 10 years ago

Sorry for the late reply. I gave this a shot using aws-sdk 1.43.2 and have not yet experienced the error. I'll do some further testing in the coming days to be absolutely sure it's gone.

yorickpeterse commented 10 years ago

Upon further testing this problem does seem to still occur on Rubinius 2.2.10 using aws-sdk 1.46.0. In this case I'm getting errors such as the following:

AWS::SQS::Errors::SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been
'POST
/345153512707/shepherd

content-length:171
content-type:application/x-www-form-urlencoded; charset=utf-8
host:sqs.eu-west-1.amazonaws.com
user-agent:aws-sdk-ruby/1.46.0 rbx/2.1.0 x86_64-linux-gnu
x-amz-content-sha256:0edb693803f329fa2976405b32df33f1bd0600f5aeea7b5d3cd15a3bec7ab9ab
x-amz-date:20140703T151611Z

content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date
0edb693803f329fa2976405b32df33f1bd0600f5aeea7b5d3cd15a3bec7ab9ab'

The String-to-Sign should have been
'AWS4-HMAC-SHA256
20140703T151611Z
20140703/eu-west-1/sqs/aws4_request
7447fe83fa48a596773ebecb9503237a2e4614988f6bb696bbe26fed8b6394ef
yorickpeterse commented 10 years ago

Contents of the Gemfile:

source 'https://rubygems.org'

gem 'activerecord', '~> 4.0'
gem 'mysql2'

gem 'rollbar'
gem 'daemon-kit'
gem 'aws-sdk', '~> 1.0'
gem 'logstash-file'
gem 'oni', '~> 3.0'
gem 'dotenv'
gem 'json', '>= 1.8.1'

group :development, :test do
  gem 'pry'
  gem 'pry-doc'
  gem 'pry-theme'

  gem 'rspec', '~> 3.0'
  gem 'simplecov'
  gem 'rake'
end

group :yard do
  gem 'yard'
  gem 'kramdown'
end

And Gemfile.lock:

GEM
  remote: https://rubygems.org/
  specs:
    activemodel (4.1.4)
      activesupport (= 4.1.4)
      builder (~> 3.1)
    activerecord (4.1.4)
      activemodel (= 4.1.4)
      activesupport (= 4.1.4)
      arel (~> 5.0.0)
    activesupport (4.1.4)
      i18n (~> 0.6, >= 0.6.9)
      json (~> 1.7, >= 1.7.7)
      minitest (~> 5.1)
      thread_safe (~> 0.1)
      tzinfo (~> 1.1)
    arel (5.0.1.20140414130214)
    aws-sdk (1.46.0)
      json (~> 1.4)
      nokogiri (>= 1.4.4)
    builder (3.2.2)
    coderay (1.1.0)
    daemon-kit (0.3.0)
      eventmachine (>= 0.12.10)
      thor
    diff-lcs (1.2.5)
    docile (1.1.5)
    dotenv (0.11.1)
      dotenv-deployment (~> 0.0.2)
    dotenv-deployment (0.0.2)
    eventmachine (1.0.3)
    i18n (0.6.9)
    json (1.8.1)
    kramdown (1.4.0)
    logstash-file (0.2.0)
      json
    method_source (0.8.2)
    mini_portile (0.6.0)
    minitest (5.3.5)
    multi_json (1.10.1)
    mysql2 (0.3.16)
    nokogiri (1.6.2.1)
      mini_portile (= 0.6.0)
    oni (3.1.0)
    pry (0.10.0)
      coderay (~> 1.1.0)
      method_source (~> 0.8.1)
      slop (~> 3.4)
    pry-doc (0.6.0)
      pry (~> 0.9)
      yard (~> 0.8)
    pry-theme (1.1.2)
      coderay (~> 1.1)
      json (~> 1.8)
    rake (10.3.2)
    rollbar (0.13.1)
      multi_json (~> 1.3)
    rspec (3.0.0)
      rspec-core (~> 3.0.0)
      rspec-expectations (~> 3.0.0)
      rspec-mocks (~> 3.0.0)
    rspec-core (3.0.2)
      rspec-support (~> 3.0.0)
    rspec-expectations (3.0.2)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.0.0)
    rspec-mocks (3.0.2)
      rspec-support (~> 3.0.0)
    rspec-support (3.0.2)
    simplecov (0.8.2)
      docile (~> 1.1.0)
      multi_json
      simplecov-html (~> 0.8.0)
    simplecov-html (0.8.0)
    slop (3.5.0)
    thor (0.19.1)
    thread_safe (0.3.4)
    tzinfo (1.2.1)
      thread_safe (~> 0.1)
    yard (0.8.7.4)

PLATFORMS
  ruby

DEPENDENCIES
  activerecord (~> 4.0)
  aws-sdk (~> 1.0)
  daemon-kit
  dotenv
  json (>= 1.8.1)
  kramdown
  logstash-file
  mysql2
  oni (~> 3.0)
  pry
  pry-doc
  pry-theme
  rake
  rollbar
  rspec (~> 3.0)
  simplecov
  yard
yorickpeterse commented 10 years ago

Ok I did some testing and both Digest and OpenSSL::Digest instances are not safe to share across threads. On MRI this seems to magically work, probably due to the lack of proper threading. On Rubinius doing so will rather quickly result in either a Ruby exception or an entire VM crash when running the following:

require 'thread'
require 'digest'
require 'digest/sha1'

Thread.abort_on_exception = true

expected = '0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33'
threads  = []
digest   = Digest::SHA1.new

100.times do
  threads << Thread.new do
    loop do
      got = digest.hexdigest('foo')

      unless got == expected
        raise "#{got} is not #{expected}"
      end
    end
  end
end

threads.each(&:join)

Using https://github.com/aws/aws-sdk-ruby/search?q=%22OpenSSL+Digest%22&type=Code there seem to be some places where Digest instances are kept around. For example, https://github.com/aws/aws-sdk-ruby/blob/ee75b781dac33fae825b2f652d8aa6234b6ddf59/lib/aws/core/signers/version_4/chunk_signed_stream.rb#L149 stores an instance variable with a Digest instance.

I'll do some digging to see if those particular bits are shared between threads.

trevorrowe commented 10 years ago

The chunk signed stream class you linked to is currently not used by the SDK. Each client constructs a single signer and reuses that across requests. However, I don't believe any of the signer classes cache digest objects (except the un-used chunk signer).

It could be fairly easy to test. The base client class defines a #sign_request method. This method memoizes the request signer. If this memoization were removed, the #sign_request method would construct a new signer for every request.

#sign_request is defined here:

https://github.com/aws/aws-sdk-ruby/blob/master/lib/aws/core/client.rb#L701

yorickpeterse commented 10 years ago

@trevorrowe The code in particular is a tad confusing and I'm not sure what method exactly to override. That is, AWS::Core::Client does not seem to have the method signature_version nor sign_request as either an instance or class method (at least according to Pry). The API documentation (http://docs.aws.amazon.com/AWSRubySDK/latest/AWS/Core/Client.html) also makes no mention of said method.

Basically my idea is to override the method so that it doesn't memoize the signer, then test if that works without throwing any signature errors.

trevorrowe commented 10 years ago

Each service can choose a different signature version / signer. For this reason, there is a macro .signature_version that defines the #sign_request method.

As a quick monkey patch, the following should work:

class AWS::SQS::Client
  def sign_request(req)
    args = [credential_provider, 'sqs', req.region]
    signer = AWS::Core::Signers::Version4.new(*args)
    signer.sign_request(req)
    req
  end
end

The code this is replacing looks like so:

def signature_version(version, service_signing_name = nil)
  define_method(:sign_request) do |req|
    @signer ||= begin
      signer_class = AWS::Core::Signers.const_get(version)
      signer_args = (version == :Version4) ?
        [credential_provider, service_signing_name, req.region] :
        [credential_provider]
      signer_class.new(*signer_args)
    end
    @signer.sign_request(req)
    req
  end
end

Notice, the default implantation caches the signer to the client instance. The monkey patched version creates a new signer for each request.

yorickpeterse commented 10 years ago

After testing said patch it appears that removing the caching of the signer instance doesn't prevent the errors from occurring. A global lock however seems to do the trick so far:

class AWS::SQS::Client
  GLOBAL_LOCK = Mutex.new

  def sign_request(req)
    GLOBAL_LOCK.synchronize do
      args = [credential_provider, 'sqs', req.region]
      signer = AWS::Core::Signers::Version4.new(*args)
      signer.sign_request(req)
      req
    end
  end
end

This might suggest that the actual AWS client instance is shared between threads. This would be odd considering the application in question itself shares nothing AWS related between threads. Instead each SQS connection runs in its own, isolated thread with its own client. I'll try to see if the client is indeed accessed from multiple threads.

yorickpeterse commented 10 years ago

I applied the following patch to our codebase:

class AWS::SQS::Client
  GLOBAL_LOCK = Mutex.new

  def sign_request(req)
    id = Thread.current.instance_variable_get(:@thread_id)
    puts "Thread ##{id} accessing #{self.class} ##{object_id}"

    GLOBAL_LOCK.synchronize do
      args = [credential_provider, 'sqs', req.region]
      signer = AWS::Core::Signers::Version4.new(*args)
      signer.sign_request(req)
      req
    end
  end
end

Running the application results in the following output:

Thread #12 accessing AWS::SQS::Client::V20121105 #165844
Thread #16 accessing AWS::SQS::Client::V20121105 #165844
Thread #11 accessing AWS::SQS::Client::V20121105 #165844
Thread #14 accessing AWS::SQS::Client::V20121105 #165844
Thread #10 accessing AWS::SQS::Client::V20121105 #165844
Thread #13 accessing AWS::SQS::Client::V20121105 #165844
Thread #12 accessing AWS::SQS::Client::V20121105 #165844
Thread #16 accessing AWS::SQS::Client::V20121105 #165844
Thread #11 accessing AWS::SQS::Client::V20121105 #165844
Thread #17 accessing AWS::SQS::Client::V20121105 #165844
Thread #13 accessing AWS::SQS::Client::V20121105 #165844
Thread #9 accessing AWS::SQS::Client::V20121105 #165844
Thread #8 accessing AWS::SQS::Client::V20121105 #165844
Thread #15 accessing AWS::SQS::Client::V20121105 #165844
Thread #10 accessing AWS::SQS::Client::V20121105 #165844
Thread #14 accessing AWS::SQS::Client::V20121105 #165844
Thread #12 accessing AWS::SQS::Client::V20121105 #165844
Thread #16 accessing AWS::SQS::Client::V20121105 #165844
Thread #13 accessing AWS::SQS::Client::V20121105 #165844
Thread #17 accessing AWS::SQS::Client::V20121105 #165844
Thread #11 accessing AWS::SQS::Client::V20121105 #165844
Thread #10 accessing AWS::SQS::Client::V20121105 #165844
Thread #8 accessing AWS::SQS::Client::V20121105 #165844
Thread #14 accessing AWS::SQS::Client::V20121105 #165844
Thread #15 accessing AWS::SQS::Client::V20121105 #165844
Thread #16 accessing AWS::SQS::Client::V20121105 #165844
Thread #12 accessing AWS::SQS::Client::V20121105 #165844
Thread #9 accessing AWS::SQS::Client::V20121105 #165844
Thread #11 accessing AWS::SQS::Client::V20121105 #165844
Thread #13 accessing AWS::SQS::Client::V20121105 #165844
Thread #14 accessing AWS::SQS::Client::V20121105 #165844
Thread #8 accessing AWS::SQS::Client::V20121105 #165844
Thread #15 accessing AWS::SQS::Client::V20121105 #165844
Thread #16 accessing AWS::SQS::Client::V20121105 #165844
Thread #12 accessing AWS::SQS::Client::V20121105 #165844
Thread #17 accessing AWS::SQS::Client::V20121105 #165844
Thread #9 accessing AWS::SQS::Client::V20121105 #165844
Thread #11 accessing AWS::SQS::Client::V20121105 #165844
Thread #15 accessing AWS::SQS::Client::V20121105 #165844
Thread #16 accessing AWS::SQS::Client::V20121105 #165844
Thread #13 accessing AWS::SQS::Client::V20121105 #165844
Thread #12 accessing AWS::SQS::Client::V20121105 #165844
Thread #8 accessing AWS::SQS::Client::V20121105 #165844
Thread #14 accessing AWS::SQS::Client::V20121105 #165844

The last number in each line is the object ID of the client in question. Perhaps I'm missing something but this would suggest that there's a single AWS::SQS::Client instance being shared across all threads.

yorickpeterse commented 10 years ago

Worth noting: the @thread_id variable is Rubinius specific and its a hack to get it out this way. As far as I know you can't get thread IDs reliably on MRI.

trevorrowe commented 10 years ago

Thats interesting.

Given the global mutex around the signing code seems to resolve the problem, that implies one of a few possible causes:

The v4 signer has 3 instance variables, which are the credential, the service name string and the region string. What type of credentials object are you testing with? Can you dump out the credentials class name?

JoshMcKin commented 10 years ago

This may not matter but rubinius 2.2.9 is set to ruby version 2.1.0 but the open ssl threadsafe patch is in MRI Ruby 2.1.1+

yorickpeterse commented 10 years ago

@JoshMcKin What threadsafe patch? It could be that we have to update rubysl-openssl accordingly.

@trevorrowe Using this code:

class AWS::SQS::Client
  GLOBAL_LOCK = Mutex.new

  def sign_request(req)
    puts "Region: #{@region.inspect}"
    puts
    puts "Service: #{@service_ruby_name.inspect}"
    puts
    puts "Credentials: #{@credential_provider.inspect}"
    puts '---'
    puts

    GLOBAL_LOCK.synchronize do
      args = [credential_provider, 'sqs', req.region]
      signer = AWS::Core::Signers::Version4.new(*args)
      signer.sign_request(req)
      req
    end
  end
end

The output is as following:

Region: "eu-west-1"

Service: "sqs"

Credentials: #<AWS::Core::CredentialProviders::DefaultProvider:0x28768 @providers=[#<AWS::Core::CredentialProviders::StaticProvider:0x28770 @cache_mutex=#<Mutex:0x28774 @owner=nil> @static_credentials={} @cached_credentials={}>, #<AWS::Core::CredentialProviders::ENVProvider:0x28784 @prefix="AWS" @cache_mutex=#<Mutex:0x28788 @owner=nil> @suffixes={:access_key_id=>"ACCESS_KEY_ID", :secret_access_key=>"SECRET_ACCESS_KEY", :session_token=>"SESSION_TOKEN"} @cached_credentials={:access_key_id=>"[CENSORED]", :secret_access_key=>"[CENSORED]"}>, #<AWS::Core::CredentialProviders::ENVProvider:0x287a4 @prefix="AWS" @suffixes={:access_key_id=>"ACCESS_KEY", :secret_access_key=>"SECRET_KEY", :session_token=>"SESSION_TOKEN"}>, #<AWS::Core::CredentialProviders::ENVProvider:0x287b4 @prefix="AMAZON" @suffixes={:access_key_id=>"ACCESS_KEY_ID", :secret_access_key=>"SECRET_ACCESS_KEY", :session_token=>"SESSION_TOKEN"}>, #<AWS::Core::CredentialProviders::SharedCredentialFileProvider:0x287c4 @profile_name="default" @path="/home/yorickpeterse/.aws/credentials">, #<AWS::Core::CredentialProviders::EC2Provider:0x287cc @ip_address="169.254.169.254" @http_open_timeout=1 @http_read_timeout=1 @http_debug_output=nil @port=80>]>
---
trevorrowe commented 10 years ago

Just to rule out the credential provider chain, can you do the following:

# use your own account credentials here
credentials = { access_key_id:'akid', secret_access_key:'secret' }
credentials = AWS::Core::CredentialProviders::StaticProvider.new(credentials)
AWS.config(credential_provider: credentials)

Then re-run your tests without the global mutex around the #sign_request method body.

JoshMcKin commented 10 years ago

@YorickPeterse see: https://github.com/aws/aws-sdk-ruby/issues/525#issuecomment-41809120

yorickpeterse commented 10 years ago

@JoshMcKin Discussed this with @dbussink, apparently we do not have said patch in rubysl-openssl but we do have similar thread-safety changes. We'll have to look into updating said Gem according to MRI's current setup.

yorickpeterse commented 10 years ago

@trevorrowe The error persists when using that snippet (minus the global lock).

codekitchen commented 10 years ago

It probably goes without saying, but as another data point, the same problem affects other AWS::Services as well. I was experiencing very frequent AWS::S3::Errors::Forbidden errors after switching my multithreaded bucket-copying app to Rubinius, until I added the equivalent patch for S3:

class AWS::S3::Client
  GLOBAL_LOCK = Mutex.new

  def sign_request(req)
    GLOBAL_LOCK.synchronize do
      signer = AWS::Core::Signers::S3.new(credential_provider)
      signer.sign_request(req)
    end
  end
end
yorickpeterse commented 9 years ago

It's been a while, so here's a little follow-up. This particular problem still occurs on Rubinius 2.3, although much less often so far. I'll be doing another testing round with the patches discussed above to see how things behave in that case.

yorickpeterse commented 9 years ago

Said application has since been running with the above lock with zero errors popping up from the SQS code of the SDK.

@trevorrowe Do you know of any other way of solving this particular problem without a global lock? While it's rather clear the problem is with the request signing it would be great if this could be solved without having to resort to locks.

trevorrowe commented 9 years ago

I do not know any other solution. The signer is building the correct canonicalized string and string to sign. The only failure appears to be when it uses OpenSSL to generate the digest.

It is possible the SDK could manage a lock, but it would still need to be a global lock. Im open to other suggestions.

yorickpeterse commented 9 years ago

@trevorrowe Is it possible for the aws-sdk to not share client instances between threads, or is that part of the core design? If the latter is the case I think a lock is the only option.

trevorrowe commented 9 years ago

It is not possible to change the client construction behavior without making breaking changes. The v2 Ruby SDK no longer shares client instances.

That said, the client objects should be thread safe. See my comment above: https://github.com/aws/aws-sdk-ruby/issues/455#issuecomment-41341972

The actual bug does not appear to be caused by shared state. the failing call looks like this:

OpenSSL::HMAC.hexdigest( OpenSSL::Digest.new('sha256'), derived_key, string_to_sign)

Have we ruled out that the hexdigest method is not thread safe?

yorickpeterse commented 9 years ago

@trevorrowe Rubinius 2.3 ships with an updated version of Ruby's OpenSSL library, which should contain a bunch of threading related fixes. I'll do some extra digging tomorrow to see if I can still reproduce the error discussed in https://github.com/aws/aws-sdk-ruby/issues/455#issuecomment-47962592 (but using OpenSSL::Digest instead of Digest).

yorickpeterse commented 9 years ago

I'm taking another look into digest/openssl errors without involving the AWS SDK. For this I cooked up the following repository https://github.com/YorickPeterse/rubinius-digest-threading. Locally I can't seem to trigger any digest errors so it's running on Travis now: https://travis-ci.org/YorickPeterse/rubinius-digest-threading/builds/42244201

If this particular error can't be reproduced on its own anymore there are two possible scenario's:

  1. The error just happens very, very rarely
  2. The above problems are caused by aws-sdk, not Rubinius, rubysl-openssl or rubysl-digest

Both cases are possible and at this time it's hard to pinpoint exactly what's the cause.

yorickpeterse commented 9 years ago

Using an AWS specific repro script I do get a bunch of signature mismatch errors. I'm using the following script:

require 'aws'
require 'thread'

AWS.config(
  :access_key_id     => '...',
  :secret_access_key => '...',
  :sqs_region        => 'eu-west-1',
  :sqs_endpoint      => 'sqs.eu-west-1.amazonaws.com'
)

AWS.eager_autoload!(AWS::SQS)
AWS.eager_autoload!(AWS::Core)

def do_work
  sqs   = AWS::SQS.new
  queue = sqs.queues.named('test')

  queue.poll do |message|
    message.delete
  end
end

threads = []
mutex   = Mutex.new

Thread.abort_on_exception = true

100.times do |n|
  threads << Thread.new do
    mutex.synchronize { puts "Starting thread #{n + 1}..." }

    begin
      do_work
    rescue => error
      # Don't mess up console output
      mutex.synchronize { puts "Error: #{error.message}" }

      sleep 2
      retry
    end
  end
end

threads.each(&:join)

After a while this fails with the following errors:

The String-to-Sign should have been
'AWS4-HMAC-SHA256
20141126T221250Z
20141126/eu-west-1/sqs/aws4_request
4290b7aa82740ebf84a725b00abf1d30404ca131d5b9e55dfebafdb2d682e7e9'
Error: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been
'POST
/591910330996/test

content-length:167
content-type:application/x-www-form-urlencoded; charset=utf-8
host:sqs.eu-west-1.amazonaws.com
user-agent:aws-sdk-ruby/1.59.0 rbx/2.1.0 x86_64-linux-gnu
x-amz-content-sha256:436afced8784b07a74aec44059b716b172abf6cfd470361f3b667aad4561922d
x-amz-date:20141126T221452Z

content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date
436afced8784b07a74aec44059b716b172abf6cfd470361f3b667aad4561922d'

The String-to-Sign should have been
'AWS4-HMAC-SHA256
20141126T221452Z
20141126/eu-west-1/sqs/aws4_request
c7ab82a37dfd6ccfb4ff3b986d88f66f7730e1ba44d01bca3660b35acea7a311'
Error: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been
'POST
/591910330996/test

content-length:167
content-type:application/x-www-form-urlencoded; charset=utf-8
host:sqs.eu-west-1.amazonaws.com
user-agent:aws-sdk-ruby/1.59.0 rbx/2.1.0 x86_64-linux-gnu
x-amz-content-sha256:436afced8784b07a74aec44059b716b172abf6cfd470361f3b667aad4561922d
x-amz-date:20141126T221452Z

content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date
436afced8784b07a74aec44059b716b172abf6cfd470361f3b667aad4561922d'

...

When this particular error occurs it seems to occur in batches of multiple threads failing at roughly the same time.

yorickpeterse commented 9 years ago

If it helps, while I haven't seen these particular errors on JRuby I did just get the following error:

NameError: uninitialized constant AWS::Core::Signers::S3::Base
org/jruby/RubyModule.java line 2723 in const_missing
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/signers/s3.rb line 59 in signature
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/signers/s3.rb line 53 in authorization
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/signers/s3.rb line 47 in sign_request
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/s3/client.rb line 58 in sign_request
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/client.rb line 492 in client_request
org/jruby/RubyProc.java line 271 in call
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/response.rb line 175 in build_request
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/response.rb line 114 in initialize
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/client.rb line 203 in new_response
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/client.rb line 490 in client_request
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/client.rb line 391 in log_client_request
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/client.rb line 477 in client_request
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/client.rb line 373 in return_or_raise
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/core/client.rb line 476 in client_request
(eval) line 3 in put_object
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/s3/s3_object.rb line 1760 in write_with_put_object
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/s3/s3_object.rb line 612 in write
/usr/local/rvm/gems/jruby-1.7.16.1/gems/aws-sdk-v1-1.61.0/lib/aws/s3/object_collection.rb line 85 in create
/usr/local/rvm/gems/jruby-1.7.16.1/gems/opener-daemons-2.3.2/lib/opener/daemons/uploader.rb line 32 in create
/usr/local/rvm/gems/jruby-1.7.16.1/gems/opener-daemons-2.3.2/lib/opener/daemons/uploader.rb line 18 in upload
/usr/local/rvm/gems/jruby-1.7.16.1/gems/opener-daemons-2.3.2/lib/opener/daemons/worker.rb line 73 in upload_output
/usr/local/rvm/gems/jruby-1.7.16.1/gems/opener-daemons-2.3.2/lib/opener/daemons/worker.rb line 46 in process

While this is in S3 and not SQS I suspect this code is suffering from the same problem.

quinn commented 9 years ago

i see the label "Version 1" on here, but I am seeing this with v2.0.30. I am using MRI 2.1.3p242. error happens after awhile using threads.

yorickpeterse commented 9 years ago

@quinn Can you reproduce it using the script mentioned in https://github.com/aws/aws-sdk-ruby/issues/455#issuecomment-64718616? You probably have to adjust the script a bit for V2 of the SDK though.

quinn commented 9 years ago

@YorickPeterse I think i was mistaken, i'm trying w/o threads and still seeing. sorry for the confusion

yorickpeterse commented 9 years ago

@quinn If possible, could you at least share the error output if you still have it?

quinn commented 9 years ago

sure:

/Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.0.30/lib/seahorse/client/plugins/raise_response_errors.rb:15:in `call': The request signature we calculated does not match the signature you provided. Check your key and signing method. (Aws::S3::Errors::SignatureDoesNotMatch)
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.0.30/lib/aws-sdk-core/plugins/s3_sse_cpk.rb:18:in `call'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.0.30/lib/seahorse/client/plugins/param_conversion.rb:22:in `call'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.0.30/lib/aws-sdk-core/plugins/response_paging.rb:10:in `call'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.0.30/lib/seahorse/client/plugins/response_target.rb:18:in `call'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.0.30/lib/seahorse/client/request.rb:70:in `send_request'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.0.30/lib/seahorse/client/base.rb:216:in `block (2 levels) in define_operation_methods'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-resources-2.0.30/lib/aws-sdk-resources/request.rb:24:in `call'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-resources-2.0.30/lib/aws-sdk-resources/operations.rb:41:in `call'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-resources-2.0.30/lib/aws-sdk-resources/operation_methods.rb:19:in `block in add_operation'
    from copy_assets.rb:8:in `block in <main>'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-resources-2.0.30/lib/aws-sdk-resources/batch.rb:75:in `each'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-resources-2.0.30/lib/aws-sdk-resources/batch.rb:75:in `each'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-resources-2.0.30/lib/aws-sdk-resources/batch.rb:75:in `each'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-resources-2.0.30/lib/aws-sdk-resources/collection.rb:18:in `block in each'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-resources-2.0.30/lib/aws-sdk-resources/operations.rb:128:in `block in all_batches'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.0.30/lib/aws-sdk-core/pageable_response.rb:77:in `each'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-resources-2.0.30/lib/aws-sdk-resources/operations.rb:127:in `all_batches'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-resources-2.0.30/lib/aws-sdk-resources/collection.rb:18:in `each'
    from /Users/lsu/.rbenv/versions/2.1.3/lib/ruby/gems/2.1.0/gems/aws-sdk-resources-2.0.30/lib/aws-sdk-resources/collection.rb:18:in `each'
    from copy_assets.rb:5:in `<main>'

It happens after awhile but always on the same object. I am batch copying from one bucket to another using Object#copy_from.

quinn commented 9 years ago

I was not url encoding the copy source param, a bit embarrassing. Still, it's a cryptic error message for this type of error.

yorickpeterse commented 9 years ago

@trevorrowe This particular problem also seems to occur when using V2 of the SDK and Rubinius 2.5.2:

Aws::S3::Errors::SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-core-2.0.40/lib/seahorse/client/plugins/raise_response_errors.rb" line 15 in call
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-core-2.0.40/lib/aws-sdk-core/plugins/s3_sse_cpk.rb" line 18 in call
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-core-2.0.40/lib/seahorse/client/plugins/param_conversion.rb" line 19 in call
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-core-2.0.40/lib/aws-sdk-core/plugins/response_paging.rb" line 9 in call
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-core-2.0.40/lib/seahorse/client/plugins/response_target.rb" line 15 in call
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-core-2.0.40/lib/seahorse/client/request.rb" line 70 in send_request
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-core-2.0.40/lib/seahorse/client/base.rb" line 216 in define_operation_methods
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-resources-2.0.40/lib/aws-sdk-resources/request.rb" line 24 in call
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-resources-2.0.40/lib/aws-sdk-resources/operations.rb" line 41 in call
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/lib/kaf_processor/s3_uploader.rb" line 41 in upload_json
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/lib/kaf_processor/processor/text_nodes.rb" line 24 in process_without_trace_Custom_KafProcessor_Processor_TextNodes_process (process)
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/newrelic_rpm-3.10.0.279/lib/new_relic/agent/method_tracer.rb" line 343 in process_with_trace_Custom_KafProcessor_Processor_TextNodes_process
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/newrelic_rpm-3.10.0.279/lib/new_relic/agent/method_tracer_helpers.rb" line 81 in trace_execution_scoped
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/newrelic_rpm-3.10.0.279/lib/new_relic/agent/method_tracer.rb" line 341 in process (process_with_trace_Custom_KafProcessor_Processor_TextNodes_process)
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/lib/kaf_processor/worker.rb" line 62 in run_processor
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/lib/kaf_processor/worker.rb" line 40 in process
kernel/bootstrap/array.rb" line 76 in each
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/lib/kaf_processor/worker.rb" line 39 in _process (process)
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/config/initializers/newrelic_tracers.rb" line 16 in process_without_newrelic_transaction_trace (process)
(eval)" line 3 in process_with_newrelic_transaction_trace
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/newrelic_rpm-3.10.0.279/lib/new_relic/agent/instrumentation/controller_instrumentation.rb" line 353 in perform_action_with_newrelic_trace
(eval)" line 2 in process (process_with_newrelic_transaction_trace)
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/oni-4.0.0/lib/oni/daemon.rb" line 101 in run_worker
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/oni-4.0.0/lib/oni/daemon.rb" line 85 in process
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/oni-4.0.0/lib/oni/daemon.rb" line 174 in run_thread
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/oni-4.0.0/lib/oni/daemons/sqs.rb" line 35 in receive
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-resources-2.0.40/lib/aws-sdk-resources/services/sqs/queue_poller.rb" line 413 in yield_messages
kernel/bootstrap/array.rb" line 76 in each
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-resources-2.0.40/lib/aws-sdk-resources/services/sqs/queue_poller.rb" line 412 in yield_messages
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-resources-2.0.40/lib/aws-sdk-resources/services/sqs/queue_poller.rb" line 405 in process_messages
kernel/bootstrap/proc.rb" line 20 in call
kernel/common/throw_catch.rb" line 30 in catch
kernel/common/throw_catch.rb" line 7 in register
kernel/common/throw_catch.rb" line 29 in catch
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-resources-2.0.40/lib/aws-sdk-resources/services/sqs/queue_poller.rb" line 404 in process_messages
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-resources-2.0.40/lib/aws-sdk-resources/services/sqs/queue_poller.rb" line 336 in poll
kernel/common/kernel.rb" line 510 in loop
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-resources-2.0.40/lib/aws-sdk-resources/services/sqs/queue_poller.rb" line 331 in poll
kernel/bootstrap/proc.rb" line 20 in call
kernel/common/throw_catch.rb" line 30 in catch
kernel/common/throw_catch.rb" line 7 in register
kernel/common/throw_catch.rb" line 29 in catch
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/aws-sdk-resources-2.0.40/lib/aws-sdk-resources/services/sqs/queue_poller.rb" line 330 in poll
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/oni-4.0.0/lib/oni/daemons/sqs.rb" line 34 in receive
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/oni-4.0.0/lib/oni/daemon.rb" line 174 in run_thread
/var/www/kaf_processor/deploy-2015-04-24_15_08_47/vendor/bundle/rbx/2.1/gems/oni-4.0.0/lib/oni/daemon.rb" line 158 in spawn_thread
kernel/bootstrap/proc.rb" line 20 in call
kernel/bootstrap/thread.rb" line 396 in __run__

screenshot_2015-04-24_17-57-04

There is also a huge amount of autoloading problems occurring when using V2. On V1 we also had these problems but they could be solved by using AWS.eager_autoload!. Because such a method doesn't exist for V2 I had to hack it together myself, basically it would recursively refer to all constants, fixing the problem. I'll be looking into seeing if this is a problem with Rbx's autoload not being thread-safe.

trevorrowe commented 9 years ago

I did not implement eager_auto! load in v2 as my understanding is that autoload in now thread-safe. We should explore this more.

As for the signature errors, the only fix I understand that we've tried that resolves the issue is to put a global mutex around the OpenSSL digest methods. This seems like it shouldn't be necessary. Thoughts?