aws / aws-xray-sdk-ruby

The official AWS X-Ray Recorder SDK for Ruby
Apache License 2.0

Threading issues with Puma #31

Open cbenning opened 5 years ago

cbenning commented 5 years ago

Started seeing these:

#<NoMethodError: undefined method `borrow_or_take' for nil:NilClass>
/usr/local/rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/aws-xray-sdk-0.11.2/lib/aws-xray-sdk/sampling/default_sampler.rb:85:in `process_matched_rule'
/usr/local/rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/aws-xray-sdk-0.11.2/lib/aws-xray-sdk/sampling/default_sampler.rb:54:in `sample_request?'
/usr/local/rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/aws-xray-sdk-0.11.2/lib/aws-xray-sdk/facets/helper.rb:36:in `should_sample?'
/usr/local/rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/aws-xray-sdk-0.11.2/lib/aws-xray-sdk/facets/rack.rb:30:in `call'
/usr/local/rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/puma-4.0.1/lib/puma/configuration.rb:228:in `call'
/usr/local/rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/puma-4.0.1/lib/puma/server.rb:657:in `handle_request'
/usr/local/rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/puma-4.0.1/lib/puma/server.rb:467:in `process_client'
/usr/local/rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/puma-4.0.1/lib/puma/server.rb:328:in `block in run'
/usr/local/rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/puma-4.0.1/lib/puma/thread_pool.rb:135:in `block in spawn_thread'

However, we aren't using sampling, so I'm not sure what's up.

ss2305 commented 5 years ago

@cbenning The sampling decision coming from trace_header always has the highest precedence. If the trace header doesn't contain a sampling decision, the SDK checks whether sampling is enabled in the recorder. If it is not enabled, it returns 'true'.

When you don't use sampling, the sampler looks for a previously made decision based on the default rule if no path-based rule has been matched. I think that's why you see this error.
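The precedence described above can be sketched roughly as follows. This is an illustration only, not the SDK's code: `sampling_decision` and its keyword arguments are hypothetical names, and the sampler is assumed to be any callable that returns true or false.

```ruby
# Hypothetical helper, NOT the SDK's implementation: it only mirrors the
# order of checks described in the comment above.
def sampling_decision(header_decision:, sampling_enabled:, sampler:)
  # 1. A decision carried in the trace header always wins.
  return header_decision unless header_decision.nil?

  # 2. With sampling disabled in the recorder, every request is traced.
  return true unless sampling_enabled

  # 3. Otherwise defer to the sampler (any callable in this sketch).
  sampler.call
end

never = -> { false }
puts sampling_decision(header_decision: true, sampling_enabled: true,  sampler: never)
puts sampling_decision(header_decision: nil,  sampling_enabled: false, sampler: never)
puts sampling_decision(header_decision: nil,  sampling_enabled: true,  sampler: never)
```

Note that the backtrace in the original report passes through `sample_request?`, so under this reading the request reached the third step, where the sampler is consulted.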

cbenning commented 5 years ago

So I guess that means sampling: true is the default and I'm using it without realizing it?

ss2305 commented 5 years ago

@cbenning The default rule traces the first request each second, and five percent of any additional requests across all services sending traces to X-Ray. If the SDK can't reach X-Ray to get sampling rules, it reverts to a default local rule of the first request each second, and five percent of any additional requests per host. This can occur if the host doesn't have permission to call sampling APIs, or can't connect to the X-Ray daemon, which acts as a TCP proxy for API calls made by the SDK.

You can find more in the documentation here.
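For illustration, the fallback behaviour described above (the first request each second, then five percent of the rest) could look something like the sketch below. `FallbackSampler`, its parameters, and the injectable `clock` and `rng` are assumptions made for this example; the SDK's own sampler is structured differently.

```ruby
# Hypothetical sketch of "first request each second, then five percent".
# Not the SDK's sampler; names and parameters are invented for illustration.
class FallbackSampler
  def initialize(reservoir_per_second: 1, rate: 0.05,
                 clock: -> { Time.now.to_i }, rng: -> { rand })
    @reservoir_per_second = reservoir_per_second
    @rate = rate
    @clock = clock            # returns the current time in whole seconds
    @rng = rng                # returns a float in [0, 1)
    @lock = Mutex.new         # the counters are shared across server threads
    @current_second = nil
    @taken = 0
  end

  def sample?
    @lock.synchronize do
      now = @clock.call
      # Reset the per-second reservoir when a new second starts.
      if now != @current_second
        @current_second = now
        @taken = 0
      end
      # Take from the reservoir while it still has capacity.
      if @taken < @reservoir_per_second
        @taken += 1
        return true
      end
    end
    # Reservoir exhausted: fall back to the fixed percentage.
    @rng.call < @rate
  end
end

sampler = FallbackSampler.new
decisions = Array.new(100) { sampler.sample? }
puts "traced #{decisions.count(true)} of #{decisions.size} requests"
```

The mutex matters under a threaded server such as Puma, since every request thread reads and updates the same counters.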

cbenning commented 5 years ago

This error only happens occasionally; as far as we can tell, it is working fine otherwise. Are you suggesting this is a connectivity/latency issue with the local X-Ray daemon?

ss2305 commented 5 years ago

@cbenning If it breaks only sporadically, then yes. However, if you can reproduce the issue consistently and share a sample with us, it would be greatly appreciated.

cbenning commented 5 years ago

Ok @ss2305 I can't reproduce it reliably so I will just treat it as intermittent for now and keep an eye on it.

thanks

cbenning commented 5 years ago

@ss2305 Also, this triggers alerts for us. What is a safe way to suppress them? Can I add a default sampling rule? It still feels to me that this should not produce a stack trace if it is just business as usual in this situation.

Would increasing the Concurrency setting potentially improve this? The X-Ray daemon in our instance is using the default config, but I don't see why it would be unreachable.

chanchiem commented 5 years ago

We're taking a deeper look at this and we will update it when we have any new findings. Thanks for letting us know about this issue.

cbenning commented 5 years ago

FYI, increasing Concurrency from 8 to 24 had no effect.

thegorgon commented 3 years ago

We're experiencing the same issue - sporadically, requests will fail while making a sampling decision.

Looks like this issue is quite stale. Was there any update or mitigation discovered?

cbenning commented 3 years ago

@thegorgon Not that we have discovered. We've turned the sampling rate down to almost nothing and have basically stopped using X-Ray with Ruby.

willarmiros commented 3 years ago

@thegorgon I know this is a sporadic issue, but have you noticed any patterns in when it fails? Any help in reproducing this, like a sample app (even if the error only occurs intermittently), would be very useful. For some context from an initial inspection, it looks like this error must be happening because a sampling rule doesn't have a reservoir. Given that SamplingRules are always initialized with a non-nil Reservoir, I'd guess something weird may be happening in this rule merge logic.

Are you using Centralized sampling? If so can you describe your use case?
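To make the guess about rule merging concrete, here is a minimal sketch of how a merge step could guard against ending up with a rule that has no reservoir. Everything in it is hypothetical: `Rule`, `Reservoir`, and `RuleCache` are invented for this example and are not the SDK's classes, and nothing in this thread confirms that the actual bug lives in the merge.

```ruby
# Hypothetical types for illustration only; these are NOT the SDK's classes.
Reservoir = Struct.new(:quota)
Rule = Struct.new(:name, :reservoir)

class RuleCache
  def initialize
    @lock = Mutex.new
    @rules = {}
  end

  # Replace the rule set, carrying each old reservoir over only when it
  # exists, so a merged rule can never be left with a nil reservoir.
  def merge(new_rules)
    @lock.synchronize do
      merged = new_rules.each_with_object({}) do |rule, acc|
        old = @rules[rule.name]
        rule.reservoir = old.reservoir if old && old.reservoir
        rule.reservoir ||= Reservoir.new(0) # last-resort default
        acc[rule.name] = rule
      end
      # Swap in the finished set in one step, so readers never see a
      # partially merged hash.
      @rules = merged
    end
  end

  def fetch(name)
    @lock.synchronize { @rules[name] }
  end
end

cache = RuleCache.new
cache.merge([Rule.new("default", Reservoir.new(5))])
cache.merge([Rule.new("default", nil)])
puts cache.fetch("default").reservoir.quota
```

Building the merged set off to the side and guarding both `merge` and `fetch` with the same mutex is one way to keep request threads from observing a half-updated rule while a refresh is in progress.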