aws / aws-xray-sdk-ruby

The official AWS X-Ray Recorder SDK for Ruby
Apache License 2.0
60 stars 58 forks source link

Logs flooded with `cannot find the current context` #85

Open moritonal opened 1 year ago

moritonal commented 1 year ago

Hihi

On the latest-version, seeing a lot of this in the logs:

E, [2023-02-14T16:46:29.307789 #87679] ERROR -- : cannot find the current context.

This is coming from this line which may be mistakenly hiding a bug?

The code is:

def current_entity
      if entity = Thread.current[LOCAL_KEY]
        entity
      else
        handle_context_missing
      end
end

But when I run Thread.current[LOCAL_KEY] I get uninitialized constant XRay::NetHttp::HTTPInstanceInterceptor::LOCAL_KEY which I imagine is unintended.

Ah, but further research shows this is coming from the Unleash gem I use which runs on a scheduled task on it's own Thread.

So Unleash fires, calling http-fetch which get's intercepted by XRay which throws an error because no context is set.

I've also found an undocumented context_missing option of IGNORE_ERROR which silences the issue.

moritonal commented 1 year ago

Okay, I think I get it.

The HTTP Interceptor uses the following logic:

entity = XRay.recorder.current_entity
capture = !(entity && entity.namespace && entity.namespace == 'aws'.freeze)
if started? && capture && entity
     XRay.recorder.capture(address, namespace: 'remote') do |subsegment|
          protocol = use_ssl? ? 'https'.freeze : 'http'.freeze

Now, .capture actually set's the entity, but you check previously to see if the entity is a specific type to work out if you should skip it. Checking the parameter causes the error to fire.

willarmiros commented 1 year ago

Hi @moritonal - thanks for finding this bug! in #83 we changed this behavior from a runtime error to a log message, but it seems you're right that we should be avoiding the context missing issue in the first place here. Please feel free to submit a PR to address this, otherwise we will take a look when we're able.

moritonal commented 1 year ago

This is making more sense, Unleash runs a scheduled job which runs in a thread and makes an HTTP call. Because this isn't wrapped in the Rails middleware then there isn't an active segment.

This causes the error to be thrown. So... it's not really a bug, it's kind of reasonable that the the sub-call to HTTP is throwing an error because it isn't owned by a particular segment.

Thinking it through we have a few choices I'd like your input on:

Or some combo of all of the above?

moritonal commented 1 year ago

This missing context was fixed with the following monkey-patch to intercept calls from Unleash. Although it does beg the question of how do I tell X-Ray that a message isn't related to a specific request?

module Unleash
  module ToggleFetcherExtensions
    def fetch
      XRay.recorder.begin_segment "Unleash"

      super

      XRay.recorder.end_segment
    end
  end

  # Add the extension to the ToggleFetcher class
  class ToggleFetcher
    prepend ToggleFetcherExtensions
  end
end
moritonal commented 1 year ago

The same effect occurs actually when using it with rake given that doesn't act as a segment all the ActiveQuery calls spam out in some cases thousands of log lines.

willarmiros commented 1 year ago

We would be open to a PR for improving the log message's prescriptiveness.

Although it does beg the question of how do I tell X-Ray that a message isn't related to a specific request?

It seems like your monkeypatch should basically do this! It should produce a single segment named "Unleash" and that segment will parent any subsegments for HTTP calls made in that scope.