getsentry / sentry-ruby

Sentry SDK for Ruby
https://sentry.io/for/ruby

`async` option will be dropped. #1522

Open · st0012 opened this issue 3 years ago

st0012 commented 3 years ago

The async option is unique to the Ruby SDK. It was designed to help send events asynchronously through different backends (e.g. a Ruby thread, a Sidekiq worker, etc.). Depending on the backend, it can pose a threat to the system due to its extra memory consumption. So it's an option with some trade-offs.

But since version 4.1, the SDK has had its own managed background worker (implemented with the well-known concurrent-ruby library). It can handle most of the async use cases.

The async Option Approach

  1. The SDK serializes the event and event hint into JSON-compatible Ruby hashes.
  2. It passes the event payload and hint to the block.
  3. In general, the block would enqueue a background job with the above data.
    • Some earlier apps use a new Ruby thread to send the data. This is not recommended.
    • With background job libraries like Sidekiq or Resque, this means adding objects into Redis.
    • With delayed_job, this means adding a new delayed_job record.
  4. A background worker (e.g. a Sidekiq worker) then picks up the event and hint and sends them (see the sketch below).
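
As a concrete illustration of steps 1-4, here is a minimal sketch of the pattern using Sidekiq. The SentryJob class name is hypothetical; Sentry.send_event is the re-sending entry point mentioned under Pros below.

  require "sentry-ruby"
  require "sidekiq"

  Sentry.init do |config|
    config.dsn = "https://examplePublicKey@o0.ingest.sentry.io/0"
    # Steps 1-2: the SDK hands the serialized event hash and hint to this block.
    config.async = lambda do |event, hint|
      # Step 3: enqueue a background job (here, a Sidekiq job stored in Redis).
      SentryJob.perform_async(event, hint)
    end
  end

  class SentryJob
    include Sidekiq::Worker

    # Step 4: a Sidekiq worker picks up the payload and sends it to Sentry.
    def perform(event, hint)
      Sentry.send_event(event, hint)
    end
  end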

Pros

Users can customize their event-sending logic. But generally it's just a worker that calls Sentry.send_event(event, hint).

Cons

When there is an event spike, it can flood the medium storage (Redis) and take down the entire system.

The Background Worker

  1. The SDK passes the event and its hint to the background worker (a pool of threads managed by concurrent-ruby).
  2. A worker then picks up the event, serializes it, and sends it.

Pros

The background worker doesn't queue more than 30 events. So even when there's a spike, it's unlikely to consume all the memory.

Cons

For short-lived script programs, the process often exits before the worker is able to send the event. This is why hint: { background: false } is required in rake integrations.

This drawback has been addressed in https://github.com/getsentry/sentry-ruby/pull/1617.
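
For reference, the worker pool size is configurable via the SDK's background_worker_threads option (a minimal sketch; the option defaults to the number of processors on the machine):

  Sentry.init do |config|
    config.dsn = "https://examplePublicKey@o0.ingest.sentry.io/0"
    # Pool size defaults to the number of processors on the machine.
    config.background_worker_threads = 5
  end

  # Captured events are queued and sent off-thread automatically:
  Sentry.capture_message("something went wrong")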

Missing Events During a Spike Because of the Queue Limit

I know many users are concerned that the background worker's 30-event queue limit will make them lose events during a spike. But as the maintainer and a user of this SDK, I don't worry about it, because:

  1. The spike is likely to be an urgent case that will probably be fixed in a short time. So not seeing a few instances of other errors should not affect the overall coverage.
  2. Given these characteristics of the SDK's background worker:
    • The default number of background workers is determined by the number of processor cores on your machine.
    • They're a lot faster than using the async approach with a Sidekiq/Resque/etc. worker, for the reasons described above.
    • A 30-event queue is only shared within the process/web instance (depending on your concurrency model), not at a global level. If there's a spike big enough to overflow the SDK's queue and drop some events, it'll probably overflow your background job queue with the async option too, and/or do greater damage to your system.
  3. Sentry has a rate-limiting mechanism to prevent overflow on the platform side, which works by rejecting new events and telling the SDK not to send new events via a 429 response. When the SDK receives a 429 response from Sentry during a spike, it'll stop sending all events for a given period of time.

What I'm trying to say is: it's not possible to expect Sentry to accept all events during a big spike, regardless of which approach you use. But when a spike happens, async is more likely to become another bottleneck and/or cause other problems in your system.
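
To illustrate point 3 above, here is a conceptual sketch of how a transport can back off on a 429 response. This is not the SDK's actual implementation, just the general idea:

  require "net/http"

  # Conceptual sketch only: on a 429, pause all sending until the
  # server-provided Retry-After deadline has passed.
  def deliver(envelope, state)
    return :dropped if state[:retry_until] && Time.now < state[:retry_until]

    uri = URI("https://o0.ingest.sentry.io/api/0/envelope/")
    response = Net::HTTP.post(uri, envelope, "Content-Type" => "application/x-sentry-envelope")
    if response.code == "429"
      state[:retry_until] = Time.now + Integer(response["Retry-After"] || "60")
      :rate_limited
    else
      :sent
    end
  end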

My Opinion

The async option seems redundant now, and it can sometimes cause more harm than good. So I think we should drop it in version 5.0.

Questions

The above analysis is based only on my personal usage of the SDK and a few cases I've helped debug. So if you're willing to share your experience, I'd like to hear it.

Even though the decision has been made, we would still like to hear feedback about it:

josh-m-sharpe commented 3 years ago

Long-time user of sentry-ruby and sentry-raven. I didn't know this feature existed until a current migration of a large Rails app to Sentry. I noticed it while reading through the docs and attempted to set it up, but I immediately ran into issues and elected to defer in order to simplify the migration. In addition, we have a large number of complex queues. Injecting this option into our system would likely require a bit of thought so as not to bowl over important queues while keeping error reporting timely.

I suppose this is a vote to drop it?

louim commented 3 years ago

Hey! We currently use the async option, mostly because it was recommended in the docs when we set up the app a long time ago (not sure the background worker option was even a thing at that time πŸ‘΄πŸΌ).

I'm curious about that part:

The background worker doesn't queue more than 30 events. So even when there's a spike, it's unlikely to consume all the memory.

What would happen if there is a spike in events, let's say from a noisy error filling the queue. Would another error happening at the same time be silently dropped because the queue is full? Or am I misunderstanding how it works?

I'd like to switch away from async, because the JSON serialization from the async version means we have to check two versions of the payload when doing custom processing in before_send, e.g.:

  data = event.to_hash
  # Plain Event objects serialize with symbol keys...
  errors = data.dig(:exception, :values) if data[:exception]
  # ...but payloads round-tripped through async's JSON serialization have string keys.
  errors = data.dig("exception", "values") if data["exception"]
st0012 commented 3 years ago

not sure the background worker option was even a thing at that time πŸ‘΄πŸΌ

The background worker was added in v4.1.0. So it's a new thing for most users, I think πŸ™‚

What would happen if there is a spike in events, let's say from a noisy error filling the queue. Would another error happening at the same time be silently dropped because the queue is full? Or am I misunderstanding how it works?

As of now, the queue doesn't distinguish events. So if there's a spike of a particular error, other errors may not make it into the queue. But personally I'm not worried about this, because:

  1. The spike is likely to be an urgent case that will probably be fixed in a short time. So not seeing a few instances of other errors should not affect the overall coverage.
  2. Given these characteristics of the SDK's background worker:
    • The default number of background workers is determined by the number of processor cores on your machine.
    • They're a lot faster than using the async approach with a Sidekiq/Resque/etc. worker, due to the reason I described in the issue.
    • A 30-event queue is only shared within the process/web instance (depending on your concurrency model), not at a global level.

If there's a spike big enough to overflow the SDK's queue and drop some events, it'll probably overflow your background job queue with the async option too, and/or do greater damage to your system.

kzaitsev commented 3 years ago

At my job, we're doing fine with async because we already have Sidekiq in our stack. I can't understand why async can't remain an optional way to deliver events, instead of being deprecated. As a solution, you could highlight the possible issues with async in the documentation.

Maybe I don't understand the problem because we use Sentry only for exceptions, without APM.

st0012 commented 3 years ago

@Bugagazavr

There are two main costs of having this option around:

  1. Among all Sentry SDKs, sentry-ruby is the only one that supports such an option. This means we always need to consider this extra condition when making changes to the event-sending logic. And it'll make future SDK alignment harder. (It certainly made the sentry-raven -> sentry-ruby conversion harder.)
  2. We need to spend additional effort maintaining the code for this option, and sometimes the result isn't pretty:

https://github.com/getsentry/sentry-ruby/blob/42455c893b7f83000bf144fb2f06fc7c03bf0040/sentry-ruby/lib/sentry/client.rb#L119-L134

https://github.com/getsentry/sentry-ruby/blob/42455c893b7f83000bf144fb2f06fc7c03bf0040/sentry-rails/app/jobs/sentry/send_event_job.rb#L1-L33

If the upside were high, these costs wouldn't be an issue. That's why we've had it for many years. But since we already have a better solution for the problem (the background worker) with far fewer downsides, I don't think it's worth it now.

github-actions[bot] commented 2 years ago

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox πŸ₯€

benoittgt commented 2 years ago

We were using a custom Sidekiq task in an async block. But multiple times, we hit this issue:

When there is an event spike, it can flood the medium storage (Redis) and take down the entire system.

And it was a nightmare.

We just switched to the default by removing async, and we now use the BackgroundWorker that uses concurrent-ruby. It works perfectly for us for the moment. I will post a message on the next spike. πŸ˜„

trevorturk commented 2 years ago

I think removing async seems fine. I was using it to queue up a small custom ActiveJob that just called Sentry, but I'm happy to use the built-in method. I agree event spikes are fine to cap at X events, since Sentry would drop them anyway and you don't want to overwhelm your system. You might consider telling users in the docs that they can just remove the async line and things should work fine. I'm testing that now to be sure, but I believe that's the recommendation?

st0012 commented 2 years ago

@trevorturk Thanks for the feedback. The current docs on Sentry and RDoc both state that it's a deprecated option and point users to this issue for more info πŸ˜„

I believe that's the recommendation?

Yes, you should be able to delete it directly without any issue. But if you do hit one, please report it here and I'll fix it ASAP πŸ˜‰

trevorturk commented 2 years ago

It all worked fine when I just deleted it, thank you!

dillonwelch commented 2 years ago

But for short-lived script programs, the process often exits before the worker is able to send the event. This is why hint: { background: false } is required in rake integrations.

Can you elaborate on why this is and why this wouldn't be in https://github.com/getsentry/sentry-ruby/blob/efcf170b5f6dd65c3b047825bddd8fde87fc6b7b/sentry-ruby/lib/sentry/rake.rb?
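
(For readers unfamiliar with the hint quoted above, usage looks roughly like this; the error variable is assumed to hold an exception object:)

  # Skip the background worker and send this event synchronously,
  # e.g. from a short-lived script or rake task.
  Sentry.capture_exception(error, hint: { background: false })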

st0012 commented 2 years ago

@dillonwelch Sorry, I forgot to update the description. That issue has been addressed in https://github.com/getsentry/sentry-ruby/pull/1617 and doesn't exist anymore πŸ™‚

jordan-brough commented 2 years ago

For instrumenting AWS Lambda, it seems like it'd be ideal to use Lambda Extensions to deliver error reports. At first glance it looks like the async option would be the way to integrate with Lambda Extensions. Do any of the Sentry libraries for other languages hook into Lambda Extensions? Would there be another way to do that with the Sentry Ruby SDK?

st0012 commented 2 years ago

@jordan-brough I'm not familiar with AWS Lambda, but it looks like Sentry's Lambda Extension doesn't need language SDKs to work? It seems to integrate directly with AWS Lambda. Can you explain your use case in more detail?

sl0thentr0py commented 2 years ago

@jordan-brough Currently, AWS Lambda on Ruby is not a first-class feature like it is for Python/Node, where we have separate packages/integrations/layers implemented. AWS extensions are also something we're looking into right now for both those ecosystems. Ruby is not on the road map per se, but we'd be open to seeing if that's something the community wants. I'm making a new issue to gauge interest, since it's tangential here.

jordan-brough commented 2 years ago

@sl0thentr0py thanks πŸ‘ I've commented over in that issue.

jordan-brough commented 2 years ago

@st0012 How Lambda works is: AWS spins up an instance of some code you've written and invokes a "handler" method with an "event" payload (e.g. a web request). And then:

After the function ... [has] completed, Lambda maintains the execution environment for some time in anticipation of another function invocation. In effect, Lambda freezes the execution environment. When the function is invoked again, Lambda thaws the environment for reuse

https://docs.aws.amazon.com/lambda/latest/dg/runtimes-context.html

So you have some code that runs and then gets "frozen", and then maybe gets "unfrozen" again in the future to handle future events (e.g. web requests). But if you deploy new code, or if there's too long a delay between events, Lambda discards the frozen execution environment.

So if you have some async code, you may end up in this situation: the handler finishes and the execution environment is frozen before your queued async work gets to run. Your async code may then never run if Lambda ends up discarding the frozen execution environment, and Sentry events would get lost.

So currently, the way to make Lambda work reliably with Sentry would be to make Sentry operate synchronously. But then you have Sentry possibly slowing down your application code, and/or potentially affecting it if there were Sentry bugs/outages, etc. (even though I'm sure that would never happen! πŸ˜‰). Which I assume is one reason why Sentry runs asynchronously in the first place.

Last year AWS released a solution to this issue called "Lambda Extensions". You can use Lambda Extensions to let a Lambda function handle application code synchronously while also enqueuing asynchronous events to "extensions" (e.g. a Sentry extension) that don't block the main application code.

A configuration option like the async config discussed in this issue might be a good way to integrate with that, though I haven't looked into the code here or the details on that in depth.

hogelog commented 2 years ago

I agree with dropping the async option.

I'm developing a Rails app that uses config.async with Sidekiq, but I will disable this config and either use the background worker or send synchronously.

I was originally aware of this issue and was considering disabling config.async. However, before I disabled it, an error spike took down our Sidekiq Redis.

I plan to disable config.async and use the background worker. I'm a little concerned about the queue limit. However, I checked the latency of our Rails app and of the error-reporting HTTP requests (to sentry.io), and queue overflow will probably not occur.

It would be more reassuring to be able to monitor queue overflow. sentry-ruby records a queue-overflow event and sends it in a client report: https://github.com/getsentry/sentry-ruby/blob/5.2.0/sentry-ruby/lib/sentry/client.rb#L63 It would be great if these queue-overflow errors were visualized in sentry.io's UI.
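
(For reference, client reports can be toggled in config; a sketch, assuming a recent SDK version where the send_client_reports option exists:)

  Sentry.init do |config|
    # On by default in recent SDK versions; discarded-event counts
    # (including queue overflow) are reported back to Sentry.
    config.send_client_reports = true
  end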

Sending errors synchronously doesn't look like a bad approach either. Our Rails app's APM reports that the latency of HTTP requests to sentry.io is almost always within 300ms. I feel it's not that much of a problem for Rack workers to consume that amount of time when an error occurs.
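
(For reference: one way to get fully synchronous sending is to disable the worker pool, since a thread count of zero makes the SDK send in the calling thread.)

  Sentry.init do |config|
    # Zero worker threads disables the pool, so events are sent
    # synchronously in the calling thread.
    config.background_worker_threads = 0
  end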

sl0thentr0py commented 2 years ago

It would be great if these queue-overflow errors were visualized in sentry.io's UI.

@hogelog this is on the product road map, but I can't give you an ETA on when it will be shipped yet. But we certainly want to expose these statistics to the user eventually.

st0012 commented 2 years ago

Sending errors synchronously doesn't look like a bad approach either.

@hogelog I should also mention that if you have tracing enabled, transaction events need to be taken into consideration too. So I don't recommend sending errors synchronously if you do.

hogelog commented 2 years ago

It would be great if these queue-overflow errors were visualized in sentry.io's UI.

@hogelog this is on the product road map, but I can't give you an ETA on when it will be shipped yet. But we certainly want to expose these statistics to the user eventually.

I'm looking forward to it!

@hogelog I should also mention that if you have tracing enabled, transaction events need to be taken into consideration too. So I don't recommend sending errors synchronously if you do.

I'm not using tracing, so I wasn't aware. Considering the future, it may not be good to send events synchronously. Thanks!

ariccio commented 2 years ago

I greatly appreciate this extra warning message! It's a good warning. Can you tell me exactly what I should do? Should I simply remove this code from my codebase, or do I need to add something else?

  config.async = lambda do |event, hint|
    ::Sentry::SendEventJob.perform_later(event, hint)
  end

st0012 commented 2 years ago

@ariccio You can simply delete it πŸ˜‰
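
(In other words, a typical initializer just loses the assignment; a sketch, with a placeholder DSN:)

  Sentry.init do |config|
    config.dsn = "https://examplePublicKey@o0.ingest.sentry.io/0"
    # config.async removed; the built-in background worker now
    # sends events off-thread automatically.
  end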

vadviktor commented 2 years ago

We at my company have been plagued by the async feature for 1-2 years now; hitting the payload limit and the rate limit made us scratch our heads over how to safeguard against them (before Sentry internally remedied them). How did we catch these issues? By seeing Sidekiq brought to its knees and important messages going unprocessed.

Right now, we have decided to end the async reign, and I was glad to see this feature deprecated as of our last version update. πŸ™Œ