bugsnag / bugsnag-go

Automatic panic monitoring for Go and Go web frameworks, like negroni, gin, and revel
https://docs.bugsnag.com/platforms/go
MIT License
203 stars 69 forks source link

Deliver async reports from a goroutine pool #230

Closed opsengine closed 2 weeks ago

opsengine commented 4 months ago

Goal

The goal is to avoid creating a new goroutine for each report when using asynchronous mode. The motivation is that an application that creates more reports than the server can handle, may experience high CPU and memory utilization, which can cause further instability in the application.

This problem has been reported before https://github.com/bugsnag/bugsnag-go/issues/49

Design

Allocate a pool of worker goroutines. When asynchronous mode is enabled, Notify submits the reports to the pool in a non-blocking fashion. If pool is full, the report will be dropped. This approach maintains the asynchronous behavior while avoiding unlimited proliferation of goroutines.

Changeset

Testing

opsengine commented 4 months ago

Hi @DariaKunoichi, can I get some initial feedback on this idea?

DariaKunoichi commented 4 months ago

Hi @opsengine, thank you for preparing two solutions for this problem. Unfortunately we can't use solution with generics because we still need to support golang's older versions. Goroutine pool solution using pond is good but we need to consider how introducing new dependency can affect other users. They may already have a pooling library in use or maybe they want to have full control of the configuration. The solution we are thinking about is providing a way for users to set their own wrapper for the event delivery. That way users with no performance problems don't have to change anything and can still use old solution - but those concerned can just provide their own async pool's submit function and we will wrap delivery with it. That way user is fully responsible for thread pool configuration and we don't have to add a dependency.

opsengine commented 4 months ago

Unfortunately we can't use solution with generics because we still need to support golang's older versions.

Understood. I figured that backward compatibility needs to be preserved and discarded that idea.

Goroutine pool solution using pond is good but we need to consider how introducing new dependency can affect other users

Using pond with a large value of NumGoroutines is functionally equivalent to the old behavior of unlimited goroutines. I don't see how the caller application could notice the difference. So to address your concern we could set the default value to something higher.

They may already have a pooling library in use or maybe they want to have full control of the configuration.

A caller that wants to use its own pooling library will enable synchronous mode, so the pond pool will not be used. We considered writing a wrapper library that manages its own goroutine pool, but figured that if bugsnag offered this feature out of the box, the caller would not need to write any special logic, and just tweak the pool/queue size.

The solution we are thinking about is providing a way for users to set their own wrapper for the event delivery.

This would shift more responsibility to the caller. As a user I would very much prefer that the library did the pooling by default and exposed the relative configs. In other words, it's preferable that the library protects the caller from unlimited goroutine growth as a default behavior, instead of providing ways to address the issue when it has already happened. Dropping some bug reports is much better than compromising the application stability further.

DariaKunoichi commented 4 months ago

Okay, let me explain in more detail. I agree with the solution of providing a default limiter for the goroutines. My team members raised concerns that introducing a dependency library for thread pooling may, for some users, collide with other pooling libraries used. As we don't want to introduce conflicts in the dependencies we may need to write our own goroutine dispatcher with a channel of payloads (or similar solution). However we don't currently have time scheduled for working on this larger change and we would like to consider it alongside some other improvements to delivery for bugsnag-go. That's why we looked at an interim solution - to allow users to provide their own executors for delivery - while we schedule a more complete solution, with default limits, in due course.

opsengine commented 4 months ago

My team members raised concerns that introducing a dependency library for thread pooling may, for some users, collide with other pooling libraries used.

Can you give an example? I can't think of a situation where this would become problematic.

Two alternative ideas to fix this upstream: 1) Rate limiting. User sets how many reports per second are allowed. 2) Circuit breaker. If a number of consecutive reports fails to be delivered, disable bug reporting for a few seconds.

If none of these solutions are possible, we will have to either wrap the library or (worse) fork it.

tomlongridge commented 3 months ago

@opsengine – in the immediate term I would suggest forking this repo and using your solution would be the way to go for your usage.

Whilst we see the benefits of your PR, and I appreciate you taking the time to raise it, we need more time than we have available to our team to consider the impact in full to all users of this SDK. There are a few improvements, including this one, to payload delivery that we'd like to make and I'll look to get them included in our Q3 plan at which point your PR will form part of the discussions.

In the meantime, we'll leave the PR open for future discussion and potential use by other users and pick it up again in due course.

hannah-smartbear commented 2 weeks ago

Hi @opsengine,

Just wanted to let you know that we have released some changes to async event delivery in v2.5.0 of bugsnag-go. The notifier will now use a single goroutine with a channel to queue events. Please do let us know if you have any questions about this.

opsengine commented 2 weeks ago

@hannah-smartbear that's great, thank you! is this feature enabled by default or do I have to edit the configuration?

do I read it correctly that the client will enqueue up to 100 reports and deliver them sequentially to the bugsnag server?

hannah-smartbear commented 1 week ago

Hi @opsengine,

A single goroutine for async delivery is now enabled by default. This will indeed queue up to 100 events, with any new events added while the queue is full to be dropped.

Please let us know if you have any further questions about this.