DataDog / datadog-go

go dogstatsd client library for datadog
MIT License

how to: handling "statsd buffer is full" error? #251

Closed mhratson closed 2 years ago

mhratson commented 2 years ago

How should callers handle the "statsd buffer is full" error?

At the moment a retry loop does the job, but I wonder if I'm missing anything, since the error is unexported and presumably not supposed to bubble up all the way to the caller.

// retry this error until the buffer is flushed (at most 3 attempts)
for i := 3; i > 0; i-- {
    err := d.Client.ServiceCheck(&sc)
    if err == nil {
        break
    }
    // the error is unexported, so compare error strings
    if err.Error() == "statsd buffer is full" {
        time.Sleep(1 * time.Second)
        continue
    }
    d.Logger.Error(err.Error())
    break
}

Thanks!

hush-hush commented 2 years ago

Hi @mhratson,

The client packs multiple messages into a buffer that is then sent to the Agent, so that multiple DogStatsD metrics, events, and service checks go out at once. This error occurs when a buffer is full. The error is caught internally, which triggers a flush of the current buffer to the sender. The worker then pulls a new buffer from the internal buffer pool and adds the message that didn't fit in the previous buffer to the new, empty one.

Therefore this error should never surface to the user unless your service check is larger than the maximum buffer size.

The maximum size of a buffer is equal to the WithMaxBytesPerPayload value, which defaults to 1432 bytes for UDP and named pipes and 8192 bytes for UDS (see this documentation). If you increase this value, you need to mirror the change on the Agent side by setting dogstatsd_buffer_size in datadog.yaml to an equal or higher value (see this documentation, and be careful about packet fragmentation).
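To make those limits concrete, here is a small sketch that checks a serialized message length against the defaults quoted above. The `transport` type and the helpers `defaultMaxBytesPerPayload`, `fitsInPayload`, and `agentAccepts` are hypothetical names for this example, not part of datadog-go, and the exact boundary in the real client may differ by a few bytes of framing.

```go
package main

import "fmt"

// transport is a hypothetical enum for this sketch; it is not a datadog-go type.
type transport int

const (
	udp transport = iota
	namedPipe
	uds
)

// defaultMaxBytesPerPayload returns the defaults quoted in this thread:
// 1432 bytes for UDP and named pipes, 8192 bytes for UDS.
func defaultMaxBytesPerPayload(t transport) int {
	if t == uds {
		return 8192
	}
	return 1432
}

// fitsInPayload reports whether a message of msgLen bytes fits in an empty
// buffer. A positive override stands in for a custom WithMaxBytesPerPayload value.
func fitsInPayload(msgLen int, t transport, override int) bool {
	limit := defaultMaxBytesPerPayload(t)
	if override > 0 {
		limit = override
	}
	return msgLen <= limit
}

// agentAccepts reports whether the Agent's dogstatsd_buffer_size is equal to
// or higher than the client's maximum payload size.
func agentAccepts(clientMaxBytes, agentBufferSize int) bool {
	return agentBufferSize >= clientMaxBytes
}

func main() {
	msg := make([]byte, 2000) // stands in for a 2000-byte serialized service check

	fmt.Println(fitsInPayload(len(msg), udp, 0))    // false: over the UDP default
	fmt.Println(fitsInPayload(len(msg), uds, 0))    // true: under the UDS default
	fmt.Println(fitsInPayload(len(msg), udp, 4096)) // true: raised client limit
	fmt.Println(agentAccepts(4096, 8192))           // true: Agent buffer is large enough
}
```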

In your use case, I would first double-check why your service check is so large (which I think is the main issue). Also, if you're using UDP, try moving to UDS (which also offers better performance). And lastly, for your main question: there is no reason to wait. With your current configuration the service check will never fit, and the error doesn't mean the DogStatsD client is full.

I agree that the error message can be misleading for users; I'll update it in the next version. Thanks for bringing this up!

mhratson commented 2 years ago

Therefore this error should never surface to the user unless your service check is larger than a maximum buffer size.

Yeah, in which case the user has to handle it, and keeping it unexported doesn't help, as there's no way to compare the error.

While it's not a big problem and callers can still compare error strings, it's a fragile approach that I think is still worth noting and improving, as is documenting the limits in ServiceCheck: https://github.com/DataDog/datadog-go/blob/496987906cfdbe16b66fdf01bdacc618958e6ab4/statsd/service_check.go#L22-L37

Thanks!

hush-hush commented 2 years ago

I opened a PR regarding the error wording: https://github.com/DataDog/datadog-go/pull/252