bwidermeli opened this issue 3 years ago
Hi @bwidermeli, I have never seen an issue opened on this (and definitely haven't seen it in my own hot-shots usage). Really interesting. Could you share some of your hot-shots settings? I'm most interested in the 'protocol' option and whether you have tried changing it.
Sure! It is indeed an interesting issue. Here is the implementation we have in a wrapper around hot-shots.
const { StatsD } = require('hot-shots');

// Just an example of the defaultConfig object
const defaultConfig = {
  host: 'datadog',
  prefix: `test.app.`,
  maxBufferSize: 10000,
  mock: false,
  globalTags: {
    platform: 'prTest',
    application: 'test-app',
    scope: 'pre-prod',
  },
};

// The following sanitizes the project config to ignore what we won't use
// (projectConfig comes from the consuming project; empty here for the example):
const projectConfig = {};
const {
  globalTags: projectTags,
  ...projectValidConfig
} = projectConfig;

// Next we'll merge the project config with the default one
const config = {
  ...defaultConfig,
  ...projectValidConfig,
  globalTags: {
    ...projectTags,
    ...defaultConfig.globalTags, // Has precedence over the project tags so they can't override them
  },
};

/**
 * Create a new StatsD client instance
 */
const statsDClient = new StatsD(config);
So, we actually have no protocol setting yet, as we were on hot-shots@6.8.1 and are only now migrating to 8.6.1. Any ideas?
Even after upgrading to the newer version, we didn't need to set the protocol, though.
Just tried deploying a version with the protocol set to 'uds' and the problem is the same.
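For reference, the only change on our side was passing the protocol option through the same merged config, roughly like this ('uds' sends metrics over the agent's Unix domain socket instead of UDP):

// Same merged config as above, only adding the protocol option
const statsDClient = new StatsD({
  ...config,
  protocol: 'uds',
});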
Hmm. Not sure if anybody else has seen this and I'm still stumped. Anything unusual about your DogStatsD usage, and have you asked Datadog about it? Also, anything unusual in how you call hot-shots? Is it some kind of indirect or functional-programming call where it could accidentally be called a bunch of times? As you can see from my questions, I'm wondering about things outside of hot-shots. Other than adding debug tracing to see the calls going out, I'm not sure what else can be done in this library for this.
Hmm, I did ask Datadog, but they need us to set the log_level to info and trigger a flare for tracing. Is there any way we can do that through configuration? We cannot execute Linux commands on the production instances. Do you have any idea of another way to activate this tracing?
As for how we are calling it: nothing unusual. The simplest case is a histogram for the application's response time, just a single value with two tags, and even that fails, unluckily. There is no logic that could be called a bunch of times :/
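To be concrete, the call that fails is roughly this; the metric name, value, and tags below are placeholders rather than our real ones:

// Placeholder names and values, just to illustrate the shape of the call
const elapsedMs = 123;
statsDClient.histogram('response_time', elapsedMs, [
  'serviceaction:getrecentactivities',
  'scope:pre-prod',
]);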
Hey Brian, I just checked all this again and I can confirm that the issue is actually in the hot-shots library. I made a completely new implementation in a separate package, replacing hot-shots with node-dogstatsd updated locally to mirror the whole hot-shots implementation, and it just stops failing; you can see an example in the screenshot I attached.
Basically, with hot-shots it just keeps happening, as you can see in the image showing the serviceaction: 1getrecentactivities tag.
I still have no clue where the issue could be, but with this I wanted to rule out a problem in our own implementation or on Datadog's side, and it is indeed not a problem with our implementation. Any clue?
My only concern is whether it could be an issue in the buffer's queue logic, or maybe in the callback that receives the remaining bytes?
Thanks!
Update: just found the issue. It looks like it's a problem with the bufferHolder: when maxBufferSize is set to, for example, 10000 with no buffer interval, we hit this problem. I still don't know what the solution would be, but at least we now know the issue comes from the buffer, when the metric gets enqueued.
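To be precise, the combination that reproduces it for us is simply a large maxBufferSize with the flush interval left at its default, something like:

// Minimal combination that triggers the truncated metrics for us;
// bufferFlushInterval is left unset, so the library default applies
const statsDClient = new StatsD({
  host: 'datadog',
  maxBufferSize: 10000,
});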
Well, that's interesting! Sorry, I'll be a bit intermittent here for a while, so I may be slower to respond. Did changing the settings fix it? To be clear, I'd still want to keep this issue open until we have a test case and a fix for that combination, but I wanted to make sure the change at least works for you.
No problem at all! Yup, setting maxBufferSize to 0 fixes the issue, though it would be nice to know the real fix. I'm still looking for a solution, since it is much better to batch a few metrics than to send a packet for every single metric we push.
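For anyone else hitting this in the meantime, the workaround on our side is just overriding that one value in the wrapper config:

// Workaround: a maxBufferSize of 0 disables buffering, so every metric
// is sent immediately instead of being batched
const config = {
  ...defaultConfig,
  maxBufferSize: 0,
};
const statsDClient = new StatsD(config);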
Hello Brian!
So, we just found the issue. It looks like the problem is that the Datadog Agent has a maximum receive buffer of 8k bytes, and we were sending about 10k. There is no place where they actually document this; we discovered it in the Agent code, and there is actually an issue where they say "8k bytes is more than enough!".
Maybe this could be added to the hot-shots documentation, or a warning could be emitted, so nobody sets maxBufferSize to a value higher than 8192.
The max datagram size on the UDP protocol is 65k, but they set the Agent's buffer to 8k.
@bwidermeli Well, that's good to know, glad you figured it out. I've switched the title of this bug and will keep it open until the following is done: when using udp, maxBufferSize falls back to 8192 if set higher than that value, with a console log explaining why, and the README is updated to indicate the max value allowed.
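Something along these lines is what I have in mind; this is just a sketch, and the constant name and log wording are placeholders, not the final change:

// Sketch only: clamp maxBufferSize for UDP, since the Datadog Agent's
// default per-datagram receive buffer is 8192 bytes
const UDP_MAX_BUFFER_SIZE = 8192;

function normalizeMaxBufferSize(protocol, maxBufferSize) {
  if (protocol === 'udp' && maxBufferSize > UDP_MAX_BUFFER_SIZE) {
    console.log(
      `hot-shots: maxBufferSize of ${maxBufferSize} exceeds the agent's ` +
      `${UDP_MAX_BUFFER_SIZE}-byte datagram buffer; falling back to ${UDP_MAX_BUFFER_SIZE}`
    );
    return UDP_MAX_BUFFER_SIZE;
  }
  return maxBufferSize;
}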
Sounds great, Brian! Thanks a lot for taking this on!
From the official agent docs, the agent buffer size is configurable:
If you are using a community-supported DogStatsD client that supports buffering, make sure to configure a max datagram size that does not exceed the Agent-side per-datagram buffer size (8KB by default, configurable on the Agent with dogstatsd_buffer_size) and the network/OS max datagram size.
@xzyfer Yes, they added that after we (Mercado Libre) had a really long talk with their team about this. We also pinned the buffer size to a fixed 8KB in a wrapper we use across the whole company, so at least this will never happen again in our projects.
This isn't safe as currently implemented, because it compares the length of strings, which is NOT the same as comparing bytes. If the characters used happen to fall within the ASCII range, the two are the same, but each UTF-8 character can be up to 4 bytes, making the maximum safe value 2,048.
This should hold the buffer as a Buffer rather than waiting until it's about to send. I need to do some testing with it, but something like this:
https://github.com/ehaynes99/hot-shots/commit/89f899020c2254cd45076202a094c4b87e833377
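For illustration, the difference is easy to see with Buffer.byteLength; the names and values below are just an example, not the actual hot-shots internals:

// String length counts characters; Buffer.byteLength counts UTF-8 bytes
const metric = 'café.response_time:1|h'; // 'é' encodes as 2 bytes in UTF-8
console.log(metric.length);              // 22 characters
console.log(Buffer.byteLength(metric));  // 23 bytes

// Hypothetical helper: decide whether appending a metric would overflow
// the agent's 8192-byte datagram buffer, counting bytes rather than characters
function wouldOverflow(currentBuffer, nextMetric, maxBytes = 8192) {
  return Buffer.byteLength(currentBuffer) + Buffer.byteLength(nextMetric) > maxBytes;
}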
If you have the metric example with value 1, it gets pushed this way: e = 1, ex = 1, exa = 1, exam = 1, examp = 1, exampl = 1, example = 1.
I am not sure whether this happens due to a problem in the lib, or whether you might know why it is happening. I saw that it might occur when the message gets converted into a Buffer, but I don't think the problem is right there. I have been trying to replicate the issue locally with no luck, so I cannot debug it and do not know where it is actually happening. Right now we are spending thousands of USD on unusable metrics in Datadog because of this. For example:
You can see the issue in the inconsistent tag keys, or even in the scope and other keys.