GuusHoutzager opened 8 years ago
I have the same issue. Can we have some attention please?
If a write plugin fails, collectd is expected to bail out and start dropping metrics sent to this destination. Such an event will be mentioned in the logs. Is that what happens in your respective cases?
Up to 42a62db (part of 5.5.1), collectd didn't identify a certain type of TCP-related failure scenario as a genuine error and continued to submit data until the kernel timed out the stalled socket (which on Linux defaults to roughly two hours), at which point the error-handling code in collectd kicked in. A workaround to avoid collectd ballooning up and consuming a ton of memory was to use the WriteQueueLimit* options.
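For reference, a minimal sketch of that workaround in collectd.conf; the limits here are illustrative values, not recommendations:

```
# Global options in collectd.conf.
# Below WriteQueueLimitLow nothing is dropped; between Low and High a
# growing fraction of new metrics is dropped; above WriteQueueLimitHigh
# everything new is dropped, so memory stays bounded even if a write
# plugin stalls.
WriteQueueLimitHigh 1000000
WriteQueueLimitLow   800000
```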
My understanding is that the write queue is there to handle temporary backpressure from blocking destinations (disk I/O saturation preventing rrdtool from fsyncing in a timely manner, network congestion between collectd and the graphite server, etc.). But it's not designed to ensure that no data gets lost when the target has a hard failure.
IMHO a better way to build a more resilient setup is to use a dedicated queuing system (the amqp or write_kafka plugins, or mqtt in the upcoming 5.6 release) and/or to configure several concurrent write plugins for redundancy.
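As a rough sketch of such a setup (host names and broker addresses below are placeholders), two write plugins can be loaded side by side, e.g. write_graphite plus write_kafka:

```
LoadPlugin write_graphite
LoadPlugin write_kafka

<Plugin write_graphite>
  <Node "primary">
    Host "graphite.example.com"   # placeholder
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>

<Plugin write_kafka>
  Property "metadata.broker.list" "kafka1.example.com:9092"  # placeholder
  <Topic "collectd">
    Format JSON
  </Topic>
</Plugin>
```

Each destination is served by its own write callback, so an outage of one backend doesn't by itself stop the other from receiving data.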
Why can't we use the write queue for hard failures (the remote TCP endpoint went away) and retry the connection later? Even a very resilient setup will always see temporary failures (failure is not an option, it is always there :-) ), so I would expect collectd to buffer metrics (within some limits) and retry establishing the connection every X seconds.
We are still facing this issue. Is there a solution or a bug fix in recent versions of collectd for this?
No, current versions still drop metrics on failure.
collectd doesn't retry failing write callbacks, and it never did; the metric was always dropped. However, failing write callbacks often have severely increased latency, for example when they try to re-establish a network connection. This call latency is what fills the queue.
We could retry failing write calls now that there's an upper bound on the queue length / memory consumption. This doesn't necessarily achieve the desired effect, though. Some plugins, write_http for example, build up a buffer and try to send it once it's full. If that fails, retrying only the last metric is not going to make much of a difference.
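To illustrate write_http's batching (the URL below is a placeholder, and this assumes the Node-style configuration of recent versions):

```
LoadPlugin write_http
<Plugin write_http>
  <Node "example">
    URL "https://metrics.example.com/collectd"  # placeholder endpoint
    Format JSON
    # Metrics are accumulated in this buffer and flushed as one request,
    # which is why retrying only the single metric that "failed" would
    # not recover the rest of the batch.
    BufferSize 4096
  </Node>
</Plugin>
```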
Alternatively, we could change the write callbacks to retry temporary failures internally (i.e. without returning), potentially blocking for a long while. This causes problems when you have more than one write plugin, because it starves write threads. Previously, when one plugin failed, the other ones continued to function; with this change, one failing plugin would halt all reporting.
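For context, the write threads in question come from a global pool, sized by the WriteThreads option (the value shown is the documented default):

```
# Global option in collectd.conf. All write callbacks share this pool,
# so if one plugin's callback blocks while retrying internally, it
# occupies a thread that every other write plugin also depends on.
WriteThreads 5
```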
All of these issues are solvable, but this problem is certainly near the "high complexity" end of the spectrum and that's probably why nobody has tackled it yet. If somebody wants to give this a shot and come up with a good design, I'm happy to review and give advice.
Best regards, —octo
These are possibly related to this issue: #2104, #2486, #2480
This issue was discussed in the community bi-weekly call, as noted in: https://collectd.org/wiki/index.php/MinutesApr17th20
Discussion provided:
Hi,
I remember collectd eating all memory when all write plugins failed, the last time I used it. Back then I used the WriteQueueLimitHigh and WriteQueueLimitLow parameters to limit that. Now, however, it seems like collectd is not queueing at all anymore. I've tested with 5.4, 5.5 and 5.5.1, but if I break the connection to graphite, I don't see collectd's memory usage go up, and if I use the CollectInternalStats option in 5.5 and 5.5.1, I don't see the write queue getting bigger either. As a result I get holes in my graphs when the connection to graphite is restored, since the older data no longer exists.

I tested with 5.4 from Ubuntu 14.04 in both default and custom configs, as well as with 5.5 and 5.5.1 on CentOS 7 with our puppet-generated config. Am I doing something stupid, or is something broken? Please let me know what you need from me to help debug this. I have various test environments to mess around with. Thanks!
Cheers,
Guus