Closed StefanWutz closed 2 years ago
This feature request feels a bit like a slippery slope. If we go down this route we could end up in a situation where we would have to add a new flag for every new logline that gets introduced. So to better understand the request could you maybe elaborate on why filtering the debug logs before shipping them to the central logging infrastructure would not satisfy the same requirements?
For example, the syslog-release supports setting customer rules, which get applied before the logs are shipped.
We are only interested in requests which are resolved by the upstream dns.
For this reason, this PR introduces a configurable log category (recursion_logging
).
We see bosh-dns request rates of 40req/s (with a peak of 85req/s), only 0,25 req/s (with a peak of 17req/s) are sent to the upstream dns.
Debug logs produce 3 log entries per request for all requests (plus additional debug logs): This results in 120 log entries per second and a peak of 255 log entries per second.
The suggested solution would only log recursions (requests which are sent to the upstream dns). It would not log internal domains and cached domain requests: This would result in 0,25 log entries per second and a peak of 17 log entries per second.
So the suggested approach would reduce the IO and CPU load on the instances significantly.
Furthermore, new bosh-dns releases might introduce additional debug logs, which suddenly popup and might cause additional load.
~300 messages per second doesn't sound like a terribly large volume of messages to filter out. Can you show us evidence that this volume of messages is actually a performance problem?
In general, I really like the idea of selectively logging categories of messages... BIND 9 is one DNS server that does this. However, to do that correctly would be a significant undertaking, and I'm pretty wary of only partially doing such a thing.
This pull request was marked as Stale
because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale
label or comment below.
Please don't close the PR, we are still looking into the performance impact of debug logs and other logging options.
This pull request was marked as Stale
because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale
label or comment below.
Please don't close the PR, we are still working on it.
I'm gonna convert this PR into a draft, since it is still being worked on.
This pull request was marked as Stale
because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale
label or comment below.
It seems that the requirements in #79 would fit here, with only the source IP added to the logged lines.
It seems that the requirements in #79 would fit here, with only the source IP added to the logged lines.
The requirement to log the source ip address seems independent to me of the logging configuration options. I'm however happy if this ends up being bundled within a PR offering a new "recursion-logging" opt-in
Tracing external DNS request is really useful and needed feature in case of DNS issues. Filtering the forwarded logs by custom rules of the syslog-release is not easy to implement when you have many deployments which are managed by different teams. I think that BOSH DNS should provide an option to trace DNS requests with a manageable amount of log entries per request. My personal opinion is that two options make sense here:
enable_dns_tracing
. In the future we can add more log entries to that if required.This pull request was marked as Stale
because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale
label or comment below.
We understand that broadening the configuration surface adds complexity. Following the argumentation of @beyhan, would it be an option to log recursions on INFO
? We're still hesitant to use DEBUG
in production. In the end this is the log level where engineers would expect that they can dump anything in without considering scale. Following-up any changes in debug logging in the future with adjusting syslog_forwarder rules sounds very cumbersome.
This pull request was marked as Stale
because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale
label or comment below.
@rkoster @klakin-pivotal any feedback on our suggestions (see the last two comments)?
@beyhan I stand by my statement made here: https://github.com/cloudfoundry/bosh-dns-release/pull/78#issuecomment-900510251, and have yet to see anything that changes my opinion.
@klakin-pivotal In the mean time we observe ~200 req/s which will produce ~600 messages per second if DEBUG
level is used. Trying to validate whether this amount of log entries causes a performance issue for the logging stack or the VMs itself is not the right path to follow IMHO. People expect that DNS just works without consuming any noticeable amount of CPU, Memory or IO. The main purpose of the VMs where BOSH-DNS is running is to run other workloads. The overhead of BOSH-DNS should be unnoticeable and shouldn't compete with the running workloads (like apps). A reasonable goal for me is to have a default configuration of BOSH-DNS which consumes as less resources as possible and provides the need information to analyse issues. That is why I don't think that running BOSH-DNS in DEBUG
mode on production is a good idea. So the latest suggestion is to change the log level of the external recursions from DEBUG
to INFO
because this information is required when analysing DNS issues caused by the communication with the IaaS providers DNS servers.
People expect that DNS just works without consuming any noticeable amount of CPU, Memory or IO.
I agree. But we're not talking about running DNS, we are talking about debugging DNS. Debugging DNS could take an arbitrarily large amount of resources, depending on how deep you need to go, and your request volume.
A reasonable goal for me is to have a default configuration of BOSH-DNS which consumes as less resources as possible and provides the need information to analyse issues.
I agree, to an extent. My ideal configuration for BOSH DNS would be to log nothing but infrequent INFO statements ("I am starting up!", "I am shutting down!", "I am reloading my configuration!"), and any ERROR or FATAL statements ("My configuration is invalid and I cannot start up!", "I'm crashing", etc, etc.).
Please don't forget that that the statement I linked to contains two parts. The second part is this:
In general, I really like the idea of selectively logging categories of messages... BIND 9 is one DNS server that does this. However, to do that correctly would be a significant undertaking, and I'm pretty wary of only partially doing such a thing.
I agree, to an extent. My ideal configuration for BOSH DNS would be to log nothing but infrequent INFO statements ("I >am starting up!", "I am shutting down!", "I am reloading my configuration!"), and any ERROR or FATAL statements ("My >configuration is invalid and I cannot start up!", "I'm crashing", etc, etc.).
I agree with that and even would extend it to: My ideal configuration is the one which allows me to take a decision during an outage without the need to re-deploy or to re-configure of BOSH DNS (which we are not allowed during an outage) because DNS issues have direct customers impact. At the moment logs for IaaS DNS errors are there but BOSH DNS doesn't provide any logs at INFO level for degraded state of the IaaS DNS services which could cause issues on customers side. This pr is aiming to solve that situation for external DNS request.
In general, I really like the idea of selectively logging categories of messages... BIND 9 is one DNS server that does this. >However, to do that correctly would be a significant undertaking, and I'm pretty wary of only partially doing such a thing.
This will be a really significant undertaking and also the performance of the current implementation has been already validated. Many performance and logging improvements have been implemented to improve the production sate e.g. https://github.com/cloudfoundry/bosh-dns-release/pull/57, https://github.com/cloudfoundry/bosh-dns-release/pull/61, https://github.com/cloudfoundry/bosh-dns-release/pull/54, https://github.com/cloudfoundry/bosh-dns-release/issues/52. Also, if I understand it correctly BIND 9 doesn't recommend the selective logging for production. From their github page:
The --enable-querytrace option causes named to log every step of processing every query. The --enable-singletrace option turns on the same verbose tracing, but allows an individual query to be separately traced by setting its query ID to 0. These options should only be enabled when debugging, because they have a significant negative impact on query performance.
It looks like to me that we won't be able to take a decision based on the discussion here. I would suggest to add this as a topic for the Working Group meeting on 27th of January (the meeting this week will be on a too short notice). @klakin-pivotal what do you think?
My ideal configuration is the one which allows me to take a decision during an outage without the need to re-deploy ... BOSH DNS ...
I agree very strongly with this.
...allows me to make a decision ... without the need to re-configure ... BOSH DNS ...
This is trickier. There are many things that can go wrong with DNS (or a DNS server), and if you don't want to be emitting tons of logs around all the time, then you will need to be able to selectively enable and disable logging.
Like Ruben said way back when, adding individual categories of things to INFO
level is a slippery slope. The next time someone needed to flip on DEBUG
logging to get some other piece of information, they're likely to want that at INFO
level, too. Given enough time, INFO
is too noisy, and folks are switching to NOTICE
level to escape the noise.
BIND 9 doesn't recommend the selective logging for production.
The BIND options that I'm talking about modeling our selective logging system after are these options: http://www.zytrax.com/books/dns/ch7/logging.html
Also, my hunch is that setting the resolver
log (and maybe also queries
) category to debug
will produce the sort of data that the fellow who opened this PR is looking to get.
I would suggest to add this as a topic for the Working Group meeting on 27th of January... @klakin-pivotal what do you think?
I know next to nothing about this Working Group, so I don't have enough information to have an informed opinion.
Would changing the configuration interface to allow for more granular control down the line, once more log topics are introduced be an acceptable solution to y'all?
Something like:
logging_categories:
default: []
example:
- recursion: info
- queries: debug
Instead of:
enable_recursion_logging:
default: false
For me this makes sense. It also provides an approach for future log changes if required.
In general a lightweight approach to introduce logging categories with their own log levels sounds reasonable. From your example it's not yet clear to me how it would actually look like. What I understand is, that we would have the default
category, which would be what all loggers currently use. Then we could add a new category, like recursion
and set it to a different log level. Implementation wise that would for example require that all logging that happens in the context of recursion would need a separately instantiated logger, that would run with the specific recursion log configuration or the default one, if no specific configuration exists. So in this scenario we would configure:
logging_categories:
default: [] # here it is not clear to me, why this is an array and why default is necessary at all
example: # why do we need this hierarchy level, would we expect more names?
- recursion: debug
Could it not look like:
log_level: error # the property that already exists and which is the default
logging_categories:
- recursion: debug
Or:
logging_categories:
- default: error # is doing the same as the original log_level property
- recursion: debug
With the separate log level for e.g. recursions it would be safer to run with debug. Still it could produce more logs than we have now with the original proposal, where we have a single log entry per recursion. And yes the values above don't sound big, but we have rates of 270k/min hitting the logging stack, with 99.x% being recursions and then 1 or 3 or 5 lines makes a big difference. Still it might be an acceptable approach. Extending the logs of the special categories could be something also announced in release notes of new versions and it would be a smaller target to filter out.
@friegger my example was just using the bosh job spec format. Which has the default, description and example keys. Sorry for the confusion, the idea was indeed that the rendered config passed to the bosh-dns process would be something like:
log_level: error # the property that already exists and which is the default
logging_categories:
- recursion: debug
This pull request was marked as Stale
because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale
label or comment below.
This pull request was closed because it has been labeled Stale
for 7 days without subsequent activity. Feel free to re-open this pull request at any time by commenting below.
Thanks everyone for settling on a configuration surface that's acceptable. Once an implementation is ready the PR will be reopened. It will take some time until this work can be picked up.
This feature allows recursion logging to stdout on INFO level. Currently this is only possible using
DEBUG
level which is not an option in production.Using
DEBUG
level in production might overload the logging platform with too many messages. We need recursion messages to analyze our dns traffic, this approach is to reduce the number of log messages.As a followup we need to filter this with a positive list of domains, but first we try to filter the messages on our logging platform.
Co-authored-by: Matthias Vach