cloudfoundry / bosh-dns-release

BOSH DNS release
Apache License 2.0
18 stars 36 forks source link

Adding a dedicated recursion logging #78

Closed StefanWutz closed 2 years ago

StefanWutz commented 3 years ago

This feature allows recursion logging to stdout on INFO level. Currently this is only possible using DEBUG level which is not an option in production.

Using DEBUG level in production might overload the logging platform with too many messages. We need recursion messages to analyze our dns traffic, this approach is to reduce the number of log messages.

As a followup we need to filter this with a positive list of domains, but first we try to filter the messages on our logging platform.

Co-authored-by: Matthias Vach

rkoster commented 2 years ago

This feature request feels a bit like a slippery slope. If we go down this route we could end up in a situation where we would have to add a new flag for every new logline that gets introduced. So to better understand the request could you maybe elaborate on why filtering the debug logs before shipping them to the central logging infrastructure would not satisfy the same requirements?

For example, the syslog-release supports setting customer rules, which get applied before the logs are shipped.

KaiHofstetter commented 2 years ago

We are only interested in requests which are resolved by the upstream dns. For this reason, this PR introduces a configurable log category (recursion_logging).

We see bosh-dns request rates of 40req/s (with a peak of 85req/s), only 0,25 req/s (with a peak of 17req/s) are sent to the upstream dns.

Debug logs produce 3 log entries per request for all requests (plus additional debug logs): This results in 120 log entries per second and a peak of 255 log entries per second.

The suggested solution would only log recursions (requests which are sent to the upstream dns). It would not log internal domains and cached domain requests: This would result in 0,25 log entries per second and a peak of 17 log entries per second.

So the suggested approach would reduce the IO and CPU load on the instances significantly.

Furthermore, new bosh-dns releases might introduce additional debug logs, which suddenly popup and might cause additional load.

klakin-pivotal commented 2 years ago

~300 messages per second doesn't sound like a terribly large volume of messages to filter out. Can you show us evidence that this volume of messages is actually a performance problem?

In general, I really like the idea of selectively logging categories of messages... BIND 9 is one DNS server that does this. However, to do that correctly would be a significant undertaking, and I'm pretty wary of only partially doing such a thing.

bosh-admin-bot commented 2 years ago

This pull request was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale label or comment below.

KaiHofstetter commented 2 years ago

Please don't close the PR, we are still looking into the performance impact of debug logs and other logging options.

bosh-admin-bot commented 2 years ago

This pull request was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale label or comment below.

KaiHofstetter commented 2 years ago

Please don't close the PR, we are still working on it.

rkoster commented 2 years ago

I'm gonna convert this PR into a draft, since it is still being worked on.

bosh-admin-bot commented 2 years ago

This pull request was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale label or comment below.

bgandon commented 2 years ago

It seems that the requirements in #79 would fit here, with only the source IP added to the logged lines.

gberche-orange commented 2 years ago

It seems that the requirements in #79 would fit here, with only the source IP added to the logged lines.

The requirement to log the source ip address seems independent to me of the logging configuration options. I'm however happy if this ends up being bundled within a PR offering a new "recursion-logging" opt-in

beyhan commented 2 years ago

Tracing external DNS request is really useful and needed feature in case of DNS issues. Filtering the forwarded logs by custom rules of the syslog-release is not easy to implement when you have many deployments which are managed by different teams. I think that BOSH DNS should provide an option to trace DNS requests with a manageable amount of log entries per request. My personal opinion is that two options make sense here:

bosh-admin-bot commented 2 years ago

This pull request was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale label or comment below.

friegger commented 2 years ago

We understand that broadening the configuration surface adds complexity. Following the argumentation of @beyhan, would it be an option to log recursions on INFO? We're still hesitant to use DEBUG in production. In the end this is the log level where engineers would expect that they can dump anything in without considering scale. Following-up any changes in debug logging in the future with adjusting syslog_forwarder rules sounds very cumbersome.

bosh-admin-bot commented 2 years ago

This pull request was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale label or comment below.

beyhan commented 2 years ago

@rkoster @klakin-pivotal any feedback on our suggestions (see the last two comments)?

klakin-pivotal commented 2 years ago

@beyhan I stand by my statement made here: https://github.com/cloudfoundry/bosh-dns-release/pull/78#issuecomment-900510251, and have yet to see anything that changes my opinion.

beyhan commented 2 years ago

@klakin-pivotal In the mean time we observe ~200 req/s which will produce ~600 messages per second if DEBUG level is used. Trying to validate whether this amount of log entries causes a performance issue for the logging stack or the VMs itself is not the right path to follow IMHO. People expect that DNS just works without consuming any noticeable amount of CPU, Memory or IO. The main purpose of the VMs where BOSH-DNS is running is to run other workloads. The overhead of BOSH-DNS should be unnoticeable and shouldn't compete with the running workloads (like apps). A reasonable goal for me is to have a default configuration of BOSH-DNS which consumes as less resources as possible and provides the need information to analyse issues. That is why I don't think that running BOSH-DNS in DEBUG mode on production is a good idea. So the latest suggestion is to change the log level of the external recursions from DEBUG to INFO because this information is required when analysing DNS issues caused by the communication with the IaaS providers DNS servers.

klakin-pivotal commented 2 years ago

People expect that DNS just works without consuming any noticeable amount of CPU, Memory or IO.

I agree. But we're not talking about running DNS, we are talking about debugging DNS. Debugging DNS could take an arbitrarily large amount of resources, depending on how deep you need to go, and your request volume.

A reasonable goal for me is to have a default configuration of BOSH-DNS which consumes as less resources as possible and provides the need information to analyse issues.

I agree, to an extent. My ideal configuration for BOSH DNS would be to log nothing but infrequent INFO statements ("I am starting up!", "I am shutting down!", "I am reloading my configuration!"), and any ERROR or FATAL statements ("My configuration is invalid and I cannot start up!", "I'm crashing", etc, etc.).

Please don't forget that that the statement I linked to contains two parts. The second part is this:

In general, I really like the idea of selectively logging categories of messages... BIND 9 is one DNS server that does this. However, to do that correctly would be a significant undertaking, and I'm pretty wary of only partially doing such a thing.

beyhan commented 2 years ago

I agree, to an extent. My ideal configuration for BOSH DNS would be to log nothing but infrequent INFO statements ("I >am starting up!", "I am shutting down!", "I am reloading my configuration!"), and any ERROR or FATAL statements ("My >configuration is invalid and I cannot start up!", "I'm crashing", etc, etc.).

I agree with that and even would extend it to: My ideal configuration is the one which allows me to take a decision during an outage without the need to re-deploy or to re-configure of BOSH DNS (which we are not allowed during an outage) because DNS issues have direct customers impact. At the moment logs for IaaS DNS errors are there but BOSH DNS doesn't provide any logs at INFO level for degraded state of the IaaS DNS services which could cause issues on customers side. This pr is aiming to solve that situation for external DNS request.

In general, I really like the idea of selectively logging categories of messages... BIND 9 is one DNS server that does this. >However, to do that correctly would be a significant undertaking, and I'm pretty wary of only partially doing such a thing.

This will be a really significant undertaking and also the performance of the current implementation has been already validated. Many performance and logging improvements have been implemented to improve the production sate e.g. https://github.com/cloudfoundry/bosh-dns-release/pull/57, https://github.com/cloudfoundry/bosh-dns-release/pull/61, https://github.com/cloudfoundry/bosh-dns-release/pull/54, https://github.com/cloudfoundry/bosh-dns-release/issues/52. Also, if I understand it correctly BIND 9 doesn't recommend the selective logging for production. From their github page:

The --enable-querytrace option causes named to log every step of processing every query. The --enable-singletrace option turns on the same verbose tracing, but allows an individual query to be separately traced by setting its query ID to 0. These options should only be enabled when debugging, because they have a significant negative impact on query performance.

It looks like to me that we won't be able to take a decision based on the discussion here. I would suggest to add this as a topic for the Working Group meeting on 27th of January (the meeting this week will be on a too short notice). @klakin-pivotal what do you think?

klakin-pivotal commented 2 years ago

My ideal configuration is the one which allows me to take a decision during an outage without the need to re-deploy ... BOSH DNS ...

I agree very strongly with this.

...allows me to make a decision ... without the need to re-configure ... BOSH DNS ...

This is trickier. There are many things that can go wrong with DNS (or a DNS server), and if you don't want to be emitting tons of logs around all the time, then you will need to be able to selectively enable and disable logging.

Like Ruben said way back when, adding individual categories of things to INFO level is a slippery slope. The next time someone needed to flip on DEBUG logging to get some other piece of information, they're likely to want that at INFO level, too. Given enough time, INFO is too noisy, and folks are switching to NOTICE level to escape the noise.

BIND 9 doesn't recommend the selective logging for production.

The BIND options that I'm talking about modeling our selective logging system after are these options: http://www.zytrax.com/books/dns/ch7/logging.html

Also, my hunch is that setting the resolver log (and maybe also queries) category to debug will produce the sort of data that the fellow who opened this PR is looking to get.

I would suggest to add this as a topic for the Working Group meeting on 27th of January... @klakin-pivotal what do you think?

I know next to nothing about this Working Group, so I don't have enough information to have an informed opinion.

rkoster commented 2 years ago

Would changing the configuration interface to allow for more granular control down the line, once more log topics are introduced be an acceptable solution to y'all?

Something like:

logging_categories:
  default: []
  example:
  - recursion: info
  - queries: debug

Instead of:

enable_recursion_logging:
  default: false
beyhan commented 2 years ago

For me this makes sense. It also provides an approach for future log changes if required.

friegger commented 2 years ago

In general a lightweight approach to introduce logging categories with their own log levels sounds reasonable. From your example it's not yet clear to me how it would actually look like. What I understand is, that we would have the default category, which would be what all loggers currently use. Then we could add a new category, like recursion and set it to a different log level. Implementation wise that would for example require that all logging that happens in the context of recursion would need a separately instantiated logger, that would run with the specific recursion log configuration or the default one, if no specific configuration exists. So in this scenario we would configure:

logging_categories:
  default: [] # here it is not clear to me, why this is an array and why default is necessary at all
  example: # why do we need this hierarchy level, would we expect more names?
  - recursion: debug

Could it not look like:

log_level: error # the property that already exists and which is the default
logging_categories:
- recursion: debug

Or:

logging_categories:
- default: error # is doing the same as the original log_level property
- recursion: debug

With the separate log level for e.g. recursions it would be safer to run with debug. Still it could produce more logs than we have now with the original proposal, where we have a single log entry per recursion. And yes the values above don't sound big, but we have rates of 270k/min hitting the logging stack, with 99.x% being recursions and then 1 or 3 or 5 lines makes a big difference. Still it might be an acceptable approach. Extending the logs of the special categories could be something also announced in release notes of new versions and it would be a smaller target to filter out.

rkoster commented 2 years ago

@friegger my example was just using the bosh job spec format. Which has the default, description and example keys. Sorry for the confusion, the idea was indeed that the rendered config passed to the bosh-dns process would be something like:

log_level: error # the property that already exists and which is the default
logging_categories:
- recursion: debug
bosh-admin-bot commented 2 years ago

This pull request was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be close. To prevent this from happening remove the Stale label or comment below.

bosh-admin-bot commented 2 years ago

This pull request was closed because it has been labeled Stale for 7 days without subsequent activity. Feel free to re-open this pull request at any time by commenting below.

friegger commented 2 years ago

Thanks everyone for settling on a configuration surface that's acceptable. Once an implementation is ready the PR will be reopened. It will take some time until this work can be picked up.