PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0

Sample-based stream event output for protobuf and dnstap logging #14515

Open johnhtodd opened 1 month ago

johnhtodd commented 1 month ago

Short description

I'd like to see an option added to newFrameStreamTcpLogger, newFrameStreamUnixLogger, protobufServer, and outgoingProtobufServer that would transmit a sampled subset of events rather than 100% of them.

Usecase

We are rapidly reaching the point where our telemetry systems spend more of their time discarding messages than processing the small sample of results that is left over. We implement a sampled ingestion model, but right now the sample rate is applied at ingestion on the telemetry server, so every message has to be transmitted, received, and processed, and then most are immediately thrown away. This is a big waste of resources. If the sampling could be applied on the transmitting side instead of the receiving side, overall resource utilization would be lower. For us, this applies to both dnstap and protobuf streams.

Description

I would like every dnstap and protobuf output on dnsdist and Recursor to accept a numeric option that sets a sample rate. The rate would be expressed as a ratio N, meaning one message in N is sent: for instance, "20" would mean 5%, 2 would mean 50%, 3 would mean about 33%, 4 would mean 25%, and so on. I suppose this would be applied randomly to each message before transmission, or it could be even deeper in the code. I have no insight into that, as long as the distribution is as even as possible across time and is not "bursty".
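To make the ratio semantics concrete, here is a minimal sketch in Python of two ways a 1-in-N rate could be applied. It is not PowerDNS code; the names `rate_to_fraction`, `RandomSampler`, and `CounterSampler` are hypothetical and exist only for illustration. The random variant keeps each message independently with probability 1/N, while the counter variant keeps exactly every N-th message, which is one way to avoid bursts.

```python
import random
from typing import Optional


def rate_to_fraction(n: int) -> float:
    """Map the proposed ratio number N to the fraction of messages kept."""
    if n < 1:
        raise ValueError("sample rate must be >= 1")
    return 1.0 / n


class RandomSampler:
    """Keep each message independently with probability 1/N (hypothetical helper)."""

    def __init__(self, n: int, rng: Optional[random.Random] = None) -> None:
        self.fraction = rate_to_fraction(n)
        self.rng = rng if rng is not None else random.Random()

    def keep(self) -> bool:
        # random() is uniform on [0, 1), so this is true with probability 1/N.
        return self.rng.random() < self.fraction


class CounterSampler:
    """Keep exactly every N-th message, so samples are evenly spaced (hypothetical helper)."""

    def __init__(self, n: int) -> None:
        rate_to_fraction(n)  # reuse the validation of n
        self.n = n
        self.count = 0

    def keep(self) -> bool:
        self.count += 1
        if self.count == self.n:
            self.count = 0
            return True
        return False
```

The random variant is stateless per message but only hits the target rate on average; the counter variant is exact over any window of N messages but is phase-locked to arrival order, which may or may not matter for a given telemetry pipeline.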

Additionally, the sample rate would need to be reflected somewhere in the monitoring statistics, since downstream systems would have to multiply their sampled counts by this figure to estimate the true message volume. It would need to show up somewhere in the Prometheus stats, for instance, in a form that makes clear which rate is applied to each socket/session that is sending telemetry data. This might necessitate tags/labels that contain the IP:port of the destination (or the socket name) so that they can be kept distinct in the statistics set.
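As a rough illustration of what a downstream consumer would do with that statistic, the Python sketch below scales sampled counts back up using a per-destination rate. The function name `estimate_totals` and the destination keys are hypothetical; how the rate would actually be exported (metric name, label layout) is not defined anywhere yet.

```python
from typing import Dict


def estimate_totals(sampled_counts: Dict[str, int],
                    sample_rates: Dict[str, int]) -> Dict[str, int]:
    """Estimate the original message volume per destination.

    sampled_counts: messages actually received, keyed by destination.
    sample_rates:   the ratio number N (1-in-N) reported for each destination.
    """
    return {
        dest: count * sample_rates[dest]
        for dest, count in sampled_counts.items()
    }


# Hypothetical destinations: one TCP endpoint and one Unix socket.
received = {"192.0.2.10:5301": 1200, "/var/run/dnstap.sock": 50}
rates = {"192.0.2.10:5301": 20, "/var/run/dnstap.sock": 4}
estimated = estimate_totals(received, rates)
```

Keeping the rate keyed per destination is what makes the estimate possible when different sockets are sampled at different rates.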

omoerbeek commented 1 month ago

I agree that it would be nice to have built-in, for all products.

Currently, for dnsdist, I think you can use ProbaRule to get the desired effect. This will not report the probability used unless you add a custom metric.

johnhtodd commented 1 month ago

ProbaRule would work, but only for dnsdist; thanks, I had forgotten about that method. I could use setMetric to do my own reporting for the probability rule. That would solve the issue in the short term, but I suspect that a consistent, fast way to do this sampling, built into the creation of the socket itself, would be welcome as a universal method across the product line(s).