PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0

Feature request to add duration awareness to max-queue-length option #4954

Open thechile opened 7 years ago

thechile commented 7 years ago

I would like to request a new feature flag with regard to the way max-queue-length works.

At the moment I'm performing some load tests on pdns and pdns-recursor. When testing recursor via resperf I quickly hit problems with the max-queue-length default of 5000. This is on a CentOS 7 server, so what happens is that within 2 seconds of starting the load test, pdns exits due to a backlog of MySQL backend traffic, at which point the systemd unit immediately restarts the service. If I bump the value of max-queue-length to 50,000 then it works better, in as much as the qsize-q value jumps up to 30,000 for 1 or 2 seconds and then recovers to 0, allowing the load test to complete.

I am aware that DNS traffic will at times be bursty and might overwhelm the backend, but rather than just using the queue size, it would be better to also consider how long the queue size has been high.

I know the overload-queue-length option was added, but I would prefer to be able to specify a millisecond threshold for how long max-queue-length has to be exceeded before the rather brutal killing of the process happens.
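To illustrate the requested behaviour, here is a minimal Python sketch of a duration-aware limit. It is not PowerDNS code: the class name `DurationGate` and the parameters `limit` and `grace_ms` are hypothetical, and the caller is assumed to supply a monotonic timestamp in milliseconds with each queue-size sample.

```python
from typing import Optional


class DurationGate:
    """Trips only after a queue limit has been exceeded continuously for a grace period."""

    def __init__(self, limit: int, grace_ms: int) -> None:
        self.limit = limit          # e.g. the max-queue-length value
        self.grace_ms = grace_ms    # hypothetical "full duration" threshold in ms
        self._over_since_ms: Optional[int] = None  # start of the current excursion

    def update(self, queue_size: int, now_ms: int) -> bool:
        """Record one queue-size sample.

        Returns True once the queue has stayed above the limit for at
        least grace_ms without dropping back to or below it.
        """
        if queue_size <= self.limit:
            self._over_since_ms = None  # burst drained, reset the timer
            return False
        if self._over_since_ms is None:
            self._over_since_ms = now_ms  # first sample above the limit
        return now_ms - self._over_since_ms >= self.grace_ms


gate = DurationGate(limit=5000, grace_ms=2000)
print(gate.update(30000, now_ms=0))     # burst starts
print(gate.update(30000, now_ms=1000))  # still inside the grace period
print(gate.update(0, now_ms=1500))      # queue recovered, timer resets
```

With `grace_ms=0` the gate trips on the first sample above the limit, which matches the current immediate behaviour, so such an option could default to 0 without changing existing setups.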

thank you

maikzumstrull commented 7 years ago

Maybe the "max-queue-length exceeded" behavior could be softened to tail-drop instead of suicide?

Regarding the OP though, I think you should increase max-queue-length. 5000 is just a conservative guess; it should be calibrated to however many queries you guesstimate you can drain in 2 seconds or so. Using overload-queue-length is also a good idea for real workloads; it's effectively a head-drop, which has better pushback signalling properties than tail-drop (or suicide).
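To make the head-drop versus tail-drop terminology concrete, here is a minimal Python sketch on a generic bounded FIFO. It is not how PowerDNS implements its backend queue; the function names and the `limit` parameter are illustrative assumptions, and `limit` is assumed to be at least 1.

```python
from collections import deque


def enqueue_tail_drop(queue: deque, item, limit: int):
    """Tail-drop: when the queue is full, reject the new arrival.

    Returns the dropped item, or None if nothing was dropped.
    """
    if len(queue) >= limit:
        return item  # the newest query is discarded
    queue.append(item)
    return None


def enqueue_head_drop(queue: deque, item, limit: int):
    """Head-drop: when the queue is full, evict the oldest entry to make room.

    Returns the dropped item, or None if nothing was dropped.
    """
    dropped = None
    if len(queue) >= limit:
        dropped = queue.popleft()  # the oldest query is discarded
    queue.append(item)
    return dropped


q = deque(["q1", "q2", "q3"])
print(enqueue_tail_drop(q, "q4", limit=3), list(q))
print(enqueue_head_drop(q, "q4", limit=3), list(q))
```

In the head-drop case the queue keeps the freshest queries, which are the ones whose clients are most likely still waiting for an answer.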

This might appear to hurt a synthetic benchmark, because you'll get a bunch of dropped queries during warm-up. This points to a deeper issue with your benchmark, though. Unless you're specifically trying to measure how the software behaves during warm-up, the benchmark should discard any results from the warm-up phase and only measure the steady state.
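As a rough sketch of discarding the warm-up phase, assuming the benchmark results are available as `(timestamp_seconds, value)` pairs in time order (a hypothetical format, not resperf's actual output), the filtering could look like this in Python:

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, measured value)


def steady_state(samples: List[Sample], warmup_s: float) -> List[Sample]:
    """Return only the samples taken at least warmup_s after the first one."""
    if not samples:
        return []
    start = samples[0][0]  # the run is assumed to begin at the first sample
    return [s for s in samples if s[0] - start >= warmup_s]


# Made-up numbers purely for illustration: (time, answered queries per second).
run = [(0.0, 1200.0), (1.0, 4100.0), (2.0, 9800.0), (3.0, 10050.0), (4.0, 9950.0)]
kept = steady_state(run, warmup_s=2.0)
print(sum(v for _, v in kept) / len(kept))  # mean over the steady-state samples only
```

The right `warmup_s` depends on how long the caches take to fill, so it is best chosen by looking at when the measured rate stops climbing.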

thechile commented 7 years ago

Thanks. I did look at overload-queue-length for its head-drop properties, but it was unclear what happens, once it is configured, when there is actually a problem with the backend and the large queue size persists. It would be good perhaps if I could configure max-queue-length=5000 and overload-queue-length=3000 and have an overload-queue-full-duration option in ms that specifies how long the overload-queue-length option is in effect before overflowing to the value specified in max-queue-length. Then again, I think it would also be good to have a max-queue-full-duration option.

Then I could use something like this:

overload-queue-length=5000              # If queue reaches this value then serve from packet cache only
overload-queue-full-duration=5000       # .. but only for this duration(ms) before overflowing to max-queue-length

max-queue-length=50000                  # Allow backend queue to reach this value
max-queue-full-duration=10000           # .. but only for this duration(ms) before killing pdns process

It would also be good if there were a configurable such as max-queue-length-action, so that either tail-drop or suicide could be specified.

thanks.