ahupowerdns opened 5 years ago
I think that's only a problem when subscribers do synchronous things, which is unusual, or when many subscribers do the same thing very frequently. A random TTL (0 < TTL <= remainingTTL) for that qname could be a solution.
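A minimal sketch of that suggestion in plain Lua (not any particular resolver's API; `jittered_ttl` is a hypothetical helper):

```lua
-- Answer with a uniformly random TTL in the range 0 < ttl <= remaining.
-- math.random(n) returns an integer in [1, n].
function jittered_ttl(remaining_ttl)
  return math.random(math.max(1, remaining_ttl))
end
```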
Sadly there are millions of VoIP devices which do this: there is a loop in them that keeps a hostname 'fresh' at all times. But fully randomising the TTL might be an idea, yes!
After elucidation by our helpful #powerdns IRC channel, it turns out 'random lower TTLs' do not actually help - you are still setting up a bunch of clients to come at you synchronised around the expiry of your actual TTL, but they now take several queries to get there! So first you randomly lower your TTL from 60 seconds to, say, 30. After 30 seconds, the client comes back, and you now give them a randomly lowered TTL of 10 seconds. Next up you give them a 5 second response. And when your actual TTL expires, you get an explosion of queries. So this has made the situation many, many times worse.
So, thinking about this: If your jitter can increase the TTL (say by 10%), then you run the risk that the record may never expire if there's enough resolvers chained together. You lose the property that TTLs can only ever decrease and thus must eventually expire.[0]
If the jitter can only decrease the TTL then you end up with:
T+0: Recursor receives a 60s TTL. It responds with a 55s TTL.
T+55s: Recursor responds with a 3s TTL.
T+58s: Recursor responds with a 2s TTL.
T+60s: Recursor expires the entry; everyone that has ever queried it in the last 60s all turn up immediately to refresh their cache too.
So, instead of there being a single spike, there are now lots of requests arriving culminating in the original large spike.
The only real way you can avoid this is to have a way of being able, at some point during the initial TTL, to return TTLs that are larger than the remaining TTL. The obvious way of doing this is by prefetching.
[0]: Although a lot of resolvers don't decrement a TTL that passes through them, which can mean that you might, if you're very unlucky end up with a record that doesn't expire.
I agree with Isomer's comments about reducing TTL - badness happens there from an operational perspective.
I thought a bit more about "randomly increasing TTL" as an option for this (as apparently has Isomer), but there would have to be concrete minimum and maximum numbers associated with the increase. So: a 10% increase, with a maximum of 20s, and if the TTL is less than 10s, don't do anything at all. Or create a randomization profile that is more likely to give out lower values in the possible range, which would trend quickly towards expiry of records. It is unlikely that this will lead to infinite TTLs, but I also recognize that increasing TTLs is dangerous and leads to unpredictability.
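As a sketch of that bounded increase in plain Lua (the 10%, 20s, and 10s figures are the ones proposed above; `bump_ttl` is a hypothetical helper):

```lua
-- Bounded upward jitter: add a random amount of up to 10% of the TTL,
-- capped at 20 seconds, and leave short TTLs (< 10s) untouched.
function bump_ttl(ttl)
  if ttl < 10 then
    return ttl
  end
  local max_bump = math.min(math.floor(ttl * 0.10), 20)
  return ttl + math.random(0, max_bump)
end
```

A skewed draw (e.g. `math.floor(math.random() ^ 2 * max_bump)`) would give the "more likely to hand out lower values" profile mentioned above.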
It seems the most elegant way to do this is aggressive, semi-random pre-fetching and a reset of the TTL for hot objects. This again will probably need both a percentage and maximum/minimum absolute values that can be set by the recursive operator. ("If the TTL is down to <20% of its original value, and is less than 30, and is greater than 5, then perform a pre-fetch for this object if the random number between 0-20 you guess is equal to 1.")
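Expressed as code, that rule might look like this (plain Lua; the thresholds are the ones quoted above, and the pre-fetch machinery itself is assumed to exist elsewhere):

```lua
-- Semi-random pre-fetch trigger for hot records: only consider entries
-- in the last 20% of their lifetime with between 5 and 30 seconds
-- remaining, then fire with probability roughly 1 in 21.
function should_prefetch(original_ttl, remaining_ttl)
  return remaining_ttl < original_ttl * 0.20
     and remaining_ttl < 30
     and remaining_ttl > 5
     and math.random(0, 20) == 1
end
```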
We have the same problem (a bazillion clients and an entry with TTL=60), and I was able to do some real-world testing now. So far I've had good success by using Lua on the recursor side to bump the affected TTLs by an additional random(30,60) seconds. The downside is that this setup requires skipping the dnsdist cache, so more traffic goes to our resolvers, but at least once it's spread out it stays at a reasonable level.
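For reference, a minimal sketch of that approach using the recursor's `postresolve` Lua hook (I don't know the commenter's actual script; `hot.example.com` stands in for the affected qname):

```lua
-- postresolve runs once the recursor has an answer; bump the TTLs of
-- the affected name by a random 30-60 seconds so that client caches
-- drift apart instead of expiring in lockstep.
function postresolve(dq)
  if dq.qname:equal("hot.example.com") then
    local records = dq:getRecords()
    for _, rec in pairs(records) do
      rec.ttl = rec.ttl + math.random(30, 60)
    end
    dq:setRecords(records)
    return true
  end
  return false
end
```

A packet cache in front (dnsdist's, for example) would hand the same cached answer, and thus the same jitter value, to everyone, which is presumably why the dnsdist cache has to be skipped here.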
Program: Recursor
Issue type: Feature request
Add the ability to serve records from the cache with the original TTL rather than the residual one.
I would like to distribute the load on the recursor more evenly by caching responses on the client side.
In my case, there are 1-2 recursive servers that serve requests from ~8000 clients. The clients have a local cache, but in the current implementation the recursive server hands out the remaining TTL for cache entries. This causes the client-side caches to run almost in sync with the recursor cache. When the RR lifetime expires on a client, it sends a request to the recursor, and since the TTL of this RR expired synchronously for all ~8000 clients, they all send a request to the recursor almost simultaneously. My suggestion should help spread the load over time, since the RR in the clients' caches would no longer expire synchronously, but would largely depend on when each client asked for this RR. I went through the RFCs, but I could not find an unambiguous position on this issue.
In general, I propose making the recursor return the initial TTL, which should be quite simple to implement, because entries in the cache are stored with the initial TTL and the remaining TTL is computed while responding to the client. If I'm wrong, please correct me.
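A sketch of the difference in plain Lua (a hypothetical cache structure, not the recursor's actual internals): the entry already carries everything needed to answer either way.

```lua
-- Each entry keeps the TTL it arrived with plus its absolute expiry
-- time; serving the original TTL is then just a different field read.
local cache = {}

function cache_put(qname, rrset, ttl)
  cache[qname] = { rrset = rrset, original_ttl = ttl,
                   expires = os.time() + ttl }
end

function cache_get(qname, serve_original_ttl)
  local e = cache[qname]
  if e == nil or os.time() >= e.expires then return nil end
  if serve_original_ttl then
    return e.rrset, e.original_ttl           -- the proposal
  end
  return e.rrset, e.expires - os.time()      -- current behaviour
end
```

The trade-off is that a client asking one second before expiry can then hold the record for almost twice the authoritative TTL.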
Thank you for your attention and thank you for your work!
Here's an idea. It's probably bad. What if you halve the incoming TTL, hand that halved TTL to clients without decrementing it, and delete-and-prefetch the entry just before the halved TTL runs out? E.g., with a record that has a TTL of 60 seconds:
T+0: Client query. Cache miss. Fetch record. TTL is 60 seconds. Change TTL to 30 seconds, set the dont_decrement_responses_to_clients flag.
T+10 seconds: Client query. Return cached response with TTL 30.
T+20 seconds: Client query. Return cached response with TTL 30.
T+29 seconds: Client query. Return cached response with TTL 30. Delete record from cache and queue a prefetch.
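Sketched in Lua (`fetch_upstream` and `queue_prefetch` are hypothetical stand-ins for the resolver's actual machinery):

```lua
local cache = {}

-- Serve a fixed, never-decremented TTL of half the upstream value, and
-- refresh just before that half-life runs out. A client answered at
-- the last moment (T+29s) holds the record until T+59s, still within
-- the original 60s.
function lookup(qname)
  local now = os.time()
  local e = cache[qname]
  if e == nil or now >= e.expires then        -- miss, or gone stale
    local rrset, ttl = fetch_upstream(qname)
    e = { rrset = rrset,
          fixed_ttl = math.floor(ttl / 2),    -- 60s becomes 30s
          expires = now + math.floor(ttl / 2) }
    cache[qname] = e
  elseif e.expires - now <= 1 then            -- e.g. T+29s
    cache[qname] = nil                        -- delete from cache...
    queue_prefetch(qname)                     -- ...and refetch async
  end
  return e.rrset, e.fixed_ttl                 -- never decremented
end
```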
Advantages: I don't know.
Disadvantages:
As we now provide DNS resolver service to millions of devices from a single dnsdist cache, we run into some bulk issues.
Let's say there is a record with a 60 second TTL. If we faithfully decrement that TTL, we are creating a population of clients that will all synchronously decide the record has expired and come back for an update.
Once they come back, they will all receive a fresh answer with TTL=60 again, but this creates a new wave of cached entries with the same 'TTD' (time to die).
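To make the mechanism concrete, here is a toy model in Lua (purely illustrative; eight clients stand in for millions):

```lua
-- Clients first ask at random times, but after one pass through a
-- TTL-decrementing cache they all share the same time-to-die.
local TTL, CLIENTS = 60, 8
local cache_ttd = nil                  -- when our cache entry dies
local client_ttd, queries_at = {}, {}

for t = 0, 180 do
  for c = 1, CLIENTS do
    client_ttd[c] = client_ttd[c] or math.random(0, 59)
    if t >= client_ttd[c] then         -- client's copy expired: it asks us
      queries_at[t] = (queries_at[t] or 0) + 1
      if cache_ttd == nil or t >= cache_ttd then
        cache_ttd = t + TTL            -- our own refresh
      end
      client_ttd[c] = cache_ttd        -- residual TTL => shared expiry
    end
  end
end

for t = 0, 180 do
  if queries_at[t] then
    print(("t=%3ds: %d queries"):format(t, queries_at[t]))
  end
end
```

The output shows the initial queries scattered over the first minute, and from then on the whole population arriving within the same second every 60 seconds.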
In one observed circumstance this has led to 160kqps peaks in queries every 15 minutes, from VoIP equipment.
In cases where we serve 'millions' of people from a single cache, we may want to employ some kind of jitter to spread out this coordinated refresh - although I don't yet know how we'd do that.
This is mostly an issue for dnsdist because the recursor typically has a number of separate caches already.