desec-io / desec-stack

Backbone of the deSEC Free Secure DNS Hosting Service
https://desec.io/
MIT License

nsmaster: extend signature rollovers over time (~days) #535

Open peterthomassen opened 3 years ago

peterthomassen commented 3 years ago

Query for stretching:

UPDATE domains SET last_check = CEIL(UNIX_TIMESTAMP() - rand() * 86400) WHERE last_check IS NOT NULL;
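To illustrate what the stretching query does, here is a minimal Python sketch of the same arithmetic on an in-memory list. The function name `stretch` and the sample timestamps are made up for illustration; only the formula (current time minus a uniform random fraction of 86400 seconds, rounded up, NULLs left alone) comes from the UPDATE above.

```python
import math
import random

DAY = 86400  # seconds; the window used in the UPDATE above


def stretch(last_checks, now, rng=random):
    """Mimic the UPDATE: every non-NULL last_check gets a fresh value drawn
    uniformly from the 24 hours before `now`; NULL (None) entries stay NULL."""
    return [
        None if lc is None else math.ceil(now - rng.random() * DAY)
        for lc in last_checks
    ]


now = 1_700_000_000  # arbitrary example timestamp (hypothetical)
clustered = [now - 5, now - 3, None, now - 1]  # values bunched together
stretched = stretch(clustered, now, random.Random(0))
```

After the call, the non-NULL values are spread over the last 24 hours instead of sitting within a few seconds of each other.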

Check:

SELECT count(*) AS lc_count, ceil(lc_mod/3600) MOD 24 AS lc_ceil FROM (SELECT *, last_check AS lc_mod FROM domains) AS b GROUP BY lc_ceil ORDER BY lc_ceil;
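For readers who want to see what the check query computes, the following Python sketch reproduces the bucketing `ceil(last_check / 3600) MOD 24` on a plain list. The helper name `hour_histogram` and the sample values are hypothetical. Unlike the SQL, which would put NULL rows in a group of their own, this sketch simply skips them.

```python
import math
from collections import Counter


def hour_histogram(last_checks):
    """Count entries per hour bucket, using ceil(last_check / 3600) % 24
    as in the check query. None (NULL) values are skipped here."""
    counts = Counter(
        math.ceil(lc / 3600) % 24 for lc in last_checks if lc is not None
    )
    return sorted(counts.items())  # ordered by bucket, like ORDER BY lc_ceil


# Hypothetical sample: two timestamps fall into bucket 5, two into bucket 6.
sample = [3600 * 5, 3600 * 5 - 10, 3600 * 5 + 1, 3600 * 30, None]
histogram = hour_histogram(sample)
```

A roughly flat histogram across all 24 buckets indicates that the checks are well spread out; a few dominant buckets indicate clustering.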

peterthomassen commented 3 years ago

We don't need to worry about this with pdns 4.5: https://github.com/PowerDNS/pdns/pull/10196

We thus should not spend time developing a permanent fix. If it resurfaces before pdns 4.5, we can just rerun the above queries.

peterthomassen commented 2 years ago

Although pdns 4.5 has AXFR priority levels, the problem still resurfaces when last_check values cluster around similar values on nsmaster. As a result, replication slows down, especially to remote POPs, and update delays become large enough to trigger monitoring alerts.

Recovery happens automatically once replication has caught up everywhere (around 30-45 minutes in North America, up to 75 minutes in Asia and South America, and up to 90 minutes in Oceania). Data from today and last week.

This is confirmed by running the above SQL for checking the hour of last_check (modified for Postgres due to a2c259d835c133755e2f10af5ea4b88092ca71e8):

SELECT count(*) AS lc_count, ceil(lc_mod/3600)::integer % 24 AS lc_ceil FROM (SELECT *, last_check AS lc_mod FROM domains) AS b GROUP BY lc_ceil ORDER BY lc_ceil;

Will have to think some more about how to address this permanently. (Perhaps by running the stretching SQL weekly, but something less hacky would be great.)

@nils-wisiol

peterthomassen commented 2 years ago

For the record, the Postgres statement corresponding to the MySQL UPDATE statement above is:

UPDATE domains SET last_check = CEIL(extract(epoch from now()) - random() * 86400) WHERE last_check IS NOT NULL;

NOTE: This update causes all freshness checks to be scheduled uniformly within the next 24 hours. Some checks will therefore happen "tomorrow" (close to 24 hours from now), even when signature rollovers are due "today". Consequently, publicly visible signatures will be valid for only 6 days into the future (instead of the usual 7 or more), which may irritate our monitoring.
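The arithmetic behind this note can be sketched in Python as follows. This is a simplified model under two assumptions that are not spelled out in this thread: that the next freshness check happens 24 hours after last_check, and that a rollover normally keeps at least 7 days of remaining signature validity visible. The function names are hypothetical.

```python
DAY = 86400  # seconds


def seconds_until_next_check(r, interval=DAY):
    """r is the random() value drawn by the UPDATE (0 <= r < 1).
    The UPDATE sets last_check = now - r * DAY. Assuming the next check
    runs at last_check + interval, it is interval - r * DAY seconds away."""
    return interval - r * DAY


def min_visible_validity_days(delay_seconds, usual_min_days=7):
    """Assumed model: each second a due rollover is postponed reduces the
    visible remaining signature validity by one second."""
    return usual_min_days - delay_seconds / DAY


worst_delay = seconds_until_next_check(0.0)   # r = 0: check a full day away
best_delay = seconds_until_next_check(0.999)  # r near 1: check almost at once
worst_validity = min_visible_validity_days(worst_delay)
```

In this model, a domain that draws a value of r close to 0 waits almost a full day for its next check, which is where the drop from 7 to 6 days of visible validity comes from.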