WebThingsIO / registration_server

The registration server for WebThings Gateway.
Mozilla Public License 2.0

DNS Outages #99

Open benfrancis opened 1 year ago

benfrancis commented 1 year ago

STR:

Expected:

Actual:

DNS outages have been happening regularly for many months now, and each one requires a reboot of the registration server EC2 instances to fix. We believe the cause is PowerDNS crashing, so that the registration server no longer resolves DNS lookups.

In the logs of the registration server Docker container there is an error which says "5001 questions waiting for database/backend attention. Limit is 5000, respawning". pdns then respawns, and after that has happened enough times the init system in the Docker container gives up and kills it. This is happening on both EC2 instances.

We think that the DNS servers are occasionally getting overwhelmed by traffic, but we don't know where it's coming from. I suspect it isn't WebThings users, because the logs show lots of failed lookups for subdomains that don't exist.

Some potential solutions:

  1. Configure rate limiting with something like dnsdist to set a limit on queries per second per IP address
  2. Re-configure pdns to use the gmysql back end so that pdns reads records directly from the database, rather than passing queries to the registration server, which then queries the database
  3. Modify the registration server by adding an option to use a hosted DNS service like Cloudflare as a back end, to take load off our EC2 instances. The downsides are that (a) we would be dependent on Cloudflare, and (b) we'd have to set a minimum TTL of 60 seconds, so there would be brief outages when a gateway changes IP (but at least not for the whole domain)
  4. Same as number 3, but re-write the registration server in Node.js so that more people are able to work on it (we have an IoT gateway written in Node.js and a cloud service written in Rust and it should probably be the other way around!)

My personal preference is to start with option 1 and see if it helps. I suspect the spikes in traffic are not coming from WebThings users and if we cut off the source of the excessive traffic the service would hopefully go back to being stable again.
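To make it concrete what "a limit on queries per second per IP address" means, here is a minimal Python sketch of a per-IP token bucket. It is only an illustration of the idea behind option 1, not dnsdist configuration and not code from the registration server. The class name `PerIPRateLimiter`, the `rate` and `burst` parameters, and the example addresses are all hypothetical. (dnsdist ships its own rule for this, `MaxQPSIPRule`, which is probably the place to start in practice.)

```python
import time


class PerIPRateLimiter:
    """Hypothetical per-client token bucket.

    Each IP may send `rate` queries per second on average,
    with short bursts of up to `burst` queries.
    """

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)    # tokens added per second
        self.burst = float(burst)  # maximum tokens a bucket can hold
        self.clock = clock         # injectable so the limiter can be tested
        self.buckets = {}          # ip -> (tokens, time of last update)

    def allow(self, ip):
        """Return True if a query from `ip` should be answered now."""
        now = self.clock()
        tokens, last = self.buckets.get(ip, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        # Out of tokens: the query would be dropped or refused.
        self.buckets[ip] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = PerIPRateLimiter(rate=5, burst=5)
    # A single noisy client sending 8 queries back to back.
    print([limiter.allow("203.0.113.7") for _ in range(8)])
```

With a limit like this in front of pdns, one client flooding lookups for non-existent subdomains would use up only its own bucket, and queries from other addresses would still get through. A real deployment would also need to expire idle entries from `buckets`, which this sketch leaves out.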

If anyone has experience of configuring rate limiting for pdns, I would be grateful for some help.