Open moonshiner opened 1 year ago
Discuss BGP fallback: announce a covering /23 via BGP and, as a fallback, statically announce the more-specific /24s.
Should have 2 (or maybe more) IP addresses per address family.
Ideally addresses from different RIRs. The threat being addressed is a failure at an RIR, whether in governance, security, or technical operations.
Can comment on some HA design considerations with L4 load balancers.
Note to TF: figure out how to recommend an address that sticks in people's heads (1.1.1.1 is an example, but we don't want everyone fighting over every /8-based address).
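As a minimal sketch of checking such an addressing plan, the snippet below uses Python's standard ipaddress module with placeholder prefixes (not real allocations) to confirm that the static fallback /24s are covered by the announced aggregates and that both address families are present:

```python
# Sketch: sanity-check a fallback addressing plan; prefixes are placeholders only.
import ipaddress

# Covering aggregates announced via BGP (one per address family).
aggregates = [ipaddress.ip_network("192.0.2.0/23"),
              ipaddress.ip_network("2001:db8::/32")]

# More-specific prefixes kept ready as static fallbacks.
fallbacks = [ipaddress.ip_network("192.0.2.0/24"),
             ipaddress.ip_network("192.0.3.0/24"),
             ipaddress.ip_network("2001:db8:1::/48")]

for fb in fallbacks:
    covered = any(fb.version == agg.version and fb.subnet_of(agg)
                  for agg in aggregates)
    print(f"{fb} covered by an announced aggregate: {covered}")

# Both address families must be deployed.
print("IPv4 and IPv6 both present:", {agg.version for agg in aggregates} == {4, 6})
```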
Publishing a list of back-end addresses used for resolving can be useful for other network and DNS operators (for example, geo-IP location, making sure data is getting to the correct places, and so on).
Sign routes with RPKI; validating is more optional. Discuss the cost/benefit of validating (fewer spoofed or hijacked routes, at the cost of router cycles).
Capacity is definitely network dependent. Consider running your servers at around 30% capacity to leave headroom.
Capacity: CPU/network, multi-layer caching, how to estimate.
Resilience: diversity of software, geography, topology; bare metal vs. VM vs. containers; self-hosted vs. hosted vs. cloud.
Build your base capacity as a percentage of usage.
Gather a baseline from the throughput of your resolvers.
Gather performance numbers from vendors (for example, Knot Resolver's published benchmarks); see the worked example below.
Multi-layer resolution: caching at the resolver, then at a gateway; weigh the operational considerations.
Will need to decide about bare metal vs. VM vs. containers, and self-hosted vs. hosted vs. cloud.
Also how things are built / topology.
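Putting the capacity notes above together, a back-of-the-envelope worked example; every number is a placeholder to be replaced with your own measured baseline and vendor benchmarks:

```python
# Back-of-the-envelope capacity estimate; all numbers are placeholders.
import math

peak_qps = 400_000           # measured peak queries/second across the service
per_instance_qps = 100_000   # per-instance throughput from load tests or vendor benchmarks
target_utilization = 0.30    # run servers at ~30% capacity to keep headroom

# Queries/second we are willing to serve from one instance at the target utilization.
usable_qps = per_instance_qps * target_utilization

instances_needed = math.ceil(peak_qps / usable_qps)
print(f"Instances needed at {target_utilization:.0%} utilization: {instances_needed}")

# Keep at least one spare instance for failure tolerance (N+1).
print(f"With N+1 redundancy: {instances_needed + 1}")
```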
Running containers on bare metal. Bottleneck is network scaling. Cloud providers can autoscale relatively well; consider the cost of running in containers. Newly built instances would have empty DNS caches.
Bare metal vs. VMs vs. containers
Bottom line: guidance is needed on running DNS in containers, but the recommendation would be bare metal and/or appliances.
Use of network appliances for DNS load balancing in front of slow back-end servers.
If hardware is performance-limited (i.e., VMs), use network appliances to assist with caching; otherwise, bare metal servers give better performance.
Diversity of software: use two different vendors. Stay familiar with other vendors and have migration plans worked out in case of emergency.
With software vendors, take care with updates across the platform. Plan to deploy software updates slowly at first to watch for issues, and have rollback plans in place (see the staged-rollout sketch below). Ideas around operational best practices.
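As a sketch of the "deploy slowly, watch, roll back" practice, assuming a hypothetical fleet and placeholder upgrade/monitoring hooks (not any particular vendor's tooling):

```python
# Sketch of a staged (canary) rollout across a resolver fleet.
# The fleet, thresholds, and the upgrade/rollback/error_rate hooks are all
# hypothetical placeholders for your own deployment and monitoring tooling.
import time

fleet = [f"resolver-{i:02d}" for i in range(20)]   # hypothetical resolver hosts
stages = [0.05, 0.25, 0.50, 1.00]                  # fraction of the fleet per stage
ERROR_THRESHOLD = 0.01                             # e.g. tolerated SERVFAIL ratio
SOAK_SECONDS = 1800                                # watch each stage before continuing

def upgrade(host):
    print(f"upgrading {host}")                     # placeholder: apply the new version

def rollback(hosts):
    print(f"rolling back {hosts}")                 # placeholder: revert these hosts

def error_rate(hosts):
    return 0.0                                     # placeholder: ask your monitoring

upgraded = []
for fraction in stages:
    for host in fleet[:int(len(fleet) * fraction)]:
        if host not in upgraded:
            upgrade(host)
            upgraded.append(host)
    time.sleep(SOAK_SECONDS)                       # soak period: watch for issues
    if error_rate(upgraded) > ERROR_THRESHOLD:     # problem detected: stop and revert
        rollback(upgraded)
        break
```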
Also consider geography/topology: as close to the user base as possible.
Some candidate text on what we discussed today:
All DNS resolver software can run either on dedicated servers (rented or colocated), or in virtualized clouds, or in a combination of those. Every approach has pros and cons. Most of these are not specific to running DNS resolvers; however, some of them are.
Running DNS resolver instances as OS level daemons on bare metal hosts:
Pros:
Performance: Bare metal servers have direct access to the underlying hardware and can offer a superior performance/cost balance by avoiding the overhead associated with virtualization. Moreover, you have full control over the server's configuration, down to the hardware level, which can be beneficial for performance and cost optimization once you understand your typical workload during peak hours.
Data Security: Since you're in control of the physical servers, there's no risk of data leakage that can occur due to vulnerabilities in multi-tenant virtualization platforms, including CPU cache-based side-channel vulnerabilities. It could be argued that attacks targeting such issues are rare, and their impact on a DNS resolver service is low, but potential breaches may have significant privacy impact. It is advised to evaluate this against your organisation's risk model, or to discuss this with your information security compliance experts.
Predictability: Because there's no virtualization layer and no "noisy neighbours" on the host, the performance of your servers is more predictable.
Cons:
Cost of failure: If you pick a hardware configuration that is not optimal for the workload of your DNS resolver, you may need to upgrade or replace hardware components afterwards. Ways to reduce this risk include renting servers instead of buying them, carrying out load testing with data similar to production workloads, and providing limited beta access to the service before it fully enters the production phase.
Scalability: Scaling up with physical servers means acquiring or renting, installing, and configuring new hardware, which takes more time than provisioning new virtual servers in a cloud environment. Moreover, most cloud environments provide cluster autoscaling features, which can hardly be replicated on bare metal.
Maintenance: You'll be responsible for all server maintenance tasks, including hardware issues, which can require significant effort and specific expertise.
Redundancy: Setting up high availability and disaster recovery strategies can be more complex and time consuming compared to the cloud, where these features are often provided as value added products. See the Redundancy section for more details.
Running DNS resolver instances in containers in a public cloud:
Pros:
Scalability: Clouds excel at scaling applications. You can scale up and down rapidly based on load, which is important for a DNS resolver that needs to handle variable query loads. For regional or geographically distributed resolvers, daily periodicity is likely to be observed in every region where the resolver is deployed: the peak hour is likely to occur around 19:00 local time, and off-peak hours may begin at around 1:00-3:00 AM. In a situation like that, using cluster autoscaling features and tools, you can run fewer instances at night and more instances throughout the day, which may help optimize your cloud hosting costs (a scheduling sketch follows this pros/cons list).
Fault Tolerance and High Availability: Most clouds have built-in strategies, features, and products for handling node failures, which can increase your service's availability.
Deployment and Management: Cloud providers offer built-in methods to deploy and manage applications, which can simplify operations and reduce the likelihood of human errors if your infrastructure management department is already familiar with these tools.
Cost: While this largely depends on your specific usage, cloud services can sometimes be more cost-effective than managing your own physical servers, especially when you consider the total cost of ownership, including power, cooling, and maintenance.
Cons:
Performance: The virtualization layer of public clouds can impact performance. While this certainly could be mitigated through scaling the number of virtual hosts, the cost would also increase accordingly.
Complexity: Advanced cloud technologies are complex systems which come with a steep learning curve. Without prior experience, properly configuring and managing a cloud-based compute cluster can be challenging.
Cost Variability: While the cloud can be cheaper, it can also be more expensive if not properly managed. Costs can rise unexpectedly based on traffic. Make sure to set spending limits in the cloud control panel and to configure notifications for when these thresholds are about to be reached.
Multi-tenancy Risks: In a public cloud environment, the "noisy neighbor" problem could potentially affect your service's performance. Additionally, even though cloud providers take steps to isolate tenant environments, vulnerabilities could potentially expose sensitive data (see the previous section for a detailed explanation).
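To illustrate the diurnal scaling point from the pros list above, a minimal sketch with a hypothetical traffic profile and capacity figures; a real deployment would normally let the cloud provider's autoscaler follow load metrics rather than a fixed schedule:

```python
# Sketch: derive a target instance count from a typical diurnal traffic curve.
# The traffic profile and capacity figures below are hypothetical placeholders.
import math

# Approximate fraction of peak traffic by local hour: evening peak around 19:00,
# trough around 01:00-03:00, as described above.
traffic_fraction = {h: 0.3 for h in range(1, 6)}          # night trough
traffic_fraction.update({h: 0.7 for h in range(6, 17)})   # daytime plateau
traffic_fraction.update({h: 1.0 for h in range(17, 22)})  # evening peak
traffic_fraction.update({h: 0.5 for h in (22, 23, 0)})    # late evening

peak_qps = 400_000
usable_qps_per_instance = 30_000   # per-instance capacity at the chosen headroom
min_instances = 3                  # never scale below this, for redundancy

def desired_instances(hour: int) -> int:
    qps = peak_qps * traffic_fraction[hour]
    return max(min_instances, math.ceil(qps / usable_qps_per_instance))

for hour in (3, 12, 19):
    print(f"{hour:02d}:00 local -> {desired_instances(hour)} instances")
```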
Additional considerations
In today's environments, Kubernetes and Terraform are sometimes used as a substitute for cloud APIs when it comes to production services' management. When running a DNS resolver in a Kubernetes cluster on top of a public cloud environment, all the pros and cons of the public cloud apply; basically, Kubernetes becomes your public cloud provider. If you have significant prior experience running services in Kubernetes in production, you may successfully replicate your experience with the DNS resolver software. Otherwise, we would advise against Kubernetes in this case.
The only case we see for running a DNS resolver in a Kubernetes cluster on top of self-hosted dedicated servers is when you have significant hands-on experience with Kubernetes and it is natural for you to manage applications this way. Otherwise, running DNS resolver daemons in containers brings little, if any, benefit. Autoscaling features are not available to you in this case, and neither horizontal nor vertical pod autoscaling is of much use, because DNS resolver software typically scales within a host by itself just fine.
When designing a cluster of resolvers for autoscaling, keep in mind that newly spawned resolver machines need to populate their caches before they are fully useful. Your DNS resolver software may provide cache replication mechanisms. Otherwise, it is safe to overprovision clusters somewhat under heavy load and to discard the excess instances once the caches are populated and the average load per compute instance decreases.
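A minimal warm-up sketch, assuming the dnspython library, a placeholder instance address, and a hypothetical list of your most frequently queried names; the idea is to pre-query a freshly started instance before adding it to the load-balancer rotation:

```python
# Sketch: warm a freshly started resolver instance's cache before rotation.
# The instance address and name list are placeholders; requires dnspython.
import dns.exception
import dns.resolver

NEW_INSTANCE = "192.0.2.53"                     # placeholder address of the new resolver
POPULAR_NAMES = ["example.com", "example.net"]  # placeholder: your real top-N query list

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = [NEW_INSTANCE]
resolver.lifetime = 3.0                         # don't stall the warm-up on slow names

warmed = 0
for name in POPULAR_NAMES:
    for rdtype in ("A", "AAAA"):
        try:
            resolver.resolve(name, rdtype)      # populates the new instance's cache
            warmed += 1
        except dns.exception.DNSException:
            pass                                # best-effort: ignore failures

print(f"warm-up queries completed: {warmed}")
# Only after warm-up would the instance be added to the load-balancer pool.
```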
It is always advised to prefer environments your infrastructure management team is familiar with.
We should attempt to separate the network portions from the system portions where we can.
For example:
Care should be taken when allocating the IP networks used for public resolvers.
Both IPv4 and IPv6 MUST be deployed.
Egress filtering should be done, following BCP 38.
Additionally, route advertisements should be signed using RPKI.
For robustness, anycast should be used to announce network routes.
Not deploying some solution to announce networks from multiple locations will degrade system performance. (what is this?)
To add further resiliency, multiple network allocations from different RIRs should be considered, especially if you have global operations.