Open moonshiner opened 1 year ago
Discuss BGP fallback: announce a covering /23 via BGP and, as a fallback, statically announce the more-specific /24s.
Should have 2 (or maybe more) IP addresses per address family.
Ideally addresses from different RIRs. The threat being addressed is a failure at an RIR, whether in governance, security, or technical operations.
Can comment on some HA design considerations with L4 load balancers.
Note to TF: figure out how to recommend an address that sticks in people's heads (1.1.1.1 is an example, but we don't want everyone fighting over every /8-based address).
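As a minimal sketch of checking such an addressing plan, the snippet below uses Python's standard ipaddress module with placeholder prefixes (not real allocations) to confirm that the static fallback /24s are covered by the announced aggregates and that both address families are present:

```python
# Sketch: sanity-check a fallback addressing plan; prefixes are placeholders only.
import ipaddress

# Covering aggregates announced via BGP (one per address family).
aggregates = [ipaddress.ip_network("192.0.2.0/23"),
              ipaddress.ip_network("2001:db8::/32")]

# More-specific prefixes kept ready as static fallbacks.
fallbacks = [ipaddress.ip_network("192.0.2.0/24"),
             ipaddress.ip_network("192.0.3.0/24"),
             ipaddress.ip_network("2001:db8:1::/48")]

for fb in fallbacks:
    covered = any(fb.version == agg.version and fb.subnet_of(agg)
                  for agg in aggregates)
    print(f"{fb} covered by an announced aggregate: {covered}")

# Both address families must be deployed.
print("IPv4 and IPv6 both present:", {agg.version for agg in aggregates} == {4, 6})
```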
Publishing a list of back-end addresses used for resolving can be useful for other network and DNS operators (for example, geo-IP location, making sure data is getting to the correct places, and so on).
Sign routes with RPKI; validating is more optional. Discuss the cost/benefit of validating (fewer spoofed or hijacked routes, at the cost of router cycles).
Capacity is definitely network dependent. Consider running your servers at around 30% capacity to leave headroom.
Capacity: CPU/network, multi-layer caching, how to estimate.
Resilience: diversity of software, geography, topology; bare metal vs. VM vs. containers; self-hosted vs. hosted vs. cloud.
Build your base capacity as a percentage of usage.
Gather a baseline from the throughput of your resolvers.
Gather performance numbers from vendors (for example, Knot Resolver's published benchmarks); see the worked example below.
Multi-layer resolution: caching at the resolver, then at a gateway; weigh the operational considerations.
Will need to decide about bare metal vs. VM vs. containers, and self-hosted vs. hosted vs. cloud.
Also how things are built / topology.
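Putting the capacity notes above together, a back-of-the-envelope worked example; every number is a placeholder to be replaced with your own measured baseline and vendor benchmarks:

```python
# Back-of-the-envelope capacity estimate; all numbers are placeholders.
import math

peak_qps = 400_000           # measured peak queries/second across the service
per_instance_qps = 100_000   # per-instance throughput from load tests or vendor benchmarks
target_utilization = 0.30    # run servers at ~30% capacity to keep headroom

# Queries/second we are willing to serve from one instance at the target utilization.
usable_qps = per_instance_qps * target_utilization

instances_needed = math.ceil(peak_qps / usable_qps)
print(f"Instances needed at {target_utilization:.0%} utilization: {instances_needed}")

# Keep at least one spare instance for failure tolerance (N+1).
print(f"With N+1 redundancy: {instances_needed + 1}")
```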
Running containers on bare metal. Bottleneck is network scaling. Cloud providers can autoscale relatively well; consider the cost of running in containers. Newly built instances would have empty DNS caches.
Bare metal vs. VMs vs. containers
Bottom line: guidance is needed on running DNS in containers, but the recommendation would be bare metal and/or appliances.
Use of network appliances for DNS load balancing in front of slow back-end servers.
If hardware is performance-limited (i.e., VMs), use network appliances to assist with caching; otherwise, bare metal servers give better performance.
Diversity of software: use two different vendors. Stay familiar with other vendors and have migration plans worked out in case of emergency.
With software vendors, take care with updates across the platform. Plan to deploy software updates slowly at first to watch for issues, and have rollback plans in place (see the staged-rollout sketch below). Ideas around operational best practices.
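As a sketch of the "deploy slowly, watch, roll back" practice, assuming a hypothetical fleet and placeholder upgrade/monitoring hooks (not any particular vendor's tooling):

```python
# Sketch of a staged (canary) rollout across a resolver fleet.
# The fleet, thresholds, and the upgrade/rollback/error_rate hooks are all
# hypothetical placeholders for your own deployment and monitoring tooling.
import time

fleet = [f"resolver-{i:02d}" for i in range(20)]   # hypothetical resolver hosts
stages = [0.05, 0.25, 0.50, 1.00]                  # fraction of the fleet per stage
ERROR_THRESHOLD = 0.01                             # e.g. tolerated SERVFAIL ratio
SOAK_SECONDS = 1800                                # watch each stage before continuing

def upgrade(host):
    print(f"upgrading {host}")                     # placeholder: apply the new version

def rollback(hosts):
    print(f"rolling back {hosts}")                 # placeholder: revert these hosts

def error_rate(hosts):
    return 0.0                                     # placeholder: ask your monitoring

upgraded = []
for fraction in stages:
    for host in fleet[:int(len(fleet) * fraction)]:
        if host not in upgraded:
            upgrade(host)
            upgraded.append(host)
    time.sleep(SOAK_SECONDS)                       # soak period: watch for issues
    if error_rate(upgraded) > ERROR_THRESHOLD:     # problem detected: stop and revert
        rollback(upgraded)
        break
```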
Also consider geography/topology: as close to the user base as possible.
Some candidate text on what we discussed today:
All DNS resolver software can run either on dedicated servers (rented or colocated), or in virtualized clouds, or in a combination of those. Every approach has pros and cons. Most of these are not specific to running DNS resolvers; however, some of them are.
Running DNS resolver instances as OS level daemons on bare metal hosts:
Pros:
Performance: Bare metal servers have direct access to the underlying hardware and can offer a superior performance/cost balance by avoiding the overhead associated with virtualization. Moreover, you have full control over the server's configuration, down to the hardware level, which can be beneficial for performance and cost optimization once you understand your typical workload during peak hours.
Data Security: Since you're in control of the physical servers, there's no risk of data leakage that can occur due to vulnerabilities in multi-tenant virtualization platforms, including CPU cache-based side-channel vulnerabilities. It could be argued that attacks targeting such issues are rare, and their impact on a DNS resolver service is low, but potential breaches may have significant privacy impact. It is advised to evaluate this against your organisation's risk model, or to discuss this with your information security compliance experts.
Predictability: Because there's no virtualization layer and no "noisy neighbours" on the host, the performance of your servers is more predictable.
Cons:
Cost of failure: If you pick a hardware configuration that is not optimal for the workload of your DNS resolver, you may need to upgrade or replace hardware components afterwards. Ways to reduce this risk include renting servers instead of buying them, carrying out load testing with data similar to production workloads, and providing limited beta access to the service before it fully enters the production phase.
Scalability: Scaling up with physical servers means acquiring or renting, installing, and configuring new hardware, which takes more time than provisioning new virtual servers in a cloud environment. Moreover, most cloud environments provide cluster autoscaling features, which can hardly be replicated on bare metal.
Maintenance: You'll be responsible for all server maintenance tasks, including hardware issues, which can require significant effort and specific expertise.
Redundancy: Setting up high availability and disaster recovery strategies can be more complex and time consuming compared to the cloud, where these features are often provided as value added products. See the Redundancy section for more details.
Running DNS resolver instances in containers in a public cloud:
Pros:
Scalability: Clouds excel at scaling applications. You can scale up and down rapidly based on load, which is important for a DNS resolver that needs to handle variable query loads. For regional or geographically distributed resolvers, daily periodicity is likely to be observed in every region where the resolver is deployed: the peak hour is likely to occur around 19:00 local time, and off-peak hours may begin at around 1:00-3:00 AM. In a situation like that, using cluster autoscaling features and tools, you can run fewer instances at night and more instances throughout the day, which may help optimize your cloud hosting costs (a scheduling sketch follows this pros/cons list).
Fault Tolerance and High Availability: Most clouds have built-in strategies, features, and products for handling node failures, which can increase your service's availability.
Deployment and Management: Cloud providers offer built-in methods to deploy and manage applications, which can simplify operations and reduce the likelihood of human errors if your infrastructure management department is already familiar with these tools.
Cost: While this largely depends on your specific usage, cloud services can sometimes be more cost-effective than managing your own physical servers, especially when you consider the total cost of ownership, including power, cooling, and maintenance.
Cons:
Performance: The virtualization layer of public clouds can impact performance. While this certainly could be mitigated through scaling the number of virtual hosts, the cost would also increase accordingly.
Complexity: Advanced cloud technologies are complex systems which come with a steep learning curve. Without prior experience, properly configuring and managing a cloud-based compute cluster can be challenging.
Cost Variability: While the cloud can be cheaper, it can also be more expensive if not properly managed. Costs can rise unexpectedly based on traffic. Make sure to set spending limits in the cloud control panel and to configure notifications for when these thresholds are about to be reached.
Multi-tenancy Risks: In a public cloud environment, the "noisy neighbor" problem could potentially affect your service's performance. Additionally, even though cloud providers take steps to isolate tenant environments, vulnerabilities could potentially expose sensitive data (see the previous section for a detailed explanation).
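To illustrate the diurnal scaling point from the pros list above, a minimal sketch with a hypothetical traffic profile and capacity figures; a real deployment would normally let the cloud provider's autoscaler follow load metrics rather than a fixed schedule:

```python
# Sketch: derive a target instance count from a typical diurnal traffic curve.
# The traffic profile and capacity figures below are hypothetical placeholders.
import math

# Approximate fraction of peak traffic by local hour: evening peak around 19:00,
# trough around 01:00-03:00, as described above.
traffic_fraction = {h: 0.3 for h in range(1, 6)}          # night trough
traffic_fraction.update({h: 0.7 for h in range(6, 17)})   # daytime plateau
traffic_fraction.update({h: 1.0 for h in range(17, 22)})  # evening peak
traffic_fraction.update({h: 0.5 for h in (22, 23, 0)})    # late evening

peak_qps = 400_000
usable_qps_per_instance = 30_000   # per-instance capacity at the chosen headroom
min_instances = 3                  # never scale below this, for redundancy

def desired_instances(hour: int) -> int:
    qps = peak_qps * traffic_fraction[hour]
    return max(min_instances, math.ceil(qps / usable_qps_per_instance))

for hour in (3, 12, 19):
    print(f"{hour:02d}:00 local -> {desired_instances(hour)} instances")
```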
Additional considerations
In today's environments, Kubernetes and Terraform are sometimes used as a substitute for cloud APIs when it comes to production services' management. When running a DNS resolver in a Kubernetes cluster on top of a public cloud environment, all the pros and cons of the public cloud apply; basically, Kubernetes becomes your public cloud provider. If you have significant prior experience running services in Kubernetes in production, you may successfully replicate your experience with the DNS resolver software. Otherwise, we would advise against Kubernetes in this case.
The only case we see for running a DNS resolver in a Kubernetes cluster on top of self-hosted dedicated servers is when you have significant hands-on experience with Kubernetes and it is natural for you to manage applications this way. Otherwise, running DNS resolver daemons in containers brings little, if any, benefit. Autoscaling features are not available to you in this case, and neither horizontal nor vertical pod autoscaling is of much use, because DNS resolver software typically scales within a host by itself just fine.
When designing a cluster of resolvers for autoscaling, keep in mind that newly spawned resolver machines need to populate their caches before they are fully useful. Your DNS resolver software may provide cache replication mechanisms. Otherwise, it is safe to overprovision clusters somewhat under heavy load and to discard the excess instances once the caches are populated and the average load per compute instance decreases.
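A minimal warm-up sketch, assuming the dnspython library, a placeholder instance address, and a hypothetical list of your most frequently queried names; the idea is to pre-query a freshly started instance before adding it to the load-balancer rotation:

```python
# Sketch: warm a freshly started resolver instance's cache before rotation.
# The instance address and name list are placeholders; requires dnspython.
import dns.exception
import dns.resolver

NEW_INSTANCE = "192.0.2.53"                     # placeholder address of the new resolver
POPULAR_NAMES = ["example.com", "example.net"]  # placeholder: your real top-N query list

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = [NEW_INSTANCE]
resolver.lifetime = 3.0                         # don't stall the warm-up on slow names

warmed = 0
for name in POPULAR_NAMES:
    for rdtype in ("A", "AAAA"):
        try:
            resolver.resolve(name, rdtype)      # populates the new instance's cache
            warmed += 1
        except dns.exception.DNSException:
            pass                                # best-effort: ignore failures

print(f"warm-up queries completed: {warmed}")
# Only after warm-up would the instance be added to the load-balancer pool.
```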
It is always advised to prefer environments your infrastructure management team is familiar with.
We should attempt to separate the network portions from the system portions where we can.
For example:
Care should be taken when allocating the IP networks used for public resolvers.
Both IPv4 and IPv6 MUST be deployed.
Egress filtering should be done, following BCP 38.
Additionally, route advertisements should be signed using RPKI.
For robustness, anycast should be used to announce network routes.
Not deploying some solution to announce networks from multiple locations will degrade system performance. (what is this?)
To add further resiliency, multiple network allocations from different RIRs should be considered, especially if you have global operations.