cloud-gov / product

Program-level artifacts, workflow and issues for cloud.gov
Creative Commons Zero v1.0 Universal

CDN broker's current DNS pattern is brittle #462

Closed: konklone closed this issue 4 years ago

konklone commented 7 years ago

Right now the CDN broker gives the user/customer a CNAME record, to point the intended hostname at the cloudfront.net hostname corresponding to the distribution cloud.gov creates for the user. So for example, Cloud.gov might ask the owner of cio.gov to CNAME pulse.cio.gov to ksdhfalksdhjf.cloudfront.net.

While an architecturally simple system, this has some significant maintainability issues and puts cloud.gov in a very dangerous situation.

A CloudFront distribution should be like any other cloud resource: managed as cattle, not pets, and recreatable at any time from an underlying specification. However, in the current system, if the CloudFront distribution is ever destroyed or recreated, the production application will be broken in a way that Cloud.gov can't repair without manual outside intervention by the owner of the upstream CNAME record.

I don't think this meets cloud.gov's overall design goals or philosophy, and it constrains how Cloud.gov will be able to manage CloudFront distributions in the future. Cloud.gov could find itself in a situation where it can't make changes to the underlying implementation without reaching out to every user and getting them to reach out to the necessary DNS owners (who are often/generally not the same office, or even agency, as the cloud.gov user).

Possible solution: intermediary DNS

Instead, cloud.gov could maintain an intermediary DNS record, with a one-to-one mapping to app names, created when a CloudFront distribution is created. This should be relatively straightforward, as there should be no situation where an app's name and state need to diverge from the DNS record.

So when I create an app called analytics, intended for https://analytics.usa.gov, and then I go through the CDN broker:

Then, Cloud.gov is free to alter its underlying implementation at will, or to recreate resources as necessary, as long as the underlying DNS for what analytics.dns.cloud.gov points to is managed at the same time.
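To make the proposal concrete, here is a toy model of the CNAME chain, not an actual broker implementation. The record layout, the "owner" labels, and the `d111example`/`d222example` CloudFront hostnames are assumptions made up for illustration; only `analytics.usa.gov` and `analytics.dns.cloud.gov` come from the description above.

```python
# Toy model of the intermediary-DNS proposal. The record layout, owner labels,
# and cloudfront hostnames are illustrative assumptions, not real values.

def resolve(records, name, max_hops=10):
    """Follow CNAME records until reaching a name that has no record."""
    for _ in range(max_hops):
        if name not in records:
            return name
        name = records[name]["target"]
    raise RuntimeError("CNAME chain too long")


def recreate_distribution(records, intermediary, new_hostname):
    """Repoint a cloud.gov-owned record at a freshly created distribution."""
    if records[intermediary]["owner"] != "cloud.gov":
        raise PermissionError("cloud.gov can only edit its own records")
    records[intermediary]["target"] = new_hostname


records = {
    # Set once by the external DNS owner; does not change afterwards.
    "analytics.usa.gov": {"owner": "customer", "target": "analytics.dns.cloud.gov"},
    # Managed by cloud.gov together with the distribution.
    "analytics.dns.cloud.gov": {"owner": "cloud.gov", "target": "d111example.cloudfront.net"},
}

before = resolve(records, "analytics.usa.gov")
recreate_distribution(records, "analytics.dns.cloud.gov", "d222example.cloudfront.net")
after = resolve(records, "analytics.usa.gov")
print(before, "->", after)
```

The point of the sketch is that the customer-owned record is never touched: only the record cloud.gov controls changes when the distribution is recreated.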

While I understand it's nice to not have to manage DNS records for customers, the current approach seems destined to create long-term frustration for the Cloud.gov team (who will be stuck with CloudFront distributions that must never go away) and for users (who will potentially need to go back to often-distant and unresponsive DNS owners to make record changes if underlying commodity resources change, or their HTTPS implementation strategy changes).

Alternate proposal: one constant intermediary subdomain somehow?

Some providers use a single intermediary subdomain for the CNAME. For example, Tumblr has users CNAME their domain to domains.tumblr.com. However, I think this is possible because Tumblr routes all requests to the same CDN endpoints and then serves certificates/sites smartly at that layer.

In our case, we're routing requests to unique CDN endpoints (CloudFront distributions) depending on the hostname the user has pointed to Cloud.gov, so it's not clear to me how we'd accomplish the same thing without a major implementation change.
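For contrast, a rough sketch of what the single-endpoint model might look like, assuming (as guessed above) that every hostname lands on the same edge and the site is chosen from the requested hostname. The `SHARED_ENDPOINT` name, the `sites` table, and the cert/origin labels are all hypothetical.

```python
# Hypothetical single-endpoint model: every customer hostname CNAMEs to one
# shared name, and the edge selects the site from the requested hostname.
SHARED_ENDPOINT = "domains.example.net"  # assumed, analogous to domains.tumblr.com

# Assumed per-site configuration held at the shared edge.
sites = {
    "analytics.usa.gov": {"cert": "cert-analytics", "origin": "analytics-app"},
    "pulse.cio.gov": {"cert": "cert-pulse", "origin": "pulse-app"},
}


def route(requested_host):
    """Return (certificate, origin) for the hostname the client asked for."""
    site = sites.get(requested_host.lower().rstrip("."))
    if site is None:
        raise LookupError(f"no site configured for {requested_host}")
    return site["cert"], site["origin"]


print(route("pulse.cio.gov"))
```

With one CloudFront distribution per hostname, there is no shared layer where a lookup like `route` could run, which is why this would be a major implementation change.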

Other proposals?

These are just the first approaches that came to mind. Are there other ways of addressing the same underlying goals?

konklone commented 7 years ago

Another idea is, for the intermediary CNAME, to reuse the existing domain name applications already get, e.g. analytics.fr.cloud.gov.

I understand that the subdomain is handled at the DNS level by a wildcard record, so it would still mean creating a specific DNS record that isn't created today. But it's a highly usable choice because it will make intuitive sense to the customer, and doesn't mean creating a separate zone just for DNS CNAMEs. You already are committing to managing the relationship between that name and the application in question, so it shouldn't create a lot of new issues around name management and sync.
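A simplified illustration of why a specific record would be needed: in DNS, a wildcard only answers for names that have no record of their own. The lookup below is a deliberately reduced model of that rule, and the zone contents (`router.fr.cloud.gov`, `d111example.cloudfront.net`) are hypothetical.

```python
def lookup(zone, name):
    """Return the record for name: exact match first, else a wildcard at the parent."""
    if name in zone:
        return zone[name]
    parent = name.split(".", 1)[1] if "." in name else ""
    return zone.get("*." + parent)


# Hypothetical zone: today only the wildcard exists.
zone = {"*.fr.cloud.gov": "router.fr.cloud.gov"}
wildcard_answer = lookup(zone, "analytics.fr.cloud.gov")

# Adding a specific record for one app overrides the wildcard for that name only.
zone["analytics.fr.cloud.gov"] = "d111example.cloudfront.net"
specific_answer = lookup(zone, "analytics.fr.cloud.gov")
other_answer = lookup(zone, "other.fr.cloud.gov")
print(wildcard_answer, specific_answer, other_answer)
```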

konklone commented 7 years ago

I'd like to at least get some thoughts from someone on the cloud.gov team on this. I think it has serious ramifications for cloud.gov's long-term maintainability and user experience. cc @jmcarp

jmcarp commented 7 years ago

Maybe we can have a quick chat with @dlapiduz about this sometime. For now, I'm just wondering when and why we might want to reprovision cloudfront distributions--agreed that flexibility is good, but I want to understand when we're likely to need it.

Also discussed this briefly in #cloud-gov-agent-q at https://gsa-tts.slack.com/archives/cloud-gov-agent-q/p1478023039000167.

konklone commented 7 years ago

I don't think you'd often want to reprovision cloudfront distributions in normal usage. It might come up as part of some comprehensive change cloud.gov wanted to make to how CDN termination is managed.

However, the way I'm most concerned that it could come up is simply human error. A cloud.gov admin could delete the distro -- or, perhaps more likely, a team member on the project might accidentally delete the distro. cloud.gov services are, by design, very easy to create and delete.

While this kind of error is possible for all cloud.gov services, the impact of deleting a cloudfront distro is anomalously severe. Because the partner agency's CNAME hardcodes the auto-generated cloudfront distro name into partner DNS, there is nothing that cloud.gov users or admins can do to fix this directly, no matter how much it's escalated internally. The only way to fix it would be to escalate it with the partner agency for emergency DNS changes, or to escalate it with Amazon for emergency manual reassignment of cloudfront hostnames. During either of these escalations, the application would experience complete downtime.
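As a rough illustration of the failure mode, the check below flags customer CNAMEs whose target no longer matches any live distribution. It works on plain in-memory data; the hostnames are invented, and in practice the two inputs would have to come from real DNS lookups and the provider's API, which this sketch does not attempt.

```python
def find_dangling(customer_cnames, live_distributions):
    """Return customer hostnames whose CNAME target is not a live distribution."""
    live = {h.lower().rstrip(".") for h in live_distributions}
    return sorted(
        host
        for host, target in customer_cnames.items()
        if target.lower().rstrip(".") not in live
    )


# Hypothetical state after one distribution was deleted by mistake.
customer_cnames = {
    "pulse.cio.gov": "d111example.cloudfront.net",
    "analytics.usa.gov": "d222example.cloudfront.net.",
}
live_distributions = ["d222example.cloudfront.net"]

dangling = find_dangling(customer_cnames, live_distributions)
print(dangling)
```

Every hostname such a check reports is one that cloud.gov cannot repair on its own under the current pattern, since the stale target lives in someone else's DNS.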

I was reminded to update this thread because I've been waiting for 2 months now for an agency to set the CNAME records we (@gbinal and I) sent them to migrate our applications. This isn't actually the partner agency -- the domain in question (cio.gov) is owned by our partner but has its DNS managed by a separate office. We've been aggressively poking and escalating, and no action has been forthcoming. I am not convinced that even downtime would create urgency on the side of the DNS owner, especially because the DNS owner is not the partner agency.

This makes me pretty terrified that if there were ever any sort of cloud.gov error, on the user or admin side, the downtime would be sustained and severe. If we're not going to ask people to delegate their DNS using NS records, we should at least still give ourselves the ability to resolve any issues with our internal infrastructure internally.

konklone commented 7 years ago

The linked conversation emphasizes that "in the real world DNS changes that happen very infrequently are not a huge problem". However, I'm not expressing concern about DNS changes causing disruption -- I'm expressing concern about unexpected changes to cloud.gov infrastructure causing disruption (and thus necessitating a DNS change).

cc @dlapiduz

wslack commented 7 years ago

I'll chime in on this issue. As Federalist's owner, I'm a little terrified of the CDN broker because my customers don't control their own DNS. They have to wait for a ticketing system, and having to destroy a CDN route to make a new one requires downtime.

I feel like absent cloud.gov changes, Federalist will need to make an NGINX proxy that can be manually loaded with a cert and used while CDN config is being changed.

konklone commented 7 years ago

> I feel like absent cloud.gov changes, Federalist will need to make an NGINX proxy that can be manually loaded with a cert and used while CDN config is being changed.

And that would have significant downsides associated with it, from an operations and security standpoint (in terms of managing key material), and would be difficult to pull off in anything approaching an automated fashion.

wslack commented 7 years ago

Well, given that CloudFront allows a given domain name to be attached to only one distribution at a time, it seems like we need some sort of intermediate CDN-lite service and the ability to generate a certificate at will.

Note - I don't think cloud.gov needs to provide these. They might be better as 18F infra priorities.

konklone commented 7 years ago

Has the cloud.gov team been able to decide where this fits into the project's priorities?

I remain deeply concerned about the continued possibility for emergency downtime situations, caused by further inevitable errors around removing what are supposed to be fungible and recreatable AWS resources.

The more CNAMEs with specific AWS resource names in them that we hand out to external DNS teams to set, the more dependencies we create that are outside of our control and whose disruption would result in immediate downtime requiring human intervention to repair.

mogul commented 7 years ago

We talked about this in our quarterly planning meeting last week. We're going to try to fit in work on it around other high-level objectives this quarter, but it's lower priority and not aligned to other objectives we set.