DoSomething / infrastructure

🐄 DoSomething.org's infrastructure, managed by Terraform.
MIT License

Simplify Fastly config & track changes in code. #39

Closed: DFurnes closed this issue 5 years ago.

DFurnes commented 6 years ago

Type: request (not a bug)

Background

From Matt's write-up in the Q3-Q4 technology memo:

We’ve built up a pretty large catalog of Fastly services, in many cases using a separate service per-environment and per-app. This increases the number of places a change must be made, and the chance that things might get out of sync between environments.

We’ve also had trouble with discoverability and the review process around updating VCLs for applications, as it’s not always clear to application developers when a change has been made or how it will impact their application. We should investigate simplifying our systems to rely on fewer separate configs (re: Fastly ALTITUDE talks from USA Today & Conde Nast), and investigate better processes or tooling around making changes more visible.

Current Behavior

We have 19(!!) Fastly properties, of which DoSomething.org (Phoenix & Ashes), API/Northstar, Rogue, GraphQL, and CatchAll (some redirects) receive the most use.

We also have 4 services for different voting app instances, a search property for Solr, Ashes Staging & Thor, and a property with vanity redirects for two campaigns. The remaining 7 don't seem to be receiving any traffic and can probably be deleted.

Desired Behavior

It'd be great to consolidate these into fewer properties so we can roll out changes across the board more easily (e.g. gzipping or geolocation headers). We should also have separate QA & production configs for our other services (and ideally, an easy way to promote changes from QA to prod).

Finally, it's not always clear what changes are made to our configs and why (e.g. if we forget to drop a note in #deploys), or whether a draft config is safe to push to production or should be discarded (which should be aided by Morgan's new discussions for the DoSomething.org and API/Northstar properties).

Suggested Solution

An easy first step is to audit our existing configs and clean them up! This includes removing unnecessary origins, conditions, custom VCLs, etc. We can also probably delete a bunch of those unused properties, or consolidate ones that are infrequently changed (like the voting apps).

I'd also like to experiment with configuring & tracking our Fastly config (and other infrastructure!) in code with Terraform. This builds off some of the work we've done with CloudFormation on Bertly, but lets us configure more things (like Fastly, but also AWS, Heroku, DNS, Papertrail, etc.) in one place.
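Here is a minimal sketch of what "Fastly config in code" can look like. It assumes the 2018-era Fastly provider's `fastly_service_v1` resource (later provider releases rename it `fastly_service_vcl`). The service name, domain and backend are hypothetical placeholders, not an existing DoSomething property:

```hcl
# Hypothetical example: every name and hostname here is a placeholder.
resource "fastly_service_v1" "example" {
  name = "Terraform: Example"

  # Each domain block attaches a hostname to this service.
  domain {
    name = "www.example.org"
  }

  # Origin server that Fastly fetches content from.
  backend {
    name    = "example-origin"
    address = "origin.example.org"
    port    = 443
    use_ssl = true
  }

  # Allow `terraform destroy` to remove the service even while it is active.
  force_destroy = true
}
```

With the service described in a file, a change goes through code review like any other, and `terraform plan` shows the exact diff before anything is applied.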

Relevant Screenshots + Links

N/A

DFurnes commented 6 years ago

I've started experimenting with this in the DoSomething/infrastructure repository, starting with a new property to combine the existing CatchAll and Vanity geo-redirects properties.

DFurnes commented 6 years ago

I've deactivated the CatchAll and Vanity geo-redirects properties now that they've been consolidated into the "Terraform: Miscellaneous" property defined in code.

DFurnes commented 6 years ago

I'm going to deactivate a few more properties that aren't used anymore:

DFurnes commented 6 years ago

I'm consolidating the voting application services next. These feel like perfect test cases, since we can consolidate 4 services into 2 (one prod, one QA), and they're not actively receiving any traffic, so they can safely be taken offline for a few moments to make the swap.
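Consolidating several per-app services into one mostly means listing multiple `domain` blocks on a single service and choosing a backend by `Host` header. Below is a hedged sketch for two hypothetical apps, assuming the 2018-era `fastly_service_v1` resource. All names and hostnames are placeholders, not the real voting-app configuration:

```hcl
resource "fastly_service_v1" "voting_app" {
  name = "Terraform: Voting App (example)"

  # One service, several hostnames.
  domain {
    name = "app-one.example.org"
  }

  domain {
    name = "app-two.example.org"
  }

  # REQUEST conditions are checked against the incoming request.
  condition {
    name      = "is-app-one"
    type      = "REQUEST"
    statement = "req.http.host == \"app-one.example.org\""
  }

  condition {
    name      = "is-app-two"
    type      = "REQUEST"
    statement = "req.http.host == \"app-two.example.org\""
  }

  # Each backend is selected only when its request condition matches.
  backend {
    name              = "app-one-origin"
    address           = "app-one-origin.example.org"
    port              = 443
    use_ssl           = true
    request_condition = "is-app-one"
  }

  backend {
    name              = "app-two-origin"
    address           = "app-two-origin.example.org"
    port              = 443
    use_ssl           = true
    request_condition = "is-app-two"
  }

  force_destroy = true
}
```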

DFurnes commented 6 years ago

We've hit our limit of 20 services, so I've deleted the Avatar (deactivated above) & Four-Legged Finishers services. Four-Legged Finishers is no longer being hosted and returns a 503 when you visit the URL.

DFurnes commented 6 years ago

Took a brief detour to fix AGG's server, which needed a MySQL restart & re-creating a missing "Winner Category" in the DB so the winners would render. I've deactivated the Celebs Gone Good and Athletes Gone Good services and consolidated them in a new Terraform: Voting App service.

DFurnes commented 6 years ago

Since we're getting close to our service limit, I'm going to delete Celebs Gone Good, Athletes Gone Good, and Cats Gone Good (QA) now that they've been consolidated in the two new services.

DFurnes commented 6 years ago

Also deactivating Whitelabel, since it turns out we aren't using Fastly for that application (the whitelabel.dosomething.org domain points directly at an IP address).

sheyd commented 6 years ago

[GIF from Giphy]

DFurnes commented 6 years ago

I added the GraphQL Heroku apps & Fastly property to Terraform, routing from new "Terraform: DoSomething" and "Terraform: DoSomething (QA)" Fastly properties. Since the applications are included in our Terraform config, this means we can hook up the wires between Fastly & Heroku automatically too which is pretty cool. 🔌
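The "wiring" between Heroku and Fastly works because Terraform can pass one resource's attributes into another. The minimal sketch below assumes the 2018-era provider resource names (`fastly_service_v1`, and the Heroku provider's `app` argument, which newer releases replace with `app_id`). The app and domain names are hypothetical:

```hcl
resource "heroku_app" "graphql" {
  name   = "example-graphql" # hypothetical app name
  region = "us"
}

# Register the public hostname with Heroku so its router accepts requests for it.
resource "heroku_domain" "graphql" {
  app      = heroku_app.graphql.name
  hostname = "graphql.example.org" # hypothetical
}

resource "fastly_service_v1" "dosomething" {
  name = "Terraform: DoSomething (example)"

  domain {
    name = heroku_domain.graphql.hostname
  }

  backend {
    name = "graphql-heroku"
    # heroku_hostname is the app's herokuapp.com hostname, so the backend
    # address is derived from the app resource instead of typed by hand.
    address = heroku_app.graphql.heroku_hostname
    port    = 443
    use_ssl = true
  }

  force_destroy = true
}
```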

I imported the QA & production apps using terraform import (after some digging around Heroku's Platform API to get the right IDs), then destroyed & re-created the development app to test a fresh provision. This worked pretty darn seamlessly: run terraform apply, set secrets (OAuth client details, encryption key, etc.), and deploy from master. Nice!

DFurnes commented 6 years ago

I've deleted the old GraphQL property as well now that it's been migrated to Terraform.

DFurnes commented 6 years ago

Next up, Ashes staging!

I've added the EC2 instance in https://github.com/DoSomething/infrastructure/commit/4c55d1c5276f4abfb467e7fb2a905744d6cc34c9 using Terraform's AWS provider with a read-only IAM user… this allows us to grab details (like IP addresses) from AWS but still be extra careful about making changes to our root organization. Then it's just a matter of adding the new backend to our "DoSomething (QA)" property in https://github.com/DoSomething/infrastructure/commit/66b15ecbd81e81e1dae6a3a2b036bfa897bf64bf
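A read-only IAM user is enough when Terraform only *reads* an existing instance, which is what a data source does. One way to express this is shown below. It assumes a hypothetical region and instance ID, since the real values aren't given here:

```hcl
provider "aws" {
  region = "us-east-1" # hypothetical region
}

# Data sources only read: Terraform never creates, modifies, or destroys this instance.
data "aws_instance" "ashes_staging" {
  instance_id = "i-0123456789abcdef0" # hypothetical placeholder ID
}

# The looked-up IP can then be used elsewhere, e.g. as a Fastly backend address.
output "ashes_staging_ip" {
  value = data.aws_instance.ashes_staging.public_ip
}
```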

I also added a synthetic response for the robots.txt file in https://github.com/DoSomething/infrastructure/commit/48e18d019bf777d7c7601c2d445be4c153f176b4.
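A synthetic response answers a request at the edge without contacting any backend. Below is a sketch using the 2018-era `fastly_service_v1` resource. The hostnames are hypothetical, and so is the robots.txt body, which assumes the goal is to keep crawlers off a QA property:

```hcl
resource "fastly_service_v1" "dosomething_qa" {
  name = "Terraform: DoSomething (QA) (example)"

  domain {
    name = "qa.example.org" # hypothetical
  }

  backend {
    name    = "example-origin"
    address = "origin.example.org" # hypothetical
    port    = 443
    use_ssl = true
  }

  # Match only requests for /robots.txt (query strings are ignored by req.url.path).
  condition {
    name      = "is-robots-txt"
    type      = "REQUEST"
    statement = "req.url.path == \"/robots.txt\""
  }

  # Serve the file directly from Fastly when the condition matches.
  response_object {
    name              = "robots.txt"
    status            = 200
    response          = "OK"
    content_type      = "text/plain"
    content           = "User-agent: *\nDisallow: /\n" # hypothetical body
    request_condition = "is-robots-txt"
  }

  force_destroy = true
}
```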

DFurnes commented 6 years ago

I've configured a remote backend that stores state in S3, and locks when changes are being made using a DynamoDB table. This uses the same IAM user created above, which has been given read/write access to those two resources. (See https://github.com/DoSomething/infrastructure/commit/249a445503ebe73099a0897a540dda7b91904b6a.)
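The S3 + DynamoDB setup described above corresponds to Terraform's built-in `s3` backend. Below is a sketch using the bucket name from this thread. The state key, region and table name are hypothetical:

```hcl
terraform {
  backend "s3" {
    bucket         = "dosomething-infrastructure-state"
    key            = "terraform.tfstate"    # hypothetical state path
    region         = "us-east-1"            # hypothetical region
    dynamodb_table = "terraform-state-lock" # hypothetical table name
    encrypt        = true                   # server-side encryption for the state object
  }
}
```

The S3 backend expects the lock table to use a string partition key named `LockID`. While one `apply` holds the lock, a second `apply` fails to acquire it (or waits, if run with `-lock-timeout`) instead of writing conflicting state.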

I've also added a bucket policy to dosomething-infrastructure-state so only that IAM user can read from it (since it may contain sensitive configuration details). Finally, I put credentials into a "Terraform credentials" note in LastPass in the Shared-DevOps folder.

DFurnes commented 6 years ago

I've added a response setting to force HTTPS on QA in https://github.com/DoSomething/infrastructure/commit/5a879c9db706f5accecf16428fbf69ce25ca8bad, and then on production (just GraphQL for now) in https://github.com/DoSomething/infrastructure/commit/f6cea3cdd80c1e1172cdf822abc44745eaa21c16. I also took the opportunity to test state locking by kicking off a parallel terraform apply in another tab… neat! 👌
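In Fastly's own configuration, HTTP-to-HTTPS redirection is exposed as a "Force TLS" option on request settings. Below is a sketch of that approach using the 2018-era `fastly_service_v1` resource with hypothetical names. The linked commits may configure this differently:

```hcl
resource "fastly_service_v1" "dosomething_qa" {
  name = "Terraform: DoSomething (QA) (example)"

  domain {
    name = "qa.example.org" # hypothetical
  }

  backend {
    name    = "example-origin"
    address = "origin.example.org" # hypothetical
    port    = 443
    use_ssl = true
  }

  # Redirect plain-HTTP requests to their HTTPS equivalent.
  request_setting {
    name      = "force-https"
    force_ssl = true
  }

  force_destroy = true
}
```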

[Screenshot, 2018-08-28, 3:15 PM]
DFurnes commented 6 years ago

I took another pass at the high-level documentation in the README & root module in case anyone wants to pick this up while I'm out on vacation next week (https://github.com/DoSomething/infrastructure/commit/9ea3ecac5bfda2f90fe272290cb164757d6c9625).

DFurnes commented 6 years ago

I've swapped Celebs Gone Good & Athletes Gone Good to point to static sites in #442 (using Terraform to provision the new buckets, I couldn't help myself!). Since the voting website wasn't functional, I just redirected Four Legged Finishers to the corresponding Ashes campaign.

DFurnes commented 6 years ago

I've deleted the staging.dosomething.org, Static-Assets, Subscriptions, and Whitelabel properties since we haven't seen any issues with them marked as inactive for the past week.

I've also stopped the Static-Assets i-245a15b4 t2.medium EC2 instance, since it's now unused.

DFurnes commented 6 years ago

I've enabled gzip on our voting app, QA, and production Fastly properties. Note that we have to set the extensions & content types explicitly (using the suggested defaults from Fastly's documentation): Fastly's admin interface falls back to defaults when these fields are left blank, but the Terraform provider doesn't do that (yet).
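Because the provider doesn't fill in defaults, the `gzip` block has to list both extensions and content types. Below is a sketch using the 2018-era `fastly_service_v1` resource. The lists are an abbreviated, illustrative subset rather than Fastly's full recommended defaults, and the hostnames are hypothetical:

```hcl
resource "fastly_service_v1" "voting_app" {
  name = "Terraform: Voting App (example)"

  domain {
    name = "vote.example.org" # hypothetical
  }

  backend {
    name    = "example-origin"
    address = "origin.example.org" # hypothetical
    port    = 443
    use_ssl = true
  }

  # Both lists must be spelled out; leaving them empty compresses nothing.
  gzip {
    name       = "default-gzip"
    extensions = ["css", "js", "html", "json", "svg", "ico"]
    content_types = [
      "text/html",
      "text/css",
      "text/plain",
      "application/javascript",
      "application/json",
      "image/svg+xml",
    ]
  }

  force_destroy = true
}
```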

DFurnes commented 5 years ago

Alright, picking this work back up post-vacation! Next up, I'm going to migrate Northstar & Rogue's QA instances over to the new "Terraform: DoSomething (QA)" property. This will allow us to test out changes for these applications before deploying them to production. 👌

DFurnes commented 5 years ago

I added Northstar & Rogue's development/QA apps in https://github.com/DoSomething/infrastructure/commit/f6c0e62d99fa3f257600c28e84827c87f6ec6b38. Since these already existed, the Terraform resources also had to be paired up to the existing Heroku resources, as shown below for Northstar:

# import dosomething/northstar pipeline
curl -n "https://api.heroku.com/pipelines/northstar" -H "Accept: application/vnd.heroku+json; version=3"
terraform import module.shared.heroku_pipeline.northstar 63d72e97-fe04-4053-ba45-7e5c9bf00397

# dosomething-northstar-dev app & formations (e.g. dynos/processes):
terraform import module.dosomething-qa.module.northstar.heroku_app.northstar-dev dosomething-northstar-dev
terraform import module.dosomething-qa.module.northstar.heroku_formation.northstar-dev dosomething-northstar-dev:web
terraform import module.dosomething-qa.module.northstar.heroku_formation.northstar-dev-queue dosomething-northstar-dev:queue

# dosomething-northstar-dev domain:
curl -n https://api.heroku.com/apps/dosomething-northstar-dev/domains -H "Accept: application/vnd.heroku+json; version=3"
terraform import module.dosomething-qa.module.northstar.heroku_domain.northstar-dev dosomething-northstar-dev:28b829c9-2fee-447e-a3a8-bdd0a289dc86

# dosomething-northstar-dev log drain:
curl -n https://api.heroku.com/apps/dosomething-northstar-dev/log-drains -H "Accept: application/vnd.heroku+json; version=3"
terraform import module.dosomething-qa.module.northstar.heroku_drain.northstar-dev dosomething-northstar-dev:3b975105-dd49-4942-9947-8fa681d3e741

# dosomething-northstar-qa app & formations (e.g. dynos/processes):
terraform import module.dosomething-qa.module.northstar.heroku_app.northstar-qa dosomething-northstar-qa
terraform import module.dosomething-qa.module.northstar.heroku_formation.northstar-qa dosomething-northstar-qa:web
terraform import module.dosomething-qa.module.northstar.heroku_formation.northstar-qa-queue dosomething-northstar-qa:queue

# dosomething-northstar-qa domain:
curl -n https://api.heroku.com/apps/dosomething-northstar-qa/domains -H "Accept: application/vnd.heroku+json; version=3"
terraform import module.dosomething-qa.module.northstar.heroku_domain.northstar-qa dosomething-northstar-qa:2fb9c217-44bd-44ea-b832-a12c1c36708c

# dosomething-northstar-qa log drain:
curl -n https://api.heroku.com/apps/dosomething-northstar-qa/log-drains -H "Accept: application/vnd.heroku+json; version=3"
terraform import module.dosomething-qa.module.northstar.heroku_drain.northstar-qa dosomething-northstar-qa:122bd38e-180f-4f3d-8121-f6d72e212243

# ...and finally add each app to pipeline stages (using the pipeline ID from above):
curl -n https://api.heroku.com/pipelines/63d72e97-fe04-4053-ba45-7e5c9bf00397/pipeline-couplings -H "Accept: application/vnd.heroku+json; version=3"
terraform import module.dosomething-qa.module.northstar.heroku_pipeline_coupling.northstar-dev 8220db9c-2c0c-44fa-b9d9-8bdcc39c396e
terraform import module.dosomething-qa.module.northstar.heroku_pipeline_coupling.northstar-qa ca3b6d1c-ff3f-44a5-bd8a-15b2ddde30c1
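Each `terraform import` address above has to point at a resource block that already exists in the configuration. Below is a hedged sketch of the dev-app half of a `northstar` module, reusing the resource names from those commands. The region, dyno sizes and quantities, domain, drain URL and pipeline stage are hypothetical. Argument names follow the 2018-era Heroku provider (newer releases use `app_id`):

```hcl
# Hypothetical contents of the northstar module (module.dosomething-qa.module.northstar).
variable "pipeline" {
  type        = string
  description = "ID of the dosomething/northstar pipeline, passed in from module.shared"
}

resource "heroku_app" "northstar-dev" {
  name   = "dosomething-northstar-dev"
  region = "us" # hypothetical
}

# One formation per process type: "web" and "queue", matching the imports above.
resource "heroku_formation" "northstar-dev" {
  app      = heroku_app.northstar-dev.name
  type     = "web"
  quantity = 1             # hypothetical
  size     = "standard-1x" # hypothetical
}

resource "heroku_formation" "northstar-dev-queue" {
  app      = heroku_app.northstar-dev.name
  type     = "queue"
  quantity = 1             # hypothetical
  size     = "standard-1x" # hypothetical
}

resource "heroku_domain" "northstar-dev" {
  app      = heroku_app.northstar-dev.name
  hostname = "northstar-dev.example.org" # hypothetical
}

resource "heroku_drain" "northstar-dev" {
  app = heroku_app.northstar-dev.name
  url = "syslog+tls://logs.example.com:12345" # hypothetical
}

resource "heroku_pipeline_coupling" "northstar-dev" {
  app      = heroku_app.northstar-dev.name
  pipeline = var.pipeline
  stage    = "development" # hypothetical
}
```

After importing, running `terraform plan` is a quick check that the configuration matches what already exists: any remaining diff points at an argument that still needs adjusting.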
DFurnes commented 5 years ago

I swapped Northstar & Rogue's development & QA apps over to the new Terraform QA Fastly property in https://github.com/DoSomething/infrastructure/commit/3fb6a53ca0b6a80abab70bffbb65b75da7032fca and https://github.com/DoSomething/infrastructure/commit/0d44a893f02c6d7c9157341ae7715be512889590. One surprise: we hit Fastly's default limit of 5 backends (origins) per service. Luckily, a quick support request sorted this out, and our new limit is 20 backends per service.

DFurnes commented 5 years ago

Shout out to @weerd for contributing to our redirect property via DoSomething/infrastructure#2! 🏆

DFurnes commented 5 years ago

Migrated over GDPR EEA redirects to the new production & QA properties in https://github.com/DoSomething/infrastructure/commit/6e5b7a16bd23efc4d3c57f3363e1ff230a09a5d2 and https://github.com/DoSomething/infrastructure/commit/b4a1d2eb90b14765aa2c5c30c23887433b1fd679.

DFurnes commented 5 years ago

I've renamed the "Terraform: Miscellaneous" property to "Terraform: Domain Redirects" in https://github.com/DoSomething/infrastructure/commit/a3c4e86813337efc5342aa2d6a67d719ca84387c to make its purpose clearer.

DFurnes commented 5 years ago

I've added Northstar's production Heroku app to Terraform in https://github.com/DoSomething/infrastructure/commit/e918810dd2ace6bc6bf48ba8f501a3c55f0ececb. Pending DoSomething/northstar#797, I'm planning on moving Northstar from the "API/Northstar" property to the consolidated "Terraform: DoSomething" property on Monday.

DFurnes commented 5 years ago

I added Papertrail logging to the QA property in https://github.com/DoSomething/infrastructure/commit/a548619a55943d4a9a93bd0ddd886f19b996e6af, which gets us back to feature parity with the manually-configured properties. Since we haven't seen any issues, I'm going to swap over Northstar production next.
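Fastly can stream request logs to Papertrail through a logging endpoint on the service. Below is a sketch using the 2018-era `fastly_service_v1` resource. The Papertrail host and port are hypothetical placeholders (each Papertrail account has its own log destination):

```hcl
resource "fastly_service_v1" "dosomething_qa" {
  name = "Terraform: DoSomething (QA) (example)"

  domain {
    name = "qa.example.org" # hypothetical
  }

  backend {
    name    = "example-origin"
    address = "origin.example.org" # hypothetical
    port    = 443
    use_ssl = true
  }

  # Send Fastly's request logs to a Papertrail log destination.
  logging_papertrail {
    name    = "papertrail"
    address = "logs.papertrailapp.com" # hypothetical host
    port    = 12345                    # hypothetical port
  }

  force_destroy = true
}
```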

DFurnes commented 5 years ago

And that's done in https://github.com/DoSomething/infrastructure/commit/09c5c2aa0cd270122723af4ce7273392c1a83d8d! Everything went smoothly with the swap. I'll leave the old "API/Northstar" property inactive for a while before cleaning it out.

DFurnes commented 5 years ago

Added Rogue production resources in https://github.com/DoSomething/infrastructure/commit/e43cc2e438d7778256807951acaf8b60e1eda5e1.

DFurnes commented 5 years ago

And moved Rogue production over to the "Terraform: DoSomething" service in https://github.com/DoSomething/infrastructure/commit/e9b3ad0b4fd69691a5d2d3226d195644efb9c9a0 (and executed the change here).

Note: I had to manually apply the FASTLY_SERVICE_ID environment variable to the Heroku app, because applying it with Terraform introduces a cyclic dependency (since the Fastly service relies on Heroku outputs like app name & domain, and Heroku would then rely on the Fastly service's ID output… kaboom!) Looks like something we may be able to solve with terraform-provider-heroku#47 one day.
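The cycle comes from putting the Fastly service ID into the same `heroku_app` resource whose outputs the Fastly service consumes. Later releases of the Heroku provider add a separate `heroku_app_config_association` resource, which breaks the loop by setting config vars in a third resource. Whether that is what terraform-provider-heroku#47 tracks isn't stated here. Below is a hedged sketch using current resource names (`fastly_service_vcl`, `app_id`), with hypothetical app and domain names:

```hcl
resource "heroku_app" "rogue" {
  name   = "example-rogue" # hypothetical
  region = "us"
  # No FASTLY_SERVICE_ID here: that would make the app depend on the service.
}

resource "fastly_service_vcl" "dosomething" {
  name = "Terraform: DoSomething (example)"

  domain {
    name = "rogue.example.org" # hypothetical
  }

  # The service depends on the app...
  backend {
    name    = "rogue-heroku"
    address = heroku_app.rogue.heroku_hostname
    port    = 443
    use_ssl = true
  }

  force_destroy = true
}

# ...and this resource depends on both, but nothing depends on it, so there is no cycle.
resource "heroku_app_config_association" "rogue" {
  app_id = heroku_app.rogue.id

  vars = {
    FASTLY_SERVICE_ID = fastly_service_vcl.dosomething.id
  }
}
```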

DFurnes commented 5 years ago

Alright, this feels like a good place to close this issue out:

The two big remaining properties are also likely our hairiest – Phoenix/Ashes QA & production. I'm going to split the work of cleaning these up into a separate ticket since they're uniquely complicated right now, and I'd like to make some more drastic changes there. I'm going to leave the "Search" Solr property untouched since we'll likely be retiring that within the quarter.

sheyd commented 5 years ago

:1st_place_medal: :100: :tada: :taco: :rocket: :shipit: :balloon: :heart: :cake: