datacite / datacite

Issues and milestones for the DataCite organization
https://www.datacite.org
MIT License

Research terraform upgrade process #1956

Closed richardhallett closed 11 months ago

richardhallett commented 11 months ago

Research how best to handle the upgrade process. There is existing documentation, along with some thoughts on how this can be achieved. This issue covers how best to proceed and how to break the work down into manageable chunks.

richardhallett commented 11 months ago

General thoughts:

Other questions with thoughts

  1. Why are we using terraform cloud? What do we gain?
    • We can organise modules within the private registry, including which public ones we recommend, but I'm not sure why that's useful, as everything we have should be public.
    • State management is the big one, along with the automated processes associated with terraform cloud.
  2. Where should we be managing AWS Route 53 DNS records?
    1. Per-service configuration, as some services may not require DNS.
  3. How should we manage new versions of software releases with terraform?
    • This is actually configuration management: you're changing the software version that is used, but the actual infrastructure stays the same.
    • Task definitions shouldn't be controlled and configured in terraform.
    • Task definitions in ECS are immutable, so we should still always create a new task definition.
    • Instead, consider using the regular AWS CLI to update the task definition with the new value. There is also the possibility of using a tool like https://github.com/fabfuel/ecs-deploy
    • Set up a dummy task definition within terraform and ignore all changes to it via lifecycle ignore_changes.
    • Put the task definitions with the application code, set per environment as required, then let CI (i.e. GitHub Actions) do the deployment to AWS.
    • https://docs.github.com/en/actions/deployment/deploying-to-your-cloud-provider/deploying-to-amazon-elastic-container-service
    • How do you roll back?
    • Rolling back would mean redeploying the previous GitHub tagged version, because that would then update the task definition to match.
    • How do we handle the environment variables set for a task configuration?
    • The application-specific ones should be defined in the CI tool; this is configuration management again. So we'd remove them from terraform.
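
A minimal sketch of the dummy task definition idea from point 3, assuming the EC2 launch type and an ECS cluster managed elsewhere. Every name here (`example`, `placeholder`, `cluster_arn`, the `busybox` image) is hypothetical and not taken from the existing configuration. It only illustrates how `lifecycle ignore_changes` could stop terraform from reverting task definition revisions that CI registers.

```hcl
# Hypothetical input: ARN of an existing ECS cluster (assumed to be managed elsewhere).
variable "cluster_arn" {
  type = string
}

# Placeholder task definition. It exists only so the service can be created;
# real revisions would be registered by CI from the application repository.
resource "aws_ecs_task_definition" "placeholder" {
  family = "example"

  container_definitions = jsonencode([
    {
      name      = "example"
      image     = "busybox" # dummy image, replaced by the first real deployment
      memory    = 128
      essential = true
    }
  ])

  lifecycle {
    # Ignore every attribute so terraform never tries to reset this resource.
    ignore_changes = all
  }
}

resource "aws_ecs_service" "example" {
  name            = "example"
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.placeholder.arn
  desired_count   = 1

  lifecycle {
    # CI points the service at new task definition revisions; terraform should
    # not move it back to the placeholder on the next apply.
    ignore_changes = [task_definition]
  }
}
```

With this split, terraform would still own the service, cluster wiring and networking, while the image version and application environment variables would travel with the task definition deployed from CI.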
richardhallett commented 11 months ago

Upgrade Plan

Phase 1: Auxiliary terraform configuration

  1. Global

    • Cognito - Not used; suggest removing the AWS resources and deleting the terraform workspace.
    • DNS - Various DNS entries and zones; needs updating to the latest syntax and syncing state.
    • Github - Remove. This was used for managing labels, but we're now quite out of sync. In future this could perhaps be replaced with dedicated tooling like https://github.com/Financial-Times/github-label-sync/
    • Google - Just DNS records; needs syntax updates and syncing.
    • iam-ng - Old IAM config; remove.
  2. Prod-EU-West VPC

    • Upgrade from 0.12
    • VPC Config - Presently configured via datacite/ops and is not in state sync. Instead just update any references and syntax
    • WAF - Update terraform syntax, remove unused
    • IAM Config - Presently configured via datacite/ops and is not in state sync. Instead just update any references and syntax
  3. Stage VPC

    • Upgrade from 0.11
    • ELB - Upgrading and moving to use standard module
  4. Test VPC

    • Upgrade from 0.12
    • ELB - Upgrading and moving to use standard module
  5. Cleanup old dev setup from structure

    • Destroy and remove the whole dev workspace from both the config and terraform cloud. This was an experiment in using kubernetes for a dev cluster, but it was never released and is now stale. The idea could be revisited in the future.

Phase 2: Upgrade all application services

These services range from 0.11 to 0.12. Each needs to be upgraded in turn, ensuring that its state matches what is deployed.

  1. Production services

    • akita
    • analytics
    • analytics-api
    • api
    • assets
    • bastion
    • blog
    • check-indexed-dois
    • check-links
    • cheetoh
    • citation
    • client-api
    • content-negotiation
    • crossref-agent
    • crossref-orcid-agent
    • crossref-related-agent
    • datadog
    • datafiles-generator - This is running but I think it is unused and wasting resources, upgrade but disable.
    • delete-test-dois
    • doi
    • federation
    • ftp - To delete
    • homepage
    • http-redirect
    • levriero
    • mds
    • message-queue
    • metrics-api
    • oai
    • pidcheck
    • profiles
    • raw-resolution-logs
    • re3data
    • repository-finder
    • salesforce-api
    • schema
    • search
    • sitemaps-generator
    • stats-portal
    • store-crawler-results
    • strapi - This is cms.datacite.org - It only exists to update a list of service providers on homepage via url https://cms.datacite.org/service-providers

  2. Stage services

    • akita
    • analytics
    • analytics-api
    • api
    • assets
    • bastion
    • blog - Needs tidying to remove old s3 buckets
    • check-indexed-dois
    • check-links
    • cheetoh
    • clickhouse-ebs
    • client-api
    • content-negotiation
    • datafiles-generator - This is running but I think it is unused and wasting resources, upgrade but disable.
    • delete-test-dois
    • demorepo
    • doi
    • federation
    • handle
    • homepage - Needs tidying to remove old s3 buckets
    • http-redirect
    • levriero
    • mds
    • message-queue
    • metrics-api
    • oai
    • pidcheck
    • profiles
    • re3data
    • repository-finder
    • resolution-logs-pipeline
    • salesforce-api
    • schema
    • search
    • sitemaps-generator
    • slides
    • stats-portal
    • store-crawler-results
    • strapi

  3. Test services

    • assets
    • cheetoh
    • client-api
    • doi
    • handle
    • http-redirect
    • mds
    • message-queue
    • metrics-api

Phase 3: Upgrade other hosted services.

  1. Openingscience - Public site book.openingscience.org - Needs upgrading from 0.11 to latest
  2. Pidapalooza - Public site pidapalooza.org - Needs upgrading from 0.11 to latest
  3. pidnotebooks - www.pidnotebooks.org - This is now just DNS records; move it into the global DNS records, no need for a separate workspace.
  4. scholix - Collection of DNS records for www.scholix.org. Potentially rename the folder structure, but mainly needs syntax updating from 0.11

Phase 4: Upgrade data-storage configs

  1. clickhouse - prod
  2. clickhouse - stage
  3. efs - prod
  4. efs - stage
  5. elasticsearch - prod
  6. elasticsearch - stage
  7. memcached - prod
  8. memcached - stage
  9. mysql - prod
  10. mysql - stage
  11. postgresql - prod - Delete as unused
  12. postgresql - stage - Delete as unused
  13. redis - prod
  14. redis - stage

Phase 5: datacite/ops and AWS cleanup

This needs investigating to ensure all AWS resources have their configuration detailed in mastino. Careful attention needs to be paid to anything that may be private. Some things are configured via the old private datacite/ops repository.

  1. Production VPC https://github.com/datacite/datacite/issues/1641
  2. Production ELB
  3. Production IAM - Exists in datacite/ops
  4. Production ECS Cluster config
  5. Stage VPC

Phase 6 (Optional): Deployment workflow

Redefine GitHub workflows to deploy the task definitions defined in application code, so that the terraform code covers just infrastructure and not config management.

richardhallett commented 11 months ago

An additional note, discovered during upgrade testing: we may need a process for upgrading the AWS provider version at the same time. See:
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/guides/version-3-upgrade
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/guides/version-4-upgrade
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/guides/version-5-upgrade
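
One possible way to stage that provider upgrade is to pin the AWS provider to a single major version per workspace and raise the pin one step at a time, working through the matching upgrade guide at each step. The version numbers below are assumptions for illustration only, not the versions currently in use, and the `required_providers` source syntax needs terraform 0.13 or later.

```hcl
terraform {
  # Assumed minimum; set this to whatever terraform version the workspace actually targets.
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"

      # Stay within one major version, then raise the constraint to "~> 4.0"
      # and later "~> 5.0", reviewing the plan after each step.
      version = "~> 3.0"
    }
  }
}
```

After each change to the constraint, running `terraform init -upgrade` followed by `terraform plan` should show whether the new provider version introduces any unexpected changes before anything is applied.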