OctopusDeploy / StepsFeedback


ECS Milestone 1 Feedback #1

Closed mcasperson closed 2 years ago

mcasperson commented 3 years ago

Please add a comment to this issue with your feedback regarding the proposed ECS integration described here.

We have a few general areas that we are actively seeking feedback on, listed below. But any feedback is welcome.

mhudson commented 3 years ago

We firmly support this RFC!

  1. Will the proposed step and target work for your ECS deployments? YES

  2. What does your ECS architecture look like? ECS Fargate, with each cluster running >10 services that sit behind an NLB (and behind API Gateway). Each service runs a Task Definition with 1-3 containers. The initial deploy is by CloudFormation, and subsequent deploys are by PowerShell, using the old Octopus blog post as a hint.

  3. Do you have multiple clusters? Yes, one cluster per environment (test/qa/uat/prod), defined in CloudFormation.

  4. Do you have multiple AWS accounts? Yes, numerous. Each (financial services) client gets their own account.

  5. What kinds of applications are you deploying? RESTful APIs running on .NET Core/Kestrel as microservices for our financial advice platform. A few are internal microservices, e.g. async message handling.

  6. What ECS deployment challenges do you wish Octopus could solve for you? a. Get rid of the custom PowerShell. b. Ideally be deployment-aware. Today, the PS is a fire-and-forget Task Definition deployment; it doesn't know whether the deployment succeeded or failed (see the sketch below this list).
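
For illustration, a minimal sketch of the missing deployment awareness, assuming the AWS CLI and placeholder cluster/service names:

```powershell
# Start a new deployment, then block until ECS reports the service stable,
# so the script knows whether the deployment actually succeeded.
$cluster = "my-cluster"   # placeholder
$service = "my-service"   # placeholder

aws ecs update-service --cluster $cluster --service $service --force-new-deployment

# 'wait services-stable' polls the service and exits non-zero if it
# fails to reach a steady state, giving the deployment a pass/fail result.
aws ecs wait services-stable --cluster $cluster --services $service
if ($LASTEXITCODE -ne 0) { throw "ECS deployment did not stabilize" }
```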

Thanks team - look forward to having this feature in the Octopus arsenal

qnm commented 3 years ago
Will the proposed step and target work for your ECS deployments?

Yes, I think so.

What does your ECS architecture look like?

Fargate based, multiple services (~20), some accessible via an ALB. We use Convox to manage deployments currently, which manages the creation of new tasks and services, and the deployment of them. We are considering migration away from Convox and think Octopus could be a good option with ECS support.

Do you have multiple clusters?

One per environment.

Do you have multiple AWS accounts?

Yes - for production and non-production.

What kinds of applications are you deploying?

Some RESTful APIs, GraphQL APIs, Background Job Queue Workers.

What ECS deployment challenges do you wish Octopus could solve for you?

I'm looking for visibility from GitHub commits to deployments, which is missing right now with Convox.

Hawxy commented 3 years ago

Will the proposed step and target work for your ECS deployments? Yes, I think so.

What does your ECS architecture look like? Fargate sitting behind an ALB. IaC via CDK v2. Currently deploying via GitHub Actions, as we aren't planning to move it into Octopus for a few months.

Do you have multiple clusters? One per product/vertical.

Do you have multiple AWS accounts? One per product/team & environment.

What kinds of applications are you deploying? RESTful + gRPC APIs.

What ECS deployment challenges do you wish Octopus could solve for you? Complete visibility over version management and the deployment lifecycle across our environments.

mcasperson commented 3 years ago

@qnm Thanks for that feedback. Git commits can be supplied by the Build Information feature in Octopus. Basically, how it works is that the build server (Jenkins, TeamCity, GitHub Actions, etc.) uploads a build information package to Octopus associated with a package (or Docker image in this case). The build information includes Git commits, which are then used in a deployment (e.g. sending an email) or simply displayed in the Octopus web console.
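
For example, a sketch of uploading build information from a build script, assuming the octo CLI's build-information command; the package ID, version, and server URL are placeholders, and buildInformation.json is the file emitted by the build server plugin:

```powershell
# Associate Git commit metadata with a package/Docker image in Octopus.
# All IDs, versions, and URLs below are placeholders.
octo build-information `
  --package-id "my-api-image" `
  --version "1.2.3" `
  --file "buildInformation.json" `
  --server "https://my.octopus.app" `
  --apiKey $env:OCTOPUS_API_KEY
```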

Chapter 5 "Visible Deployments" of our free ebook provides an example in the context of a Kubernetes deployment, but the same logic would apply to any container management system. An example is shown below - see the Commits section associated with the Docker image.

Would this meet your use case?

[Screenshot: Octopus release details showing the Commits section associated with the Docker image]

fdalyroomex commented 3 years ago

I have a query on the use of targets for providing the name of the ECS cluster, instead of just using an environment-scoped variable. I'm sure you've already been through the discussion. Is it because once the cluster has been added as a target, the issue of authentication, IAM keys, etc., doesn't have to be handled by the deployment steps?

Is it envisaged that each cluster as a deployment target will consume a licence?

Will the proposed step and target work for your ECS deployments?

Not at this milestone; we're not using Fargate.

What does your ECS architecture look like?

EC2 running multiple services, a mixture of Linux & Windows hosts, some behind ALBs.

Do you have multiple clusters?

Yes, one per environment, team & architecture

Do you have multiple AWS accounts?

Yes.

What kinds of applications are you deploying?

Various. Some .NET-based APIs, some Angular apps, some standalone tools & scheduled tasks.

What ECS deployment challenges do you wish Octopus could solve for you?

Mostly, the ability to create a service & task definition from scratch. At present they have to be set up in a semi-manual way, which is fraught. Also, a reliable way of inserting/updating environment variables into the task definition.
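
For context, a sketch of that semi-manual step, using the AWS CLI with illustrative family, image, and variable names:

```powershell
# Register a new task definition revision with environment variables
# injected at deploy time. All names and values are illustrative.
$containers = @(
  @{
    name        = "web"
    image       = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/web:1.2.3"
    essential   = $true
    memory      = 512
    environment = @(
      @{ name = "ASPNETCORE_ENVIRONMENT"; value = "Production" }
    )
  }
)

# Write the JSON to a file to avoid shell quoting issues.
ConvertTo-Json -InputObject $containers -Depth 5 | Set-Content containers.json

aws ecs register-task-definition `
  --family "web" `
  --requires-compatibilities EC2 `
  --container-definitions file://containers.json
```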

andycherry commented 3 years ago

Will the proposed step and target work for your ECS deployments? No. We don't use Fargate, we have autoscaling on some services, and we have some scheduled tasks. Also, our current PS scripts run the CF templates in, which means the CF templates are in source control; I don't think that will be the case with the proposed solution.

What does your ECS architecture look like? Linux EC2s behind ALBs.

Do you have multiple clusters? Yes, a few clusters in each environment (production, staging, dev), though for dev we target just a single cluster

Do you have multiple AWS accounts? Yes, separating environments

What kinds of applications are you deploying? Front-end apps, APIs, queue processors and scheduled tasks.

What ECS deployment challenges do you wish Octopus could solve for you? We have custom PS managing the running of the CF templates, so not maintaining custom PS. We also have a disconnect where the CF templates take longer to fail than the Octopus deployment, resulting in Octopus being ready to try again while the CF template is still trying.
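
A sketch of the blocking wait that would close that gap, assuming the AWS CLI and a placeholder stack name:

```powershell
# Block the deployment step until CloudFormation actually finishes,
# so Octopus's view of the deployment matches the stack's real state.
$stack = "my-service-stack"   # placeholder

aws cloudformation wait stack-update-complete --stack-name $stack
if ($LASTEXITCODE -ne 0) {
    # The wait exits non-zero if the update failed or rolled back.
    throw "CloudFormation update of $stack did not complete successfully"
}
```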

mcasperson commented 3 years ago

@fdalyroomex

Is it because once the cluster has been added as a target, the issue of authentication, IAM keys, etc., doesn't have to be handled by the deployment steps?

This is a big part of the decision. ECS targets are relatively simple compared to other targets, but even ECS targets require a very specific combination of account, worker, region, and cluster. If these values were on the step, these 4 fields across 3 environments would mean scoping 12 variables just right to get a successful deployment. This becomes unmanageable very quickly with more environments and tenants.

Targets also allow us to model cross-region HA deployments easily. For example, you could have two targets, one in us-east-1 and one in us-west-1. A single step could trivially deploy to both targets, whereas if the target fields were on the step you would have to copy the step or model each region as a tenant, which is not all that intuitive.

Is it envisaged that each cluster as a deployment target will consume a licence?

Yes, we do envisage that each cluster will be represented by a target, and will therefore contribute to the target count provided by a license.

mcasperson commented 3 years ago

@andycherry

No, we don't use fargate, we also have autoscaling on some services, and we have some scheduled tasks.

That is good feedback, thank you. We are definitely looking to implement EC2 based services in future milestones, and based on this feedback I think we'll have to consider scheduled tasks as well.

ThomasHjorslevFcn commented 3 years ago

Will the proposed step and target work for your ECS deployments?

For the near future, yes. EC2, Auto Scaling and App Mesh are on our to-do list. We are increasingly deploying gRPC services, which require special Load Balancer Listener settings, so those would need to be supported.

What does your ECS architecture look like?

Around 20 ECS Fargate Linux services and maybe 30-40 Windows EC2 (not Docker) services slowly being migrated to ECS.

Do you have multiple clusters?

Yes, currently one per environment

Do you have multiple AWS accounts?

No

What kinds of applications are you deploying?

.NET and Node.js Linux HTTP services (APIs and websites)

What ECS deployment challenges do you wish Octopus could solve for you?

We are currently using the Octopus CloudFormation support to deploy ECS and configure load balancers, etc., and it's mostly very smooth. However, managing variables is a pain. We never found a good way to manage app config, so we recently started putting a config file on S3; now there's just an env var pointing to the config. Being able to flexibly define container env variables from OD would be great :-)

Another pain point with CF is that if health checks fail, it takes forever (like 30 minutes) for a CF deployment to fail and you have to log into AWS to manually abort it. AWS support promised this would be improved, but with no time frame. Maybe OD could cancel the CF deployment automatically after a (shorter) timeout or perhaps detect that the deployment is failing earlier and cancel?
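
A rough sketch of that timeout-and-cancel behaviour, using the AWS CLI with placeholder names and an illustrative (shorter) timeout:

```powershell
# Poll the stack with our own (shorter) timeout and cancel the update
# if it hasn't completed in time, instead of waiting ~30 minutes.
$stack   = "my-service-stack"           # placeholder
$timeout = [TimeSpan]::FromMinutes(10)  # illustrative
$start   = Get-Date

while ((Get-Date) - $start -lt $timeout) {
    $status = aws cloudformation describe-stacks --stack-name $stack `
        --query "Stacks[0].StackStatus" --output text
    if ($status -eq "UPDATE_COMPLETE") { return }
    if ($status -like "*FAILED*" -or $status -like "*ROLLBACK*") {
        throw "Stack update failed with status $status"
    }
    Start-Sleep -Seconds 15
}

# Timed out: abort the in-flight update rather than letting it drag on.
aws cloudformation cancel-update-stack --stack-name $stack
throw "Stack update timed out and was cancelled"
```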

gary-lg commented 3 years ago

Will the proposed step and target work for your ECS deployments? Maybe. Fargate targeting is OK, but we also use scheduled tasks and scaling policies, and it depends on how images are sourced per environment.

What does your ECS architecture look like? One AWS root account creates any VPCs used. Sub-accounts are created per domain/environment, which have permission to use the VPCs via Resource Access Management (RAM) shares. Multiple clusters can exist in each environment account (one per project). Each environment account has its own Elastic Container Registry (ECR). The CI service builds and tags images, then pushes them to all repositories in all environments. ECS containers run with Fargate, scale based on load, and are served from behind an ALB. Scripts using the AWS CLI are used to add a "deploy" tag; task definitions always look for the "deploy" tag, and the scripts ask for a service reload.
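
The tag-and-reload scripts mentioned above reduce to roughly the following, with illustrative repository, tag, cluster, and service names:

```powershell
# Re-tag an already-pushed image with the "deploy" tag by copying its
# manifest (no docker pull/push required), then force a service reload.
$repo    = "my-api"   # placeholder ECR repository
$version = "1.2.3"    # placeholder source tag

$manifest = aws ecr batch-get-image --repository-name $repo `
    --image-ids imageTag=$version `
    --query "images[0].imageManifest" --output text

# NB: quoting of the manifest JSON may need care in Windows PowerShell.
aws ecr put-image --repository-name $repo --image-tag deploy --image-manifest $manifest

# Task definitions pin the ":deploy" tag, so a forced deployment
# picks up the newly tagged image.
aws ecs update-service --cluster my-cluster --service my-service --force-new-deployment
```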

Do you have multiple clusters? Yes, including clusters that have task definitions attached to schedules.

Do you have multiple AWS accounts? Yes (see above): a root account that manages shared network resources, and one account per environment.

What kinds of applications are you deploying? Some pretty standard website/API-type things, but also long-running scheduled tasks that get spun up into a cluster.

What ECS deployment challenges do you wish Octopus could solve for you? I want to know which container version is deployed in all my environments at a glance. I want rolling back/forward to be simple. I want the deploy to result in zero downtime, with an automated rollback if health checks fail (like the promised blue/green deploys in CodeDeploy, which are tough to set up correctly with resources across accounts).

blytheaw commented 3 years ago

Will the proposed step and target work for your ECS deployments? Eventually, yes. We use auto-scaling and FireLens heavily, so the first milestone may not be usable for us. The opinionated approach could be useful, but we already manage so much of the ECS/Fargate infrastructure with Terraform. I would love to be able to keep those concerns separated (infrastructure vs code deployment).

What does your ECS architecture look like? We have Fargate clusters for each environment. Each cluster has 30+ services running a number of auto-scaled tasks. A small number of Application Load Balancers are shared by the services, distinguished by various requirements (internal vs external, for example). The infrastructure portion of the setup is managed via Terraform (load balancers, ECS task definitions, services, etc.). GitHub Actions + Octopus handles the building and deploying of the application Docker images. We are currently using ECS Deploy in an Octopus script step to execute the actual deployment. This creates a copy of the task definition managed by Terraform during the deployment. We use FireLens for shipping logs to New Relic.

It is worth noting we use a shared Terraform module to repeatably create all the infrastructure for a single Fargate service: service, load balancer listeners and target groups, base task definition, CloudWatch dashboards, autoscaling policies. It would be nice to be able to keep this but leverage Octopus just to deploy new images.

Do you have multiple clusters? Yes, development and production clusters.

Do you have multiple AWS accounts? Right now we have a single AWS account.

What kinds of applications are you deploying? RESTful APIs mostly running on .NET Core

What ECS deployment challenges do you wish Octopus could solve for you?

Excited for the potential here! Thanks!

mcasperson commented 3 years ago

@ThomasHjorslevFcn

Another pain point with CF is that if health checks fail, it takes forever (like 30 minutes) for a CF deployment to fail and you have to log into AWS to manually abort it. AWS support promised this would be improved, but with no time frame. Maybe OD could cancel the CF deployment automatically after a (shorter) timeout or perhaps detect that the deployment is failing earlier and cancel?

I've run into this myself more than once, so a timeout would be a good option. Thanks for the feedback!

mcasperson commented 3 years ago

@gary-lg

like the promised blue/green deploys in CodeDeploy, which are tough to set up correctly with resources across accounts

This is interesting. Are you able to offer some more insight into the issues you have run into with blue/green deployments and multiple accounts?

mooseracerPT commented 3 years ago

Will the proposed step and target work for your ECS deployments?

Maybe? It depends on whether it supports importing existing ECS services.

What does your ECS architecture look like?

Fargate-only. One ECS cluster per environment+microservice, e.g. test-thing1, prod-thing1, test-thing2, prod-thing2. Multiple services per cluster, usually a combination of long-lived APIs with run-then-exit maintenance tasks.

We manage all AWS infrastructure with Terraform and deploy to ECS through CI pipelines (Bitbucket, GitHub). We've avoided running Terraform from pipelines, so the Task Definitions do not regularly get updated when new images are published. We've managed to keep them static by tagging Docker images with environment names rather than versions, and we rely on the CI pipelines for environmental auditing/re-deployability.

Example flow from Bitbucket Pipelines / Github Actions:

  1. .NET Core build & unit tests
  2. docker build, publish to ECR tagged to the environment (e.g. :test)
  3. update the ECS service with --force-new-deployment
  4. invoke short-lived maintenance task (we run Roundhouse to execute database scripts)
  5. automated API tests

When deploying to higher environments we don't build the image again; we just re-tag the previous environment (e.g. tag :test as :prod and update the ECS service).
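
Step 4 of the flow above (invoking a run-then-exit maintenance task) looks roughly like this with the AWS CLI; the cluster, task definition, and network IDs are placeholders:

```powershell
# Fire a one-off Fargate task (e.g. Roundhouse database migrations)
# and wait for it to exit before running the automated API tests.
$cluster = "test-thing1"   # placeholder

$taskArn = aws ecs run-task --cluster $cluster --launch-type FARGATE `
    --task-definition "thing1-maintenance" `
    --network-configuration "awsvpcConfiguration={subnets=[subnet-0abc],securityGroups=[sg-0abc],assignPublicIp=DISABLED}" `
    --query "tasks[0].taskArn" --output text

aws ecs wait tasks-stopped --cluster $cluster --tasks $taskArn

# Use the container's exit code as the pass/fail signal.
$exitCode = aws ecs describe-tasks --cluster $cluster --tasks $taskArn `
    --query "tasks[0].containers[0].exitCode" --output text
if ($exitCode -ne "0") { throw "Maintenance task failed with exit code $exitCode" }
```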

Do you have multiple clusters?

dozens

Do you have multiple AWS accounts?

yes

What kinds of applications are you deploying?

New apps are primarily .NET Core on ECS. The bulk are still legacy .NET deployed by Octopus to Windows Server on EC2.

What ECS deployment challenges do you wish Octopus could solve for you?

It would definitely be nice to have Task Definition management moved to Octopus instead of Terraform. But having the CloudFormation it generates play nicely with what we control in Terraform might be tricky. And I can see issues coming up with secrets management -- we have task definitions populate them directly from AWS Secrets Manager, and don't want duplicate secrets all over the place. We do this so environment variables can be managed outside of Terraform (i.e. managed by Dev teams rather than only Ops). This leads to the following ideal: decouple the deployment of application infrastructure from application code from application configuration. Each should be independently deployable.

Unrelated: I'd be keen on having your proposed Octopus Target for ECS support using your Octopus Proxy servers, so that we could do all the IAM for managing ECS on our proxy instances' roles.

mcasperson commented 3 years ago

@blytheaw

We are currently using ECS Deploy in an Octopus script step to execute the actual deployment. This creates a copy of the task definition managed by Terraform during the deployment.

I was interested to get some more details on this workflow.

Reading your comment and the Terraform issue, it sounds like you use Terraform to create the initial ECS environment, which is mostly static. And then you use Octopus to manage the day-to-day deployments, which copies the initial task definition created by Terraform and presumably updates it with "dynamic" values like Docker image tags. Is this correct?

Allow de-coupling of infrastructure management (load balancers, ECS services) from application deployment.

If I'm reading this correctly you'd like to create the initial ECS services and load balancers with Terraform, and have the deployment workflow only update a small subset of the configuration for those resources e.g. update the service with a new task definition. Is this correct?

mcasperson commented 3 years ago

@mooseracerPT

Unrelated: I'd be keen on having your proposed Octopus Target for ECS support using your Octopus Proxy servers, so that we could do all the IAM for managing ECS on our proxy instances' roles.

We didn't show the various IAM fields in the RFC mockups, but we would implement the same logic that is exposed on the Kubernetes targets, which allows authentication either by an account or via an EC2 IAM role assigned to a worker, both of which then have the additional option of inheriting a second role. The screenshot below shows the fields that would appear on the ECS target.

I just want to double-check what you were referring to when you mentioned "Octopus Proxy servers". I've assumed you were talking about a worker, but are you talking about a more traditional proxy server here?

[Screenshot: ECS target authentication fields, showing the account and IAM role options]

mooseracerPT commented 3 years ago

@mcasperson

IAM fields ... are you talking about a more traditional proxy server here?

Cool, those IAM fields look good. I was referring to a traditional proxy, yes. The only Octopus worker we use at the moment is on the Octopus server itself, but we could use that model and install them on our proxies too, thanks. 👍

If I'm reading this correctly you'd like to create the initial ECS services and load balancers with Terraform, and have the deployment workflow only update a small subset of the configuration for those resources e.g. update the service with a new task definition. Is this correct?

Yep, and it would be good if Octopus keeps polling the status of that ECS deployment so the deployer knows when it's done.

mcasperson commented 3 years ago

@mooseracerPT - Thanks for clarifying!

gary-lg commented 3 years ago

@mcasperson

This is interesting. Are you able to offer some more insight into the issues you have run into with blue/green deployments and multiple accounts?

Sure, here's what we wanted to happen:

The root account contains the build pipeline. Commits trigger a CodeBuild task which builds, tests, creates Docker images, and pushes them into ECR. We then move to a deploy step which would, after manual approval, deploy into the staging account cluster and, after a second approval, push into the production account cluster.

The issue is that we couldn't get the CodeDeploy pipeline in the root account to talk to the ECS cluster in the sub-account, regardless of permission and role settings - it was always looking for a cluster by name (not ARN) in the containing account. We could have got around this with pipelines in each account triggered by notifications or webhooks, but at that point it defeats the object of having our configuration all in one place and visible in "one pane of glass".

As a result, we went with writing scripts and triggering them from our existing CI/CD tool (CircleCI at the time of writing, because that's where we build our images).

Does that answer any questions you had?

mcasperson commented 3 years ago

@gary-lg Thanks for the additional information, it does answer my question. Much appreciated.

mcasperson commented 3 years ago

Thank you to all who provided feedback so far. This has been very helpful, and we'll use these comments to help shape the ECS functionality going forward.

To summarize the major themes of the responses so far:

  • Multiple accounts and multiple clusters must be supported.
  • EC2-based workflows are a requirement for many.
  • Blue/green and roll forward/back must be easy.
  • Variable management is a challenge.
  • Teams may have existing task definitions and services that they would like to deploy into with Octopus.
  • Infrastructure management (via CloudFormation, Terraform, etc.) is critical, and teams may want to retain control of this.

blytheaw commented 3 years ago

@mcasperson

@blytheaw

We are currently using ECS Deploy in an Octopus script step to execute the actual deployment. This creates a copy of the task definition managed by Terraform during the deployment.

I was interested to get some more details on this workflow.

Sure, we are using this tool in an Octopus Script step. This tool is essentially doing the "heavy lifting" of the deployment. We first make a copy of a "template" task definition that is managed by Terraform. We update the image tag as part of this copy. Finally, we use the ecs deploy CLI command to execute the deployment, which handles the rolling deploy, rollback on failure, etc.
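
In outline, the copy-and-update that the tool performs is equivalent to this AWS CLI sequence (family, image, cluster, and service names are illustrative; the tool's rollout watching and rollback are omitted):

```powershell
# Fetch the current "template" task definition managed by Terraform.
# (PowerShell 7+: multi-line JSON output pipes straight into ConvertFrom-Json.)
$td = aws ecs describe-task-definition --task-definition "web-template" `
    --query "taskDefinition" | ConvertFrom-Json

# Point the container at the newly built image tag.
$td.containerDefinitions[0].image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:1.2.3"

# Strip read-only fields that register-task-definition rejects.
"taskDefinitionArn", "revision", "status", "requiresAttributes",
"compatibilities", "registeredAt", "registeredBy" |
    ForEach-Object { $td.PSObject.Properties.Remove($_) }

$td | ConvertTo-Json -Depth 10 | Set-Content new-taskdef.json
$newArn = aws ecs register-task-definition --cli-input-json file://new-taskdef.json `
    --query "taskDefinition.taskDefinitionArn" --output text

# Roll the service onto the new revision.
aws ecs update-service --cluster "prod" --service "web" --task-definition $newArn
```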

Reading your comment and the Terraform issue, it sounds like you use Terraform to create the initial ECS environment, which is mostly static. And then you use Octopus to manage the day-to-day deployments, which copies the initial task definition created by Terraform and presumably updates it with "dynamic" values like Docker image tags. Is this correct?

Yes, the pace and frequency of the changes Terraform might make to things like autoscaling configuration, min/max tasks, task size, etc. is much slower than the rate at which we deploy code changes in new Docker images.

Allow de-coupling of infrastructure management (load balancers, ECS services) from application deployment.

If I'm reading this correctly you'd like to create the initial ECS services and load balancers with Terraform, and have the deployment workflow only update a small subset of the configuration for those resources e.g. update the service with a new task definition. Is this correct?

Yes, that is correct.

mcasperson commented 3 years ago

@blytheaw Thanks for the additional information!

pete-may-bam commented 3 years ago

Will the proposed step and target work for your ECS deployments? It looks like it.

What does your ECS architecture look like? API Gateway -> Load balancer -> Fargate managed containers

Do you have multiple clusters? Yes, one per application per environment (Dev / UAT / Production).

Do you have multiple AWS accounts? Yes, one per environment, and a common account for ECR.

What kinds of applications are you deploying? ASP.NET Core RESTful APIs.

What ECS deployment challenges do you wish Octopus could solve for you?

  • Configuring autoscaling (see the sketch after this list)
  • Scheduled patching
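
For the autoscaling item, a sketch of the configuration involved, assuming the AWS CLI and illustrative resource IDs, limits, and thresholds:

```powershell
# Make the service's DesiredCount scalable, then attach a target-tracking
# policy on average CPU. All names and numbers are illustrative.
aws application-autoscaling register-scalable-target `
    --service-namespace ecs `
    --scalable-dimension ecs:service:DesiredCount `
    --resource-id "service/prod-cluster/my-api" `
    --min-capacity 2 --max-capacity 10

# Keep average CPU around 60% by scaling task count up and down.
@{
    TargetValue = 60.0
    PredefinedMetricSpecification = @{
        PredefinedMetricType = "ECSServiceAverageCPUUtilization"
    }
} | ConvertTo-Json -Depth 3 | Set-Content scaling-policy.json

aws application-autoscaling put-scaling-policy `
    --service-namespace ecs `
    --scalable-dimension ecs:service:DesiredCount `
    --resource-id "service/prod-cluster/my-api" `
    --policy-name "cpu-60" `
    --policy-type TargetTrackingScaling `
    --target-tracking-scaling-policy-configuration file://scaling-policy.json
```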

I like the idea of treating an ECS cluster as a deployment target. Not sure that it makes sense to assign roles to an ECS cluster (🤔) - it wasn't something you mentioned, but a thought I had.

As others have said, I see value in separating configuring the infrastructure from deployments. I plan to use Terraform to set up the infrastructure, but would like to run it from Octopus to enable transient environments for development teams.

mcasperson commented 2 years ago

Thank you all for your feedback. This has been very helpful.

We have committed to delivering the first ECS milestone in the current cycle of work, and are looking to include it in an upcoming release of Octopus (we are aiming for 2021.3, although this is not finalized yet).

Although this cycle will not deliver a step that allows a new image tag to be updated in an existing service (for example, a service that has already been created manually or via a Terraform template), we have taken that feedback on board and will likely include support for this scenario in the next ECS milestone.

It is also clear that support for EC2 instances is a requirement for many people, so this too will be something we look to include in future milestones.

Keep an eye on the Octopus blog for updates on the first ECS milestone, and a new RFC for any new milestones to come.

mcasperson commented 2 years ago

We have an RFC for the second milestone of ECS work. The blog post describing this new milestone can be found here, and the feedback issue can be found here.

Milestone 2 was heavily influenced by the feedback we received from the first RFC, so thank you all for sharing your use cases.

mcasperson commented 2 years ago

@pete-may-bam and @mooseracerPT - The new RFC focuses on deploying to existing ECS clusters, to support teams that wish to maintain control over how infrastructure is created and want to delegate image deployments to Octopus. I'd be interested to hear whether the proposed functionality would work for your use cases.

mcasperson commented 2 years ago

@mhudson @qnm @Hawxy @fdalyroomex @andycherry @ThomasHjorslevFcn @gary-lg @blytheaw @mooseracerPT @pete-may-bam

We are getting close to releasing milestone 1 of our ECS integration, and as a thank you for commenting on this RFC I have recorded a short demo of the new feature as a sneak peek.

https://youtu.be/mLy4_uo6qtw

The feedback for milestone 2 can be found here.

Thanks again for submitting your feedback to this RFC!

rhysparry commented 2 years ago

I'm happy to do a personal walkthrough if you'd like to know more. Book time in my calendar here: https://calendly.com/rhysparry

rhysparry commented 2 years ago

We have released Milestone 1 of the ECS integration: https://octopus.com/blog/octopus-release-2021-q4

Milestone 2 is on its way.