aws-samples / aws-fault-injection-simulator-workshop

Extend ECS section to cover stop-task action #274

rudpot commented 2 years ago

Hi All, the FIS service and console have launched a new action, aws:ecs:stop-task (ECS Stop Tasks), in all commercial regions. See the documentation below.

User Guide

* https://docs.aws.amazon.com/fis/latest/userguide/getting-started-iam-service-role.html
* https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html#ecs-actions-reference
* https://docs.aws.amazon.com/fis/latest/userguide/targets.html
* https://docs.aws.amazon.com/fis/latest/userguide/using-service-linked-roles.html
* https://docs.aws.amazon.com/fis/latest/userguide/security-iam-awsmanpol.html

API Reference

* https://docs.aws.amazon.com/fis/latest/APIReference/API_Operations.html
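
For orientation, here is a rough sketch of what an experiment template using the new action could look like in CDK (TypeScript). This is not part of the workshop; the role ARN, tag values, and construct names are placeholders, and the target/action wiring should be checked against the FIS actions reference linked above.

```ts
import { Stack, StackProps } from 'aws-cdk-lib';
import * as fis from 'aws-cdk-lib/aws-fis';
import { Construct } from 'constructs';

export class EcsStopTaskExperimentStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new fis.CfnExperimentTemplate(this, 'EcsStopTaskTemplate', {
      description: 'Stop half of the ECS tasks tagged for the workshop',
      // Placeholder role; it needs a fis.amazonaws.com trust policy and ecs:StopTask permissions.
      roleArn: 'arn:aws:iam::123456789012:role/FisWorkshopServiceRole',
      stopConditions: [{ source: 'none' }],
      tags: { Name: 'FisWorkshopEcsStopTask' },
      targets: {
        workshopTasks: {
          resourceType: 'aws:ecs:task',
          // Placeholder tag; any way of selecting the workshop tasks works here.
          resourceTags: { FisWorkshop: 'true' },
          selectionMode: 'PERCENT(50)',
        },
      },
      actions: {
        stopTasks: {
          actionId: 'aws:ecs:stop-task',
          parameters: {},
          targets: { Tasks: 'workshopTasks' },
        },
      },
    });
  }
}
```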

SH4DY commented 2 years ago

Hi @rudpot, I'd be happy to add this section. I'll run through the workshop later today and I could then outline my ideas. Is that ok?

SH4DY commented 2 years ago

I see that the workshop contains a section about ECS, using the aws:ec2:terminate-instances action on the underlying ASG of the ECS cluster. Then it guides the reader to implement a minimum ASG size of 2 and to increase the task count.

I can think of two options to include stop-task:

  1. Continuing after the FisWorkshopECS experiment, we set up another experiment with stop-task and let the reader observe how the website stays available when we kill 50% of the tasks belonging to the service. This is in line with the EC2 experiment.

  2. Graceful degradation: We modify the website to depend on an auxiliary service, which we then stop, causing the website to crash. Then we let the reader fix the dependency, so the website stays available without the functionality the auxiliary service provides.

What do you think @rudpot?

rudpot commented 2 years ago

Ideally we would demonstrate this on ECS/Fargate instead of ECS/EC2.

I think both are viable and the latter would demonstrate good resilience best practices but is more work. If you want to use some pre-existing materials, this might be helpful: https://github.com/gunnargrosch/serverless-chaos-demo-nodejs in conjunction with Gunnar's talk at conf42.

There is also the distinction whether the task is part of a service (in line with the EC2 experiment) or a standalone worker task, so if you think there is a real-world use case where you might want to kill off a standalone task, that's an option.

Finally, if you want to include some commentary on ECS with Docker Compose, that could be interesting. I expect that this will just generate tasks with multiple containers attached, so it probably doesn't make a difference, but it's worth exploring.

SH4DY commented 2 years ago

Hey @rudpot, I agree we should use Fargate.

I thought about how to show a succinct but still realistic scenario with a worker task and wrote a few lines of code yesterday. Here is my idea: We provide a CDK app that defines a Fargate task with a little Node app, simulating a simple batch job. This batch job writes its results into DynamoDB. The reader starts the experiment to kill the task before it finishes. The expectation when restarting the task should be that it continues where it was stopped, instead of starting at 0.

I wrote a little app that just writes an incrementing number into Dynamo until it reaches batchJobSize. We can let the reader inspect the progress on the DynamoDB console or through CloudWatch.
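
To make the proposal concrete, here is a minimal sketch of what such a batch job could look like (TypeScript, AWS SDK v3). The table name, key schema, and environment variables are made up for illustration and are not the actual workshop code.

```ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Hypothetical configuration; the real values would come from the CDK stack.
const tableName = process.env.TABLE_NAME ?? 'FisWorkshopBatchJob';
const batchJobSize = Number(process.env.BATCH_JOB_SIZE ?? 100);

async function main(): Promise<void> {
  for (let i = 0; i < batchJobSize; i++) {
    // Write one "result" item per index; the table is keyed on jobId (partition) and index (sort).
    await ddb.send(new PutCommand({
      TableName: tableName,
      Item: { jobId: 'demo-job', index: i, writtenAt: new Date().toISOString() },
    }));
    // Progress shows up in CloudWatch Logs so the reader can observe it.
    console.log(`wrote item ${i} of ${batchJobSize}`);
    // Slow the job down so there is enough time to interrupt it with FIS.
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
  console.log('batch job complete');
}

main().catch((err) => { console.error(err); process.exit(1); });
```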

Thoughts?

rudpot commented 2 years ago

It's a good way to create visibility for the experiment. Here are a few thoughts/questions for implementing that:

SH4DY commented 2 years ago
  1. See answer below. I'd use CloudWatch logs to show that continue-on-restart worked as intended (every DB write logs its index). Similar to the ECS experiment that uses log output.
  2. Does ECS have some sort of auto-restart for tasks (not services)? If not, I can only think of using Lambda and/or EventBridge to add this capability. But that would add a lot of complexity to the workshop. There is value to show standalone tasks because the reader has seen ECS services in earlier chapters. And using an ECS service for a batch job is not great either. My idea was to let the reader add the continue-on-restart capability as an improvement after the first experiment run. The reader would restart the task manually. At the end of the chapter, we could simply mention an auto-restart capability as additional improvement.
  3. Sure, I can add it into the ecs template.
  4. There is one downside to all of this: The reader will need to have a running Docker daemon if we want them to be able to add the continue-on-restart capability into the Node app (build container as part of CDK deploy). Alternatively, we could provide 2 container images (one without continue-on-restart, one with continue-on-restart) and host it on ECR.
rudpot commented 2 years ago

Sure, tracking progress in CloudWatch works.

Regarding restart mechanisms, that was just one thought. If you do build in a restart, it would need to be external to the task, because the task is being killed.
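
For reference, one way such an external restart could be wired up is an EventBridge rule on "ECS Task State Change" events that triggers a small Lambda calling RunTask. This is only a sketch; the environment variables and the filtering logic are assumptions, not part of the workshop.

```ts
import { ECSClient, RunTaskCommand } from '@aws-sdk/client-ecs';

const ecs = new ECSClient({});

// Hypothetical environment variables; the real values would come from the CDK stack.
const cluster = process.env.CLUSTER_NAME!;
const taskDefinition = process.env.TASK_DEFINITION_ARN!;
const subnets = (process.env.SUBNET_IDS ?? '').split(',');

export const handler = async (event: any): Promise<void> => {
  // The EventBridge rule would already filter on lastStatus = STOPPED; this check is a safety net.
  if (event.detail?.lastStatus !== 'STOPPED') return;

  // In practice you would also inspect why the task stopped (e.g. stop code or container exit code)
  // so that a job that completed normally is not restarted in a loop.
  await ecs.send(new RunTaskCommand({
    cluster,
    taskDefinition,
    launchType: 'FARGATE',
    networkConfiguration: {
      awsvpcConfiguration: { subnets, assignPublicIp: 'ENABLED' },
    },
  }));
};
```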

SH4DY commented 2 years ago

Quick update here:

Will post the instructions for review later this week.

Still ToDo:

rudpot commented 2 years ago

Would you mind putting this on a branch like sh4dy/fix-274 and pushing it up to the repo - that makes it easier to comment / collaborate.

SH4DY commented 2 years ago

Hi Rudolf!

Sure, I've put everything into branch fix-274 in my fork. You can check it here.

I could also create a PR against fix-274 in this repo, but you would need to create the branch for me. I don't think I have permissions for that.

Currently you will find the new chapter structure and instructions in fix-274. I have not committed CDK code yet. There are quite a few file changes because I moved some folders around in the ECS chapter....

SH4DY commented 2 years ago

Hi @rudpot I made some more progress here.

Here is the compare with all commits: https://github.com/aws-samples/aws-fault-injection-simulator-workshop/compare/main...SH4DY:fix-274

rudpot commented 2 years ago

Hi SH4DY!

I don't have a good way to comment inline, so I'm doing this by email. My biggest concern is that your workflow is sensitive to how quickly the workshop user starts the ECS tasks and the FIS experiments, and there are a lot of race-condition edge cases that will leave the user with a bad experience. Please review and improve.

Amazon Elastic Container Service (ECS) is a fully managed container orchestration service that helps you easily deploy, manage, and scale containerized applications. It deeply integrates with the rest of the AWS platform to provide a secure and easy-to-use solution for running container workloads in the cloud and now on your infrastructure with Amazon ECS Anywhere.

In this section you will find two experiments. In the first you will spin up a web service on ECS and use FIS to explore and improve its resilience. In the second, you will work with an invoked ECS task that represents something like a recurring batch job that needs to be restarted by external automation on any failure. You will use FIS to simulate a failure by stopping the task, and use the learnings to improve the overall performance of the task.

Note: what happens to your experiment design if multiple instances of the task are started concurrently? Probably BadThings™. My suggestion would be to either make the invocation idempotent so that concurrency is not a problem, or to at least comment on the problems with the experiment design, maybe in the Learning & Improving section.
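
One hedged sketch of what "make the invocation idempotent" could mean for the DynamoDB writes in the proposed batch job (attribute names and key schema follow the earlier sketch and are assumptions): use a conditional put so that a second concurrent task instance simply skips items that were already written.

```ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Writes the item only if no item with the same primary key (jobId + index) exists yet.
// A concurrent task instance hitting the same index gets a ConditionalCheckFailedException,
// which we treat as "already done" rather than an error.
export async function writeOnce(tableName: string, jobId: string, index: number): Promise<void> {
  try {
    await ddb.send(new PutCommand({
      TableName: tableName,
      Item: { jobId, index },
      ConditionExpression: 'attribute_not_exists(jobId)',
    }));
  } catch (err: any) {
    if (err.name !== 'ConditionalCheckFailedException') throw err;
  }
}
```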

Overall, can you rework the experiment idea / learnings approach a little?

Option 1: Do you simply assume that restarting the job will eventually lead to completion, and the learning is that this is suboptimal and you want to add checkpointing (Spot instance example)? In that case, update the hypothesis and learnings.

Option 2: Do you assume that the task is already capable of restarting but the source code contains a mistake that prevents that? In that case your code should already start with a continueAt() function that's broken, e.g. by ignoring the cont argument.
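
If Option 2 is chosen, a working continueAt() would roughly need to look up how far the previous run got before resuming. Here is a sketch under the assumptions of the batch-job code above (table keyed on jobId and a numeric index sort key; names are hypothetical); the "broken" variant rudpot describes would simply ignore this lookup and return 0.

```ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Returns the index the job should continue from: one past the highest index
// already written for this jobId, or 0 if nothing has been written yet.
export async function continueAt(tableName: string, jobId: string): Promise<number> {
  const result = await ddb.send(new QueryCommand({
    TableName: tableName,
    KeyConditionExpression: 'jobId = :j',
    ExpressionAttributeValues: { ':j': jobId },
    ScanIndexForward: false, // highest sort key (index) first
    Limit: 1,
  }));
  const last = result.Items?.[0]?.index;
  return typeof last === 'number' ? last + 1 : 0;
}
```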

Specific issues to address

 *   Fix “graceful degradation” throughout
 *   Validation procedure: "run task successfully" isn't a well-defined validation procedure, and the sequence of actions is not really clear. What is the workflow you expect the human to perform? Start the task, then (immediately or 10 minutes later) start the FIS experiment? Could the user be fast enough that there is no log record before the task gets killed? Could the user be slow enough that there is no longer a task, so the associated log record can't be seen? What constitutes "success"? Also, if the user lets it run to completion the first time, why does it work the second time (see the DB clear action in Learning and Improving)?

Cheers,

Rudolf

SH4DY commented 2 years ago

Hi @rudpot

thanks for your comments:

The nature of tasks is that they run for a limited amount of time, so the experiment is somewhat time-sensitive. I agree with you. I still believe it's a worthwhile learning for our readers to launch/interrupt/observe/improve such a task. We can easily increase the time the task needs to complete, which will give the reader plenty of time to interrupt the task and observe how it behaves on restart.

I addressed all your feedback.

Detailed changes:

  1. Changed wording and explanations throughout to "checkpointing"
  2. Experiment idea: The idea is that the job will already complete when restarted but it doesn't have any notion of checkpointing. When it's interrupted, it will start from index 0. The IMPROVEMENT is adding checkpointing. I have adjusted the writeup to clarify this.
  3. Added a note that the experiment is not designed to be run with multiple concurrent tasks. The improvement section contains ideas how parallelization can be introduced.
  4. Changed to CloudWatch console for progress/result checking everywhere
  5. Changed code samples in IMPROVEMENT section
  6. Provided reasoning for improvements (time, resource savings, cost)

Here is the compare: https://github.com/aws-samples/aws-fault-injection-simulator-workshop/compare/main...SH4DY:fix-274

Once I have your green light when it comes to the overall writeup and experiment idea, I will:

  1. Update screenshots
  2. Merge my code into the project and update some sections
  3. Insert links to code into writeup
  4. Address all TODO's