aws-samples / aws-fault-injection-simulator-workshop

Extend ECS section to cover stop-task action #274

rudpot commented 2 years ago

Hi All, the FIS service and console have launched a new action, aws:ecs:stop-task (ECS Stop Tasks), in all commercial regions. See the documentation below.

User Guide

* https://docs.aws.amazon.com/fis/latest/userguide/getting-started-iam-service-role.html
* https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html#ecs-actions-reference
* https://docs.aws.amazon.com/fis/latest/userguide/targets.html
* https://docs.aws.amazon.com/fis/latest/userguide/using-service-linked-roles.html
* https://docs.aws.amazon.com/fis/latest/userguide/security-iam-awsmanpol.html

API Reference

* https://docs.aws.amazon.com/fis/latest/APIReference/API_Operations.html
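
For orientation, here is a rough sketch of what an experiment template using the new action could look like in CDK (TypeScript). This is not part of the workshop; the role ARN, tag values, and construct names are placeholders, and the target/action wiring should be checked against the FIS actions reference linked above.

```ts
import { Stack, StackProps } from 'aws-cdk-lib';
import * as fis from 'aws-cdk-lib/aws-fis';
import { Construct } from 'constructs';

export class EcsStopTaskExperimentStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new fis.CfnExperimentTemplate(this, 'EcsStopTaskTemplate', {
      description: 'Stop half of the ECS tasks tagged for the workshop',
      // Placeholder role; it needs a fis.amazonaws.com trust policy and ecs:StopTask permissions.
      roleArn: 'arn:aws:iam::123456789012:role/FisWorkshopServiceRole',
      stopConditions: [{ source: 'none' }],
      tags: { Name: 'FisWorkshopEcsStopTask' },
      targets: {
        workshopTasks: {
          resourceType: 'aws:ecs:task',
          // Placeholder tag; any way of selecting the workshop tasks works here.
          resourceTags: { FisWorkshop: 'true' },
          selectionMode: 'PERCENT(50)',
        },
      },
      actions: {
        stopTasks: {
          actionId: 'aws:ecs:stop-task',
          parameters: {},
          targets: { Tasks: 'workshopTasks' },
        },
      },
    });
  }
}
```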

SH4DY commented 2 years ago

Hi @rudpot, I'd be happy to add this section. I'll run through the workshop later today and I could then outline my ideas. Is that ok?

SH4DY commented 2 years ago

I see that the workshop contains a section about ECS, using the aws:ec2:terminate-instances action on the underlying ASG of the ECS cluster. Then it guides the reader to implement a minimum ASG size of 2 and to increase the task count.

I can think of two options to include stop-task:

  1. Continuing after the FisWorkshopECS experiment, we set up another experiment with stop-task and let the reader observe how the website stays available when we kill 50% of the tasks belonging to the service. This is in line with the EC2 experiment.

  2. Graceful degradation: We modify the website to depend on an auxiliary service, which we then stop, causing the website to crash. Then we let the reader fix the dependency, so the website stays available without the functionality the auxiliary service provides.

What do you think @rudpot?

rudpot commented 2 years ago

Ideally we would demonstrate this on ECS/Fargate instead of ECS/EC2.

I think both are viable and the latter would demonstrate good resilience best practices but is more work. If you want to use some pre-existing materials, this might be helpful: https://github.com/gunnargrosch/serverless-chaos-demo-nodejs in conjunction with Gunnar's talk at conf42.

There is also the distinction whether the task is part of a service (in line with the EC2 experiment) or a standalone worker task, so if you think there is a real-world use case where you might want to kill off a standalone task, that's an option.

Finally, if you want to include some commentary on ECS with Docker Compose, that could be interesting. I expect that this will just generate tasks with multiple containers attached, so it probably doesn't make a difference, but it's worth exploring.

SH4DY commented 2 years ago

Hey @rudpot, I agree we should use Fargate.

I thought about how to show a succinct but still realistic scenario with a worker task and wrote a few lines of code yesterday. Here is my idea: We provide a CDK app that defines a Fargate task with a little Node app, simulating a simple batch job. This batch job writes its results into DynamoDB. The reader starts the experiment to kill the task before it finishes. The expectation when restarting the task should be that it continues where it was stopped, instead of starting at 0.

I wrote a little app that just writes an incrementing number into Dynamo until it reaches batchJobSize. We can let the reader inspect the progress on the DynamoDB console or through CloudWatch.
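
To make the proposal concrete, here is a minimal sketch of what such a batch job could look like (TypeScript, AWS SDK v3). The table name, key schema, and environment variables are made up for illustration and are not the actual workshop code.

```ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Hypothetical configuration; the real values would come from the CDK stack.
const tableName = process.env.TABLE_NAME ?? 'FisWorkshopBatchJob';
const batchJobSize = Number(process.env.BATCH_JOB_SIZE ?? 100);

async function main(): Promise<void> {
  for (let i = 0; i < batchJobSize; i++) {
    // Write one "result" item per index; the table is keyed on jobId (partition) and index (sort).
    await ddb.send(new PutCommand({
      TableName: tableName,
      Item: { jobId: 'demo-job', index: i, writtenAt: new Date().toISOString() },
    }));
    // Progress shows up in CloudWatch Logs so the reader can observe it.
    console.log(`wrote item ${i} of ${batchJobSize}`);
    // Slow the job down so there is enough time to interrupt it with FIS.
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
  console.log('batch job complete');
}

main().catch((err) => { console.error(err); process.exit(1); });
```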

Thoughts?

rudpot commented 2 years ago

It's a good way to create visibility for the experiment. Here are a few thoughts/questions for implementing that:

SH4DY commented 2 years ago
  1. See answer below. I'd use CloudWatch logs to show that continue-on-restart worked as intended (every DB write logs its index). Similar to the ECS experiment that uses log output.
  2. Does ECS have some sort of auto-restart for tasks (not services)? If not, I can only think of using Lambda and/or EventBridge to add this capability. But that would add a lot of complexity to the workshop. There is value to show standalone tasks because the reader has seen ECS services in earlier chapters. And using an ECS service for a batch job is not great either. My idea was to let the reader add the continue-on-restart capability as an improvement after the first experiment run. The reader would restart the task manually. At the end of the chapter, we could simply mention an auto-restart capability as additional improvement.
  3. Sure, I can add it into the ecs template.
  4. There is one downside to all of this: The reader will need to have a running Docker daemon if we want them to be able to add the continue-on-restart capability into the Node app (build container as part of CDK deploy). Alternatively, we could provide 2 container images (one without continue-on-restart, one with continue-on-restart) and host it on ECR.
rudpot commented 2 years ago

Sure, tracking progress in CloudWatch works.

Regarding restart mechanisms, that was just one thought. If you do build in a restart, it would need to be external to the task, because the task is being killed.
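
For reference, one way such an external restart could be wired up is an EventBridge rule on "ECS Task State Change" events that triggers a small Lambda calling RunTask. This is only a sketch; the environment variables and the filtering logic are assumptions, not part of the workshop.

```ts
import { ECSClient, RunTaskCommand } from '@aws-sdk/client-ecs';

const ecs = new ECSClient({});

// Hypothetical environment variables; the real values would come from the CDK stack.
const cluster = process.env.CLUSTER_NAME!;
const taskDefinition = process.env.TASK_DEFINITION_ARN!;
const subnets = (process.env.SUBNET_IDS ?? '').split(',');

export const handler = async (event: any): Promise<void> => {
  // The EventBridge rule would already filter on lastStatus = STOPPED; this check is a safety net.
  if (event.detail?.lastStatus !== 'STOPPED') return;

  // In practice you would also inspect why the task stopped (e.g. stop code or container exit code)
  // so that a job that completed normally is not restarted in a loop.
  await ecs.send(new RunTaskCommand({
    cluster,
    taskDefinition,
    launchType: 'FARGATE',
    networkConfiguration: {
      awsvpcConfiguration: { subnets, assignPublicIp: 'ENABLED' },
    },
  }));
};
```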

SH4DY commented 2 years ago

Quick update here:

Will post the instructions for review later this week.

Still ToDo:

rudpot commented 2 years ago

Would you mind putting this on a branch like sh4dy/fix-274 and pushing it up to the repo - that makes it easier to comment / collaborate.

SH4DY commented 2 years ago

Hi Rudolf!

Sure, I've put everything into branch fix-274 in my fork. You can check it here.

I could also create a PR against fix-274 in this repo, but you would need to create the branch for me. I don't think I have permissions for that.

Currently you will find the new chapter structure and instructions in fix-274. I have not committed CDK code yet. There are quite a few file changes because I moved some folders around in the ECS chapter....

SH4DY commented 2 years ago

Hi @rudpot I made some more progress here.

Here is the compare with all commits: https://github.com/aws-samples/aws-fault-injection-simulator-workshop/compare/main...SH4DY:fix-274

rudpot commented 2 years ago

Hi SH4DY!

I don't have a good way to comment inline, so I'm doing this by email. My biggest concern is that your workflow is sensitive to how quickly the workshop user starts the ECS tasks and the FIS experiments, and there are a lot of race-condition edge cases that will leave the user with a bad experience. Please review and improve.

Amazon Elastic Container Service (ECS) is a fully managed container orchestration service that helps you easily deploy, manage, and scale containerized applications. It deeply integrates with the rest of the AWS platform to provide a secure and easy-to-use solution for running container workloads in the cloud and now on your infrastructure with Amazon ECS Anywhere.

In this section you will find two experiments. In the first you will spin up a web service on ECS and use FIS to explore and improve its resilience. In the second, you will work with an invoked ECS task that represents something like a recurring batch job that needs to be restarted by external automation on any failure. You will use FIS to simulate a failure by stopping the task, and use the learnings to improve the overall performance of the task.

Note: what happens to your experiment design if multiple instances of the task are started concurrently? Probably BadThings™. My suggestion would be to either make the invocation idempotent so that concurrency is not a problem, or to at least comment on the problems with the experiment design, maybe in the Learning & Improving section.
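
One hedged sketch of what "make the invocation idempotent" could mean for the DynamoDB writes in the proposed batch job (attribute names and key schema follow the earlier sketch and are assumptions): use a conditional put so that a second concurrent task instance simply skips items that were already written.

```ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Writes the item only if no item with the same primary key (jobId + index) exists yet.
// A concurrent task instance hitting the same index gets a ConditionalCheckFailedException,
// which we treat as "already done" rather than an error.
export async function writeOnce(tableName: string, jobId: string, index: number): Promise<void> {
  try {
    await ddb.send(new PutCommand({
      TableName: tableName,
      Item: { jobId, index },
      ConditionExpression: 'attribute_not_exists(jobId)',
    }));
  } catch (err: any) {
    if (err.name !== 'ConditionalCheckFailedException') throw err;
  }
}
```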

Overall, can you rework the experiment idea / learnings approach a little?

Option 1: Do you simply assume that restarting the job will eventually lead to completion, and the learning is that this is suboptimal and you want to add checkpointing (Spot instance example)? In that case, update the hypothesis and learnings.

Option 2: Do you assume that the task is already capable of restarting but the source code contains a mistake that prevents that? In that case your code should already start with a continueAt() function that's broken, e.g. by ignoring the cont argument.
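
If Option 2 is chosen, a working continueAt() would roughly need to look up how far the previous run got before resuming. Here is a sketch under the assumptions of the batch-job code above (table keyed on jobId and a numeric index sort key; names are hypothetical); the "broken" variant rudpot describes would simply ignore this lookup and return 0.

```ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Returns the index the job should continue from: one past the highest index
// already written for this jobId, or 0 if nothing has been written yet.
export async function continueAt(tableName: string, jobId: string): Promise<number> {
  const result = await ddb.send(new QueryCommand({
    TableName: tableName,
    KeyConditionExpression: 'jobId = :j',
    ExpressionAttributeValues: { ':j': jobId },
    ScanIndexForward: false, // highest sort key (index) first
    Limit: 1,
  }));
  const last = result.Items?.[0]?.index;
  return typeof last === 'number' ? last + 1 : 0;
}
```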

Specific issues to address

 *   Fix “graceful degradation” throughout
 *   Validation procedure: "run task successfully" isn't a well-defined validation procedure, and the sequence of actions is not really clear. What is the workflow you expect the human to perform? Start the task, then (immediately or 10 minutes later) start the FIS experiment? Could the user be fast enough that there is no log record before the task gets killed? Could the user be slow enough that there is no longer a task, so the associated log record can't be seen? What constitutes "success"? Also, if the user lets it run to completion the first time, why does it work the second time (see the DB clear action in Learning and Improving)?

Cheers,

Rudolf

SH4DY commented 2 years ago

Hi @rudpot

thanks for your comments:

The nature of tasks is that they run for a limited amount of time, so the experiment is somewhat time-sensitive. I agree with you. I still believe it's a worthwhile learning for our readers to launch/interrupt/observe/improve such a task. We can easily increase the time the task needs to complete, which will give the reader plenty of time to interrupt the task and observe how it behaves on restart.

I addressed all your feedback.

Detailed changes:

  1. Changed wording and explanations throughout to "checkpointing"
  2. Experiment idea: The idea is that the job will already complete when restarted but it doesn't have any notion of checkpointing. When it's interrupted, it will start from index 0. The IMPROVEMENT is adding checkpointing. I have adjusted the writeup to clarify this.
  3. Added a note that the experiment is not designed to be run with multiple concurrent tasks. The improvement section contains ideas how parallelization can be introduced.
  4. Changed to CloudWatch console for progress/result checking everywhere
  5. Changed code samples in IMPROVEMENT section
  6. Provided reasoning for improvements (time, resource savings, cost)

Here is the compare: https://github.com/aws-samples/aws-fault-injection-simulator-workshop/compare/main...SH4DY:fix-274

Once I have your green light when it comes to the overall writeup and experiment idea, I will:

  1. Update screenshots
  2. Merge my code into the project and update some sections
  3. Insert links to code into writeup
  4. Address all TODO's