rudpot opened 2 years ago
Hi @rudpot, I'd be happy to add this section. I'll run through the workshop later today and I could then outline my ideas. Is that ok?
I see that the workshop contains a section about ECS, using the aws:ec2:terminate-instances action
on the underlying ASG of the ECS cluster. It then guides the reader to implement a minimum ASG size of 2 and to increase the task count.
I can think of two options to include stop-task:
Continuing after the FisWorkshopECS experiment, we set up another experiment with stop-task and let the reader observe how the website stays available when we kill 50% of the tasks belonging to the service. This is in line with the EC2 experiment.
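As a sketch of what that experiment template could look like (the exact schema and target parameter names should be double-checked against the FIS docs; the cluster name, service name, and role ARN below are placeholders):

```json
{
  "description": "Stop 50% of the tasks behind the ECS service",
  "targets": {
    "serviceTasks": {
      "resourceType": "aws:ecs:task",
      "parameters": {
        "cluster": "FisWorkshopCluster",
        "service": "FisWorkshopService"
      },
      "selectionMode": "PERCENT(50)"
    }
  },
  "actions": {
    "stopTasks": {
      "actionId": "aws:ecs:stop-task",
      "targets": { "Tasks": "serviceTasks" }
    }
  },
  "stopConditions": [{ "source": "none" }],
  "roleArn": "arn:aws:iam::123456789012:role/FisWorkshopRole"
}
```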
What do you think @rudpot?
Ideally we would demonstrate this on ECS/Fargate instead of ECS/EC2.
I think both are viable and the latter would demonstrate good resilience best practices but is more work. If you want to use some pre-existing materials, this might be helpful: https://github.com/gunnargrosch/serverless-chaos-demo-nodejs in conjunction with Gunnar's talk at conf42.
There is also the distinction of whether the task is part of a service (in line with the EC2 experiment) or a standalone worker task, so if you think there is a real-world use case where you might want to kill off a standalone task, that's an option.
Finally, if you want to include some commentary on ECS with Docker Compose, that could be interesting. I expect that this will just generate tasks with multiple containers attached, so it probably doesn't make a difference, but it's worth exploring.
Hey @rudpot, I agree we should use Fargate.
I thought about how to show a succinct but still realistic scenario with a worker task and wrote a few lines of code yesterday. Here is my idea: we provide a CDK app that defines a Fargate task with a little Node app, simulating a simple batch job. This batch job writes its results into DynamoDB. The reader starts the experiment to kill the task before it finishes. The expectation when restarting the task should be that it continues where it was stopped, instead of starting at 0.
I wrote a little app that just writes an incrementing number into DynamoDB until it reaches batchJobSize. We can let the reader inspect the progress on the DynamoDB console or through CloudWatch.
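A minimal sketch of that batch worker idea (the names runBatchJob and memoryStore are hypothetical, and the in-memory store stands in for DynamoDB so the sketch is self-contained; the real app would call the DynamoDB SDK instead):

```javascript
// Sketch of the batch job: count up to batchJobSize, persisting each
// step so a restarted task can resume instead of starting at 0.
async function runBatchJob(store, batchJobSize) {
  // Resume from the last checkpoint if the task was killed mid-run.
  const last = (await store.get('progress')) ?? 0;
  for (let i = last + 1; i <= batchJobSize; i++) {
    // In the real app this would be one DynamoDB PutItem per step.
    await store.put('progress', i);
  }
  return store.get('progress');
}

// In-memory stand-in for DynamoDB, for illustration only.
function memoryStore(initial = {}) {
  const data = { ...initial };
  return {
    get: async (k) => data[k],
    put: async (k, v) => { data[k] = v; },
  };
}
```

A fresh run starts at 0; a run against a store that already holds a checkpoint picks up from there, which is exactly what the reader should observe after the FIS experiment kills the task.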
Thoughts?
It's a good way to create visibility for the experiment. Here are a few thoughts/questions for implementing that:
Sure, tracking progress in CloudWatch works.
Regarding restart mechanisms, that was just one thought. If you do build in a restart it would need to be external to the task because the task is being killed.
Quick update here: I created a new chapter ECS with a subchapter ECS Tasks on the same level as ECS Service (= existing subchapter). ECS Tasks follows the "experiment - observe - fix/repeat" pattern. Will post the instructions for review later this week.
Still ToDo:
Would you mind putting this on a branch like sh4dy/fix-274
and pushing it up to the repo - that makes it easier to comment / collaborate.
Hi Rudolf!
Sure, I've put everything into branch fix-274
in my fork. You can check it here.
I could also create a PR against fix-274
in this repo, but you would need to create the branch for me. I don't think I have permissions for that.
Currently you will find the new chapter structure and instructions in fix-274
. I have not committed CDK code yet. There are quite a few file changes because I moved some folders around in the ECS chapter....
Hi @rudpot I made some more progress here.
Here is the compare with all commits: https://github.com/aws-samples/aws-fault-injection-simulator-workshop/compare/main...SH4DY:fix-274
Hi SH4DY!
I don't have a good way to comment inline, so I'm doing this by email. My biggest concern is that your workflow is sensitive to how quickly the workshop user starts the ECS tasks and the FIS experiments, and there are a lot of race condition edge cases that will leave the user with a bad experience. Please review and improve.
Amazon Elastic Container Service (ECS) is a fully managed container orchestration service that helps you easily deploy, manage, and scale containerized applications. It deeply integrates with the rest of the AWS platform to provide a secure and easy-to-use solution for running container workloads in the cloud and now on your infrastructure with Amazon ECS Anywhere.
In this section you will find two experiments. In the first you will spin up a web service on ECS and use FIS to explore and improve its resilience. In the second, you will work with an invoked ECS task that represents something like a recurring batch job that needs to be restarted by external automation on any failure. You will use FIS to simulate a failure by stopping the task, and use the learnings to improve overall performance on the task.
Note: what happens to your experiment design if multiple instances of the task are started concurrently? Probably BadThings™ - my suggestion would be to either make the invocation idempotent so concurrency is not a problem or to at least comment on the problems with the experiment design, maybe in the Learning&Improving section.
Overall can you rework the experiment idea / learnings approach a little? Option 1: Do you simply assume that restarting the job will eventually lead to completion and the learning is that this is suboptimal and you want to add checkpointing (Spot instance example)? In that case update the hypothesis and learnings. Option 2: Do you assume that the task is already capable of restarting but the source code contains a mistake that prevents that? In that case your code should already start with a continueAt() function that’s broken, e.g. by ignoring the cont argument.
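To illustrate Option 2, here is a sketch of what that deliberately broken helper could look like (continueAt is the name suggested above; nextBatchRange and the surrounding code are hypothetical):

```javascript
// Hypothetical sketch of "Option 2": the task ships with a helper that
// is meant to resume from a checkpoint but contains a deliberate bug
// for the reader to find - it ignores its `cont` argument.
function continueAt(cont) {
  // BUG (intentional, for the workshop): `cont` is ignored, so every
  // restart begins at 0. The fix the reader should discover:
  //   return cont ?? 0;
  return 0;
}

// The batch loop would use continueAt() to decide where to resume.
function nextBatchRange(checkpoint, batchJobSize) {
  const start = continueAt(checkpoint);
  return { start, end: batchJobSize };
}
```

With this setup the hypothesis ("the task resumes where it was stopped") fails on the first experiment run, and the Learning & Improving section walks the reader through fixing continueAt().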
Specific issues to address
* Fix “graceful degradation” throughout
* Validation procedure: “run task successfully” isn’t a well-defined validation procedure and the sequence of actions is not really clear. What is the workflow you expect the human to perform? Start task, (immediately or 10min later) start FIS experiment (could the user be fast enough that there is no log record before the task gets killed? Could the user be slow enough that there is no longer a task so the associated log record can’t be seen)? What constitutes “success”? Also if the user lets it run to completion the first time why does it work the second time (see the DB clear action in learning and improving)?
Cheers,
Rudolf
Hi @rudpot
thanks for your comments:
The nature of tasks is that they run for a limited amount of time, so the experiment is somewhat time sensitive. I agree with you. I still believe it's a worthwhile learning for our readers to launch/interrupt/observe/improve such a task. We can easily increase the time needed for the task to complete. That will give the reader plenty of time to interrupt the task and observe how it behaves on restart.
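For example, the job duration could be made configurable so the workshop can tune how long the reader has to start the FIS experiment (the environment variable names here are hypothetical):

```javascript
// Read job size and per-step delay from the task's environment, with
// defaults, so the workshop can slow the batch job down as needed.
const batchJobSize = Number(process.env.BATCH_JOB_SIZE ?? 100);
const stepDelayMs = Number(process.env.STEP_DELAY_MS ?? 2000);

// With these defaults the job runs for roughly
// batchJobSize * stepDelayMs ms = about 200 seconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
```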
I addressed all your feedback.
Detailed changes:
Here is the compare: https://github.com/aws-samples/aws-fault-injection-simulator-workshop/compare/main...SH4DY:fix-274
Once I have your green light when it comes to the overall writeup and experiment idea, I will:
Hi All, the FIS service and console have launched a new action, aws:ecs:stop-task (ECS Stop Tasks), in all commercial regions - see documentation below!
User Guide
https://docs.aws.amazon.com/fis/latest/userguide/getting-started-iam-service-role.html
https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html#ecs-actions-reference
https://docs.aws.amazon.com/fis/latest/userguide/targets.html
https://docs.aws.amazon.com/fis/latest/userguide/using-service-linked-roles.html
https://docs.aws.amazon.com/fis/latest/userguide/security-iam-awsmanpol.html
API Reference
https://docs.aws.amazon.com/fis/latest/APIReference/API_Operations.html