There are a few nuisances with our current worker strategy that I think would be helped by moving most of what we've done to an orchestration system like ECS:
1. Log streams are currently generated per instance, so the logs from all N workers get interleaved, which makes it hard to find errors.
2. When a worker crashes, it never gets replaced.
3. Updating the container images is a right pain: `docker kill`, `docker rm`, `docker pull`, `find / -name part-001`, then run the cloud-init script at `/path/to/part-001`, versus pushing a new launch template and having fresh images in a couple of minutes.
4. The ASGs have ugly names, like `tf-asg-tf-serratus-dl-20200304125312000001`, which means we have to "discover" the names before we can adjust desired sizes. This is currently necessary so that all instances get replaced when we change the `user_data` in the launch configuration. ECS would deal with sending the correct arguments to our scripts.
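To make the "discover the names" step concrete, here is a minimal sketch of what it looks like with boto3. The prefix `tf-asg-tf-serratus-dl-` is an assumption taken from the example name above, and the helper names (`matching_asg_names`, `discover_asg_names`, `set_desired_size`) are hypothetical, not something that exists in the repo.

```python
from typing import Iterable, List


def matching_asg_names(names: Iterable[str], prefix: str) -> List[str]:
    """Return the ASG names that start with the given prefix, sorted."""
    return sorted(name for name in names if name.startswith(prefix))


def discover_asg_names(prefix: str) -> List[str]:
    """List every ASG in the current account/region and keep the prefix matches."""
    import boto3  # imported lazily so the pure helper above works without AWS

    client = boto3.client("autoscaling")
    paginator = client.get_paginator("describe_auto_scaling_groups")
    names: List[str] = []
    for page in paginator.paginate():
        names.extend(g["AutoScalingGroupName"] for g in page["AutoScalingGroups"])
    return matching_asg_names(names, prefix)


def set_desired_size(prefix: str, desired: int) -> None:
    """Set the desired capacity on every ASG whose name matches the prefix."""
    import boto3

    client = boto3.client("autoscaling")
    for name in discover_asg_names(prefix):
        client.set_desired_capacity(AutoScalingGroupName=name, DesiredCapacity=desired)


if __name__ == "__main__":
    # Assumed prefix; the timestamp suffix changes whenever the ASG is recreated.
    set_desired_size("tf-asg-tf-serratus-dl-", desired=4)
```

With stable service names under ECS, the lookup step would presumably disappear and only the "set the size" call would remain.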
There are a couple of things to work out first, though:
- [ ] Will we use the daemon or the replica scheduling strategy? Daemon doesn't solve 1, but replica doesn't solve 4. We need a way to force all images to be replaced if we change them.