DistributedScience / Distributed-Something

Run encapsulated docker containers that do... something in the Amazon Web Services infrastructure.
https://distributedscience.github.io/Distributed-Something

Auto-monitor as option #23

Closed bethac07 closed 1 year ago

bethac07 commented 1 year ago

One obvious downside of the monitor is that it needs to be running to work, so a) you have to remember to run it and b) if the machine it's on goes down, it's not running anymore.

In general, we had rejected using Lambda functions for the monitor because they can only run for 15 minutes. In theory, though, if we had an existing monitor Lambda function, each DS "startCluster" step could start a cron job, carrying the monitor file parameters, that triggers that Lambda every (1, 5, etc.) minutes. The Lambda would check the designated stuff: if everything is still running, it does nothing; when the run is done, it cleans everything up (which takes less than 15 minutes), including the cron job.
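A minimal sketch of the two pieces of logic this would need, assuming the cron job is expressed as an EventBridge-style `rate(...)` schedule and that "done" means the queue has no visible or in-flight messages. The function names `schedule_expression` and `monitor_action` are hypothetical and not part of DS; the actual AWS calls (creating the rule, reading queue attributes, tearing down) are deliberately left out.

```python
def schedule_expression(minutes: int) -> str:
    """Build a rate expression; the unit is singular only when the value is 1."""
    if minutes < 1:
        raise ValueError("minutes must be >= 1")
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"


def monitor_action(visible: int, in_flight: int) -> str:
    """Decide what a single scheduled invocation should do.

    visible / in_flight are the queue's approximate message counts
    (assumption: an empty queue is our signal that the run is finished).
    """
    if visible > 0 or in_flight > 0:
        return "wait"      # jobs still queued or running: do nothing
    return "teardown"      # queue drained: clean up, including the schedule itself
```

Each invocation is short and stateless, so the 15-minute Lambda limit only has to cover a single check plus, on the last call, the cleanup.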

I think we would want this as optional, for two major reasons

(Personally, I think there is one additional, unquantifiable benefit to having monitor be a step the user executes: it reinforces to users that teardown is a thing that needs to happen, rather than having them just blindly trust that it has. Even the best-written, most-debugged auto-teardown code (whether written by us or an Amazon native service) is going to have a day where it eventually just barfs, so I would rather have the responsibility for teardown clearly placed where it belongs: on the person who spun it up in the first place. But that might be a "me shaking my fist at kids these days, just blindly trusting their stuff will work; in my day, nothing was automated and we checked things by hand, uphill in the snow both ways, etc.")

What do you think @ErinWeisbart?

bethac07 commented 1 year ago

Would probably want to be doing #2 at the same time, because this will be annoying for users to set up

bethac07 commented 1 year ago

@ErinWeisbart and I remembered that one of the two of us (which one is unclear) had already thought this through as part of AuSPICES nine months ago, and that we realized at the time that a very nice way to trigger it is as an alarm on the queue, because then there's no need for ongoing checks. It also requires uploading the monitor file to the bucket.
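As a rough illustration of the alarm-on-the-queue idea, the sketch below only builds the keyword arguments for a CloudWatch alarm on an SQS queue; it does not call AWS. The helper name `queue_empty_alarm_kwargs`, the alarm naming scheme, and the period and evaluation counts are all assumptions, not what AuSPICES or DS actually uses.

```python
def queue_empty_alarm_kwargs(queue_name: str, sns_topic_arn: str) -> dict:
    """Arguments for an alarm that fires once the queue has stayed empty."""
    return {
        "AlarmName": f"{queue_name}_empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,               # seconds per evaluation window (assumed)
        "EvaluationPeriods": 3,      # consecutive empty windows required (assumed)
        "Threshold": 0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that triggers the monitor Lambda
    }
```

The resulting dict could be passed to `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`. Note that this watches only visible messages; a real version would probably also need to account for in-flight jobs (`ApproximateNumberOfMessagesNotVisible`) before tearing anything down.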

Ongoing checks are nice for the auto-downscaling, so we might think about doing some back-of-the-envelope calculations of how much a "lambda every X minutes for Y time" might cost vs. an "alarm for Y time", assuming we knock off 5% of the compute costs with an auto-downscale, but certainly alarms are more elegant. This means 2 extra steps in the initial setup (an SNS topic and a lambda), so again, we likely want to do #2.
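The back-of-the-envelope comparison could be set up along these lines. This is only a sketch: both function names are hypothetical, every price is a parameter to be filled in from current AWS pricing, and the numbers in the usage lines are placeholders rather than real rates.

```python
def lambda_polling_cost(interval_min, run_hours, seconds_per_call,
                        memory_gb, price_per_gb_second, price_per_request):
    """Cost of invoking the monitor Lambda every interval_min for run_hours."""
    calls = int(run_hours * 60 / interval_min)
    compute = calls * seconds_per_call * memory_gb * price_per_gb_second
    return compute + calls * price_per_request


def net_benefit_of_polling(compute_cost, downscale_fraction,
                           polling_cost, alarm_cost):
    """Positive result: polling plus auto-downscale beats the alarm-only setup."""
    savings = compute_cost * downscale_fraction
    return savings - (polling_cost - alarm_cost)


# Placeholder inputs only; substitute real prices before drawing conclusions.
polling = lambda_polling_cost(interval_min=5, run_hours=10, seconds_per_call=2,
                              memory_gb=0.5, price_per_gb_second=0.00002,
                              price_per_request=0.000001)
benefit = net_benefit_of_polling(compute_cost=100, downscale_fraction=0.05,
                                 polling_cost=polling, alarm_cost=0.1)
```

Keeping the 5% downscale figure as an explicit `downscale_fraction` argument makes it easy to see how sensitive the answer is to that guess.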