Auto-monitor as option

bethac07 commented 1 year ago

One obvious downside of the monitor is that it needs to be running to work, so a) you have to remember to run it and b) if the machine it's on goes down, it's not running anymore.

In general, we had rejected using Lambdas for the monitor because they can run for at most 15 minutes. In theory, though, if we had an existing monitor Lambda function, each DS "startCluster" step could start a cron job, configured with the monitor file parameters, that triggers that Lambda every N minutes (1, 5, etc.). Each time, the Lambda would check the designated resources: if everything is still running it does nothing, and once the work is done it cleans everything up (which takes less than 15 minutes), including the cron job itself.
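A minimal sketch of that idea, not an actual DS implementation. It assumes the "cron job" is an EventBridge scheduled rule and that the boto3 clients are passed in (for example `boto3.client("events")`). The function names, the `params` keys, and the `"monitor"` target ID are hypothetical, and the cleanup shown covers only the spot fleet and the rule, not everything the real monitor tears down.

```python
import json


def schedule_monitor(events, rule_name, lambda_arn, monitor_params, every_minutes=1):
    """Create the scheduled rule (the "cron job") that invokes the monitor Lambda."""
    unit = "minute" if every_minutes == 1 else "minutes"
    events.put_rule(
        Name=rule_name,
        ScheduleExpression=f"rate({every_minutes} {unit})",
        State="ENABLED",
    )
    # The monitor file parameters travel with the rule as the Lambda's input event.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "monitor", "Arn": lambda_arn, "Input": json.dumps(monitor_params)}],
    )


def queue_is_empty(sqs, queue_url):
    """True when the queue reports no visible and no in-flight messages."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    return int(attrs["ApproximateNumberOfMessages"]) + int(
        attrs["ApproximateNumberOfMessagesNotVisible"]
    ) == 0


def monitor_tick(sqs, ec2, events, params):
    """One scheduled check: do nothing while jobs remain, otherwise clean up."""
    if not queue_is_empty(sqs, params["queue_url"]):
        return "running"
    ec2.cancel_spot_fleet_requests(
        SpotFleetRequestIds=[params["spot_fleet_id"]], TerminateInstances=True
    )
    # A rule's targets must be removed before the rule itself can be deleted.
    events.remove_targets(Rule=params["rule_name"], Ids=["monitor"])
    events.delete_rule(Name=params["rule_name"])
    return "cleaned_up"
```

The Lambda handler would call `monitor_tick` with the event it receives as `params`. Granting the rule permission to invoke the Lambda is omitted here, and because the SQS counts are approximate, a real version would probably want to see the queue empty on several consecutive checks before tearing anything down.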

I think we would want this to be optional, for two major reasons:

1. Lambdas do cost money. If all we're doing is checking the state of the queue and the spot fleet, it should be able to run pretty quickly and on the smallest possible machine, but I haven't yet back-of-the-enveloped the expected costs. I can't imagine they'd be thousands of dollars, but they might not be zero either.
2. Sometimes it's useful to be able to temporarily shut the monitor off. Maybe others always start their jobs perfectly on the first try, but sometimes on mine I realize there's something bad going on and need to empty the queues and reboot the Dockers, without actually wanting to do a full-on infrastructure cleanup and re-deployment. That's easy when it's just Ctrl+C, but harder for a cron job.
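On the second point, one possible mitigation, assuming the cron job were implemented as an EventBridge scheduled rule: the rule can be disabled and re-enabled without deleting it. A hedged sketch, where `set_monitor_paused` and the rule name are hypothetical and `events` is assumed to be a boto3 EventBridge client:

```python
def set_monitor_paused(events, rule_name, paused):
    """Pause or resume the scheduled monitor without tearing anything down."""
    if paused:
        # A disabled rule stops triggering its targets but keeps its configuration.
        events.disable_rule(Name=rule_name)
    else:
        events.enable_rule(Name=rule_name)
```

This is still one more command to remember than Ctrl+C, so it narrows the gap rather than closing it.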

(Personally, I think there is one additional, unquantifiable benefit to having the monitor be a step the user executes: reinforcing to users that teardown is a thing that needs to happen, rather than having them just blindly trust that it has. Even the best-written, most-debugged auto-teardown code (whether written by us or an Amazon native service) is eventually going to have a day where it just barfs, so I would rather have the responsibility for teardown clearly placed where it belongs, on the person who spun it up in the first place. But that might be a "me shaking my fist at kids these days, just blindly trusting their stuff will work, in my day, nothing was automated and we checked things by hand, uphill in the snow both ways, etc".)

What do you think @ErinWeisbart?

bethac07 commented 1 year ago

We would probably want to do #2 at the same time, because this will be annoying for users to set up.

bethac07 commented 1 year ago

@ErinWeisbart and I remembered that one of the two of us (it's unclear which) had already thought this through as part of AuSPICES nine months ago, and that we realized at the time that a very nice way to trigger it is as an alarm on the queue, because then there's no need for ongoing checks. It also requires uploading the monitor file to the bucket.
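A rough sketch of what such a queue alarm could look like, not the AuSPICES code. It assumes a CloudWatch alarm on the sum of the visible and in-flight SQS message metrics, notifying an SNS topic that the cleanup Lambda subscribes to. The function name, the alarm name, and the 15-minute default are hypothetical choices.

```python
def build_queue_empty_alarm(queue_name, sns_topic_arn, minutes_empty=15):
    """Build put_metric_alarm arguments that fire once the queue stays empty."""

    def sqs_metric(metric_id, metric_name):
        # One-minute maximum of a single SQS queue metric, used only as an input.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SQS",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
                },
                "Period": 60,
                "Stat": "Maximum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{queue_name}-empty",
        "Metrics": [
            sqs_metric("visible", "ApproximateNumberOfMessagesVisible"),
            sqs_metric("in_flight", "ApproximateNumberOfMessagesNotVisible"),
            # The alarm evaluates this expression: waiting plus in-progress jobs.
            {
                "Id": "total",
                "Expression": "visible + in_flight",
                "Label": "TotalMessages",
                "ReturnData": True,
            },
        ],
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "Threshold": 0,
        "EvaluationPeriods": minutes_empty,
        "AlarmActions": [sns_topic_arn],
    }
```

It would be applied with something like `boto3.client("cloudwatch").put_metric_alarm(**build_queue_empty_alarm(name, topic_arn))`. Since the alarm notification carries no job parameters of its own, the Lambda would need to read the monitor file from the bucket, which is presumably why the upload is required.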

Ongoing checks are nice for the auto-downscaling, so we might want to do some back-of-the-envelope calculations of how much a "Lambda every X minutes for Y time" might cost versus an "alarm for Y time", assuming the auto-downscale knocks 5% off the compute costs. Alarms are certainly more elegant, though. This means 2 extra steps in the initial setup (an SNS topic and a Lambda), so again, we likely want to do #2.
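A sketch of that back-of-the-envelope calculation. The default prices (USD 0.20 per million requests and USD 0.0000166667 per GB-second) are assumptions based on commonly published Lambda rates and should be checked against current pricing for the region. The 2-second check duration and 128 MB memory size are guesses, and the free tier and the cost of the alarm itself are ignored.

```python
def polling_lambda_cost(
    interval_minutes,
    run_hours,
    seconds_per_check=2.0,
    memory_mb=128,
    usd_per_million_requests=0.20,
    usd_per_gb_second=0.0000166667,
):
    """Estimated USD cost of invoking a Lambda every X minutes for Y hours."""
    invocations = run_hours * 60 / interval_minutes
    gb_seconds = invocations * seconds_per_check * (memory_mb / 1024)
    request_cost = invocations * usd_per_million_requests / 1_000_000
    return request_cost + gb_seconds * usd_per_gb_second


def downscale_savings(compute_cost_usd, fraction_saved=0.05):
    """Compute cost avoided if auto-downscaling trims the given fraction."""
    return compute_cost_usd * fraction_saved


# Example: check every minute for a 24-hour run that costs USD 100 in compute.
lambda_cost = polling_lambda_cost(interval_minutes=1, run_hours=24)
savings = downscale_savings(100.0)
print(f"polling Lambda: ${lambda_cost:.4f}, downscale savings: ${savings:.2f}")
```

Under these assumed rates, a check every minute for 24 hours comes out to well under one US cent, so if the 5% figure is anywhere near right, the polling cost would be negligible next to the savings.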

DistributedScience / Distributed-Something

Auto-monitor as option #23