Use initialized EBS storage instead of buffer machines

sanderegg commented 4 months ago

Concept

Instead of keeping running EC2 instances as buffer machines, we would only keep their respective EBS volumes.

Needed changes

AMI:

- the current boot script automatically mounts the docker folder on the largest disk it finds; it needs to change to target EBS volumes only
- the AMI needs to set up an EBS disk
- large EC2s currently come with a larger disk (up to 3.4TB, free of charge and available to the users). Do we need an equivalent EBS volume? If yes, we need to parametrize this and define which sizes are needed
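To illustrate the "only target EBS" change, here is a minimal sketch of how a boot script could select the docker disk. It assumes a Nitro-based instance, where `lsblk -J -b -o NAME,SIZE,TYPE,MODEL` reports EBS volumes with the model string `Amazon Elastic Block Store` and instance-store disks with a different model. The function name `pick_docker_disk` and the `exclude` parameter (meant for the root disk) are hypothetical, not taken from the actual AMI script.

```python
import json
from typing import FrozenSet, Optional

# Model string reported by EBS volumes exposed as NVMe devices (Nitro instances).
EBS_MODEL = "Amazon Elastic Block Store"


def pick_docker_disk(
    lsblk_json: str, exclude: FrozenSet[str] = frozenset()
) -> Optional[str]:
    """Return the name of the largest EBS-backed disk, or None if none exists.

    `lsblk_json` is the output of `lsblk -J -b -o NAME,SIZE,TYPE,MODEL`.
    `exclude` lists device names to ignore (e.g. the root disk); this
    parameter is an assumption made for the sketch.
    """
    devices = json.loads(lsblk_json)["blockdevices"]
    candidates = [
        d
        for d in devices
        if d.get("type") == "disk"
        and (d.get("model") or "").strip() == EBS_MODEL
        and d["name"] not in exclude
    ]
    if not candidates:
        return None
    # Instance-store disks are never candidates, even when they are larger.
    return max(candidates, key=lambda d: int(d["size"]))["name"]
```

With this filter, a 3.4TB instance-store disk is skipped even though it is the largest device, which is the behaviour change described above.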
Autoscaling:
- when an EC2 is started it receives the "UserData" script, which "pre-pulls" large docker images such as s4l
- instead of keeping X running buffer machines, the autoscaling would stop them and terminate the ones above the buffer number:
  - it must wait until the pre-pulling has completed before stopping the machines. I think there is no way to know that other than SSH, which is not very nice; investigate some other way (maybe start with a hard-coded delay)
  - it must handle the data on the disks (images will accumulate over time and fill the disk), how?
  - if we only stop instances, then we need to keep track of the available stopped instances. What is the advantage over shutting them down?
  - we need to keep track of the created volumes and ensure they do not accumulate
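The stop/terminate decision described above could be sketched as a pure planning function. This is only an illustration under assumptions: the `BufferInstance` type, its `prepull_done` flag (however it ends up being detected) and `plan_buffer_actions` are hypothetical names, not part of the autoscaling service.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class BufferInstance:
    instance_id: str
    state: str  # "running" or "stopped"
    prepull_done: bool  # hypothetical flag: pre-pulling has completed


def plan_buffer_actions(
    instances: Sequence[BufferInstance], buffer_size: int
) -> Tuple[List[str], List[str]]:
    """Return (ids to stop, ids to terminate) for the given buffer size."""
    stopped = [i for i in instances if i.state == "stopped"]
    # Only machines that finished pre-pulling may be stopped or terminated;
    # machines still pulling are left alone until a later round.
    ready = [i for i in instances if i.state == "running" and i.prepull_done]

    free_slots = max(buffer_size - len(stopped), 0)
    to_stop = [i.instance_id for i in ready[:free_slots]]
    to_terminate = [i.instance_id for i in ready[free_slots:]]
    # Stopped machines above the buffer number are terminated as well.
    to_terminate += [i.instance_id for i in stopped[buffer_size:]]
    return to_stop, to_terminate
```

Keeping the decision separate from the AWS calls makes it easy to test the bookkeeping without touching EC2.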

### Eisbock
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6227
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6230
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6238
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6242
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6250
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6251
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6252
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/6299
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6254
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/6314

### Tasks
- [x] Modify AMI boot script to optionally skip instance storages
- [x] Have only 1 AMI with 500GB additional EBS disk?
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/6032
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5923
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/6097
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6045
- [x] Tune up disks (throughput + IOPS) on root drive and docker drive --> using maxed out GP3s

matusdrobuliak66 commented 4 months ago

https://depot.dev/blog/faster-ec2-boot-time

sanderegg commented 4 months ago

Regarding autoscaling, I currently see 2 options:

pre-create complete EC2s:

- start a buffer machine of a cheap type (such as t3.medium or so)
- ensure startup is complete (such as pre-pulling), using SSH, AWS SSM or a hard-coded time
- stop the machine (only the EBS disks are still billed: 8GB root + 500GB docker)
- when a new machine is needed, first check whether a stopped buffer machine is available; if yes, set the correct instance type and start it
- when the machine is not needed anymore, instead of shutting it down, it can be passed to the buffer handler
- we might need to ensure the disk is cleaned between runs
- we need to monitor the EBS volumes/stopped machines and possibly remove them
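The "check for a stopped buffer machine, set the type, start it" step might look like the sketch below. It only plans the calls: the action names `modify_instance_attribute` and `start_instances` mirror the boto3 EC2 client methods (the instance type can only be changed while the instance is stopped), while `plan_machine_request` and the `launch_new_instance` fallback are hypothetical names introduced for illustration.

```python
from typing import Any, Dict, List, Sequence, Tuple

Action = Tuple[str, Dict[str, Any]]


def plan_machine_request(
    stopped_buffer_ids: Sequence[str], wanted_type: str
) -> List[Action]:
    """Plan the EC2 calls needed to provide one machine of `wanted_type`."""
    if not stopped_buffer_ids:
        # No buffer available: fall back to the regular launch path
        # (hypothetical action name, handled by the caller).
        return [("launch_new_instance", {"InstanceType": wanted_type})]

    instance_id = stopped_buffer_ids[0]
    return [
        # Retype the stopped buffer machine to the requested instance type...
        (
            "modify_instance_attribute",
            {"InstanceId": instance_id, "InstanceType": {"Value": wanted_type}},
        ),
        # ...then start it with its pre-initialized EBS disks attached.
        ("start_instances", {"InstanceIds": [instance_id]}),
    ]
```

A caller could dispatch the first two action kinds to a boto3 EC2 client using the given keyword arguments; how the fallback is handled is left open here.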

only keep initialized EBS volumes

- start a buffer machine of a cheap type
- ensure startup is complete
- shut down the machine but keep its EBS volume
- when a new machine is needed, first check whether free EBS volumes are available; if yes, use them
- monitor the EBS volumes
- handle cleanup of the volumes
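For the monitoring and cleanup steps, one possible policy is to keep at most a fixed number of free volumes and drop any that are too old (stale pre-pulled images). The sketch below assumes this policy; `BufferVolume`, `volumes_to_delete`, `max_count` and `max_age` are hypothetical names and parameters, not settings that exist in the project.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class BufferVolume:
    volume_id: str
    created: datetime
    attached: bool  # True while an instance is using the volume


def volumes_to_delete(
    volumes: Sequence[BufferVolume],
    max_count: int,
    max_age: timedelta,
    now: datetime,
) -> List[str]:
    """Return ids of free volumes that exceed the count or age limits."""
    # Attached volumes are in use and never considered for deletion.
    free = sorted(
        (v for v in volumes if not v.attached),
        key=lambda v: v.created,
        reverse=True,  # newest first, so the newest volumes are kept
    )
    keep_ids = {
        v.volume_id for v in free[:max_count] if now - v.created <= max_age
    }
    return [v.volume_id for v in free if v.volume_id not in keep_ids]
```

Running such a check periodically would keep the created volumes from accumulating.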

sanderegg commented 1 month ago

User story

Prepare background on how this new system works
Show measurements and how they are changing
Show costs and how they are changing

sanderegg commented 1 month ago

Create a graph of responsiveness vs costs for:

current buffer system
new EBS buffer system

ITISFoundation / osparc-simcore

Use initialized EBS storage instead of buffer machines #5864