Closed sanderegg closed 1 month ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 80.6%. Comparing base (cafbf96) to head (a73aef9). Report is 206 commits behind head on master.
Sorry, what does `missing autoscaling ENV variable: EC2_INSTANCES_TIME_BEFORE_TERMINATION` in your PR description mean? Should this variable be removed?
@mrnicegyu11 that variable was in osparc-config but not in the docker-compose in osparc-simcore. Weird, not the end of the world, but now it is in.
What do these changes do?
This PR fixes https://github.com/ITISFoundation/osparc-simcore/issues/4880.
Details
The autoscaling service creates EC2 instances on demand when it detects services with the correct service labels that are missing resources. Normally the EC2 instances are started and they join either the simcore docker swarm (dynamic mode) or the computational cluster docker swarm (computational mode). Usually this process takes about 30 seconds. Sometimes an EC2 instance fails to join the swarm, either because the instance itself is broken (e.g. AWS's own checks are failing) or because a networking issue arises. Such instances stay there forever until manual intervention. They also tend to block the user from running services, as the autoscaling still "thinks" the instance will eventually join the party.
This PR now leverages the `EC2_INSTANCES_MAX_START_TIME` ENV variable: if an EC2 instance takes longer than this time to join the swarm, it is terminated right away, and if a service is still waiting, another instance is created through the usual process.
IMPORTANT NOTE: the time it usually takes to start a machine and join the swarm is about 30 seconds. Nevertheless the value used now is 1 minute and shall remain that way so that we do not start terminating EC2s that took slightly more than 30 seconds.
Driving tests: `test_long_pending_ec2_is_detected_as_broken_terminated_and_restarted` in services/autoscaling/tests/unit/test_modules_auto_scaling_computational.py and services/autoscaling/tests/unit/test_modules_auto_scaling_dynamic.py
Related issue/s
How to test
Dev-ops checklist
EC2_INSTANCES_MAX_START_TIME
WORKERS_EC2_INSTANCES_MAX_START_TIME
EC2_INSTANCES_TIME_BEFORE_TERMINATION
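
For reviewers, the timeout rule described under Details can be illustrated with a minimal sketch. This is not the autoscaling service's actual code: `PendingInstance` and `find_broken_instances` are hypothetical names, and the 1 minute default simply mirrors the value mentioned above. It only shows the idea that an instance pending longer than the maximum start time is treated as broken and becomes a candidate for termination.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: mirrors the 1 minute value mentioned in the PR description.
EC2_INSTANCES_MAX_START_TIME = timedelta(minutes=1)


@dataclass(frozen=True)
class PendingInstance:
    """Hypothetical stand-in for an EC2 instance that has not joined the swarm yet."""

    instance_id: str
    launch_time: datetime


def find_broken_instances(
    pending: list[PendingInstance],
    now: datetime,
    max_start_time: timedelta = EC2_INSTANCES_MAX_START_TIME,
) -> list[PendingInstance]:
    """Return the pending instances that exceeded the allowed start time."""
    return [
        instance
        for instance in pending
        if now - instance.launch_time > max_start_time
    ]


# Usage: one instance is merely slow (45 s), the other is stuck (5 min).
now = datetime(2023, 10, 1, 12, 0, 0, tzinfo=timezone.utc)
pending = [
    PendingInstance("i-slow", now - timedelta(seconds=45)),
    PendingInstance("i-stuck", now - timedelta(minutes=5)),
]
broken = find_broken_instances(pending, now)
```

With the 1 minute limit, only the instance pending for 5 minutes is flagged; the one that needed 45 seconds is left alone, which is the reason for keeping the limit above the typical 30 second start time.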