ITISFoundation / osparc-simcore

🐼 osparc-simcore simulation framework
https://osparc.io
MIT License

✨Autoscaling: terminate long pending EC2s #5832

Closed sanderegg closed 1 month ago

sanderegg commented 1 month ago

What do these changes do?

This PR fixes https://github.com/ITISFoundation/osparc-simcore/issues/4880.

Details

The autoscaling service creates EC2 instances on demand when it detects services with the correct service labels that are missing resources. Normally the EC2 instances start and join either the simcore docker swarm (dynamic mode) or the computational cluster docker swarm (computational mode). This process usually takes about 30 seconds. Sometimes, however, an EC2 instance fails to join the swarm, for example because the instance is broken (e.g. AWS's own checks are failing) or because a networking issue arises. Such instances stay there forever until someone intervenes manually. They also tend to block the user from running services, as the autoscaling still "thinks" the instance will eventually join the party.

This PR now leverages the EC2_INSTANCES_MAX_START_TIME ENV variable: if an EC2 instance takes longer than this time to join the swarm, it is terminated right away, and if a service is still waiting, another instance is created through the usual process. IMPORTANT NOTE: starting a machine and joining the swarm usually takes about 30 seconds. Nevertheless, the value used now is 1 minute and shall remain that way, so that we do not start terminating EC2s that took slightly more than 30 seconds.
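For readers who want the rule in code form, here is a minimal sketch of the timeout check described above. It is not the code from this PR: `PendingInstance`, `split_pending_instances`, the instance IDs and the timestamps are hypothetical names and values chosen for illustration, and only the 1 minute limit comes from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: mirrors the 1 minute value mentioned in the PR description.
EC2_INSTANCES_MAX_START_TIME = timedelta(minutes=1)


@dataclass(frozen=True)
class PendingInstance:
    """Hypothetical stand-in for an EC2 instance that has not joined the swarm yet."""

    instance_id: str
    launch_time: datetime


def split_pending_instances(pending, now, max_start_time=EC2_INSTANCES_MAX_START_TIME):
    """Split pending instances into (still_starting, broken).

    An instance counts as broken once it has been pending for longer
    than max_start_time; those are the candidates for termination.
    """
    still_starting, broken = [], []
    for instance in pending:
        if now - instance.launch_time > max_start_time:
            broken.append(instance)
        else:
            still_starting.append(instance)
    return still_starting, broken


# Arbitrary reference time and made-up instances, for illustration only.
now = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
pending = [
    PendingInstance("i-fresh", now - timedelta(seconds=35)),
    PendingInstance("i-stuck", now - timedelta(minutes=5)),
]
still_starting, broken = split_pending_instances(pending, now)
print([instance.instance_id for instance in broken])
```

In this sketch the instance that has been pending for 35 seconds is left alone, which is the reason given above for keeping the limit at 1 minute rather than 30 seconds. Terminating the broken instances and requesting replacements for services that are still waiting is left out.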

Driving tests:

test_long_pending_ec2_is_detected_as_broken_terminated_and_restarted in services/autoscaling/tests/unit/test_modules_auto_scaling_computational.py and services/autoscaling/tests/unit/test_modules_auto_scaling_dynamic.py

Related issue/s

How to test

Dev-ops checklist

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 80.6%. Comparing base (cafbf96) to head (a73aef9). Report is 206 commits behind head on master.

Additional details and impacted files

[![Impacted file tree graph](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5832/graphs/tree.svg?width=650&height=150&src=pr&token=h1rOE8q7ic&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation)](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5832?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation)

```diff
@@            Coverage Diff            @@
##           master   #5832      +/-   ##
=========================================
- Coverage    84.5%   80.6%    -4.0%
=========================================
  Files          10    1366    +1356
  Lines         214   56724   +56510
  Branches       25    1284    +1259
=========================================
+ Hits          181   45755   +45574
- Misses         23   10702   +10679
- Partials       10     267     +257
```

| [Flag](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5832/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation) | Coverage Δ | |
|---|---|---|
| [integrationtests](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5832/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation) | `58.4% <ø> (?)` | |
| [unittests](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5832/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation) | `79.7% <100.0%> (-4.9%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5832?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation) | Coverage Δ | |
|---|---|---|
| [...g/src/simcore\_service\_autoscaling/core/settings.py](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5832?src=pr&el=tree&filepath=services%2Fautoscaling%2Fsrc%2Fsimcore_service_autoscaling%2Fcore%2Fsettings.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#diff-c2VydmljZXMvYXV0b3NjYWxpbmcvc3JjL3NpbWNvcmVfc2VydmljZV9hdXRvc2NhbGluZy9jb3JlL3NldHRpbmdzLnB5) | `100.0% <100.0%> (ø)` | |
| [...oscaling/src/simcore\_service\_autoscaling/models.py](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5832?src=pr&el=tree&filepath=services%2Fautoscaling%2Fsrc%2Fsimcore_service_autoscaling%2Fmodels.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#diff-c2VydmljZXMvYXV0b3NjYWxpbmcvc3JjL3NpbWNvcmVfc2VydmljZV9hdXRvc2NhbGluZy9tb2RlbHMucHk=) | `100.0% <100.0%> (ø)` | |
| [...e\_service\_autoscaling/modules/auto\_scaling\_core.py](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5832?src=pr&el=tree&filepath=services%2Fautoscaling%2Fsrc%2Fsimcore_service_autoscaling%2Fmodules%2Fauto_scaling_core.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#diff-c2VydmljZXMvYXV0b3NjYWxpbmcvc3JjL3NpbWNvcmVfc2VydmljZV9hdXRvc2NhbGluZy9tb2R1bGVzL2F1dG9fc2NhbGluZ19jb3JlLnB5) | `94.8% <100.0%> (ø)` | |
| [...c/simcore\_service\_clusters\_keeper/core/settings.py](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5832?src=pr&el=tree&filepath=services%2Fclusters-keeper%2Fsrc%2Fsimcore_service_clusters_keeper%2Fcore%2Fsettings.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#diff-c2VydmljZXMvY2x1c3RlcnMta2VlcGVyL3NyYy9zaW1jb3JlX3NlcnZpY2VfY2x1c3RlcnNfa2VlcGVyL2NvcmUvc2V0dGluZ3MucHk=) | `96.1% <ø> (ø)` | |

... and [1338 files with indirect coverage changes](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5832/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation)

sonarcloud[bot] commented 1 month ago

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

sanderegg commented 1 month ago

> sorry, what does "missing autoscaling ENV variable: EC2_INSTANCES_TIME_BEFORE_TERMINATION" in your PR description mean? Should this variable be removed?

@mrnicegyu11 that variable was in osparc-config but not in the docker-compose in osparc-simcore. Weird, not the end of the world, but now it is in.