ITISFoundation / osparc-simcore

🐼 osparc-simcore simulation framework
https://osparc.io
MIT License

✨Autoscaling: Drain node before terminating #5846

Closed sanderegg closed 1 month ago

sanderegg commented 1 month ago

What do these changes do?

After the changes that stopped draining docker nodes, a new issue arose: container IPs are not properly returned to the docker swarm (https://github.com/ITISFoundation/osparc-ops-environments/issues/665).

It is therefore preferable to drain a node before terminating it.
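For readers unfamiliar with draining: in the Docker Engine API a swarm node's spec carries an `Availability` field (`active`, `pause` or `drain`), and draining amounts to updating that field. The following is only a minimal sketch of that idea, not the code from this PR; the helper name `with_drain_availability` is hypothetical, and the call that actually sends the spec to the docker engine is left out.

```python
import copy
from typing import Any


def with_drain_availability(node_spec: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a swarm node spec with Availability set to 'drain'.

    Hypothetical helper for illustration; the input dict is left untouched.
    """
    new_spec = copy.deepcopy(node_spec)
    new_spec["Availability"] = "drain"
    return new_spec


# Example spec shaped like the NodeSpec object of the Docker Engine API
spec = {"Role": "worker", "Availability": "active", "Labels": {}}
drained = with_drain_availability(spec)
print(drained["Availability"])  # drain
print(spec["Availability"])     # active
```

Returning a modified copy rather than mutating the input keeps the original spec available in case the update request fails.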

With this PR, the termination process becomes more involved. Once a node is deemed terminable (i.e. it has been empty for longer than EC2_INSTANCES_TIME_BEFORE_TERMINATION), the autoscaling service begins the "termination" process.

BEFORE:

AFTER:

NOTE: Once the termination process has started, there is no way back. The process marks the node for termination by adding the docker label io.simcore.osparc-node-termination-started, whose value is the time at which the process started. This means that if the docker engine stops responding for some reason, EC2 instances might accumulate. If this becomes a problem, the tagging could be done through the EC2 API instead.

NOTE2: EC2_INSTANCES_TIME_BEFORE_FINAL_TERMINATION is not exposed as an ENV variable; it is currently hard-coded, as there is no foreseeable reason to make it configurable.
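The two-step process described above can be sketched as a small decision function. This is a rough illustration under stated assumptions, not the implementation in this PR: the `Node` and `Action` types and both functions are hypothetical, the two durations are made-up values, and storing the timepoint as an ISO 8601 string is an assumption. Only the label name and the two setting names come from the description.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional

TERMINATION_STARTED_LABEL = "io.simcore.osparc-node-termination-started"

# Made-up values, for illustration only
EC2_INSTANCES_TIME_BEFORE_TERMINATION = timedelta(minutes=3)
EC2_INSTANCES_TIME_BEFORE_FINAL_TERMINATION = timedelta(seconds=30)


class Action(Enum):
    KEEP = "keep"                            # node is in use or not empty long enough
    START_TERMINATION = "start_termination"  # drain the node and label it
    WAIT = "wait"                            # drained, waiting before final termination
    TERMINATE = "terminate"                  # terminate the EC2 instance


@dataclass
class Node:
    empty_since: Optional[datetime]          # None while the node runs containers
    labels: dict[str, str] = field(default_factory=dict)


def next_action(node: Node, now: datetime) -> Action:
    started = node.labels.get(TERMINATION_STARTED_LABEL)
    if started is not None:
        # No way back: once labelled, the node can only wait or be terminated.
        if now - datetime.fromisoformat(started) > EC2_INSTANCES_TIME_BEFORE_FINAL_TERMINATION:
            return Action.TERMINATE
        return Action.WAIT
    if node.empty_since is not None and (
        now - node.empty_since > EC2_INSTANCES_TIME_BEFORE_TERMINATION
    ):
        return Action.START_TERMINATION
    return Action.KEEP


def start_termination(node: Node, now: datetime) -> None:
    # In a real service this label would be written through the docker engine.
    node.labels[TERMINATION_STARTED_LABEL] = now.isoformat()


now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
node = Node(empty_since=now - timedelta(minutes=5))
print(next_action(node, now).name)                          # START_TERMINATION
start_termination(node, now)
print(next_action(node, now + timedelta(seconds=10)).name)  # WAIT
print(next_action(node, now + timedelta(seconds=45)).name)  # TERMINATE
```

Because the only record that termination has started is the docker label, a docker engine that does not accept the label update leaves the node stuck before the first step, which is the accumulation risk mentioned in the NOTE.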

Related issue/s

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.8%. Comparing base (cafbf96) to head (da76a7a). Report is 214 commits behind head on master.

Additional details and impacted files

[![Impacted file tree graph](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846/graphs/tree.svg?width=650&height=150&src=pr&token=h1rOE8q7ic&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation)](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation)

```diff
@@             Coverage Diff              @@
##           master    #5846       +/-   ##
=========================================
- Coverage    84.5%    66.8%    -17.8%
=========================================
  Files          10      585      +575
  Lines         214    29835    +29621
  Branches       25      205      +180
=========================================
+ Hits          181    19948    +19767
- Misses         23     9835     +9812
- Partials       10       52       +42
```

| [Flag](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation) | Coverage Δ | |
|---|---|---|
| [integrationtests](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation) | `65.1% <ø> (?)` | |
| [unittests](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation) | `97.1% <100.0%> (+12.5%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation) | Coverage Δ | |
|---|---|---|
| [...g/src/simcore\_service\_autoscaling/core/settings.py](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846?src=pr&el=tree&filepath=services%2Fautoscaling%2Fsrc%2Fsimcore_service_autoscaling%2Fcore%2Fsettings.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#diff-c2VydmljZXMvYXV0b3NjYWxpbmcvc3JjL3NpbWNvcmVfc2VydmljZV9hdXRvc2NhbGluZy9jb3JlL3NldHRpbmdzLnB5) | `100.0% <100.0%> (ø)` | |
| [...oscaling/src/simcore\_service\_autoscaling/models.py](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846?src=pr&el=tree&filepath=services%2Fautoscaling%2Fsrc%2Fsimcore_service_autoscaling%2Fmodels.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#diff-c2VydmljZXMvYXV0b3NjYWxpbmcvc3JjL3NpbWNvcmVfc2VydmljZV9hdXRvc2NhbGluZy9tb2RlbHMucHk=) | `100.0% <100.0%> (ø)` | |
| [...e\_service\_autoscaling/modules/auto\_scaling\_core.py](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846?src=pr&el=tree&filepath=services%2Fautoscaling%2Fsrc%2Fsimcore_service_autoscaling%2Fmodules%2Fauto_scaling_core.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#diff-c2VydmljZXMvYXV0b3NjYWxpbmcvc3JjL3NpbWNvcmVfc2VydmljZV9hdXRvc2NhbGluZy9tb2R1bGVzL2F1dG9fc2NhbGluZ19jb3JlLnB5) | `94.3% <100.0%> (ø)` | |
| [...scaling/modules/auto\_scaling\_mode\_computational.py](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846?src=pr&el=tree&filepath=services%2Fautoscaling%2Fsrc%2Fsimcore_service_autoscaling%2Fmodules%2Fauto_scaling_mode_computational.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#diff-c2VydmljZXMvYXV0b3NjYWxpbmcvc3JjL3NpbWNvcmVfc2VydmljZV9hdXRvc2NhbGluZy9tb2R1bGVzL2F1dG9fc2NhbGluZ19tb2RlX2NvbXB1dGF0aW9uYWwucHk=) | `89.6% <ø> (ø)` | |
| [...ore\_service\_autoscaling/utils/auto\_scaling\_core.py](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846?src=pr&el=tree&filepath=services%2Fautoscaling%2Fsrc%2Fsimcore_service_autoscaling%2Futils%2Fauto_scaling_core.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#diff-c2VydmljZXMvYXV0b3NjYWxpbmcvc3JjL3NpbWNvcmVfc2VydmljZV9hdXRvc2NhbGluZy91dGlscy9hdXRvX3NjYWxpbmdfY29yZS5weQ==) | `93.3% <100.0%> (ø)` | |
| [.../simcore\_service\_autoscaling/utils/utils\_docker.py](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846?src=pr&el=tree&filepath=services%2Fautoscaling%2Fsrc%2Fsimcore_service_autoscaling%2Futils%2Futils_docker.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation#diff-c2VydmljZXMvYXV0b3NjYWxpbmcvc3JjL3NpbWNvcmVfc2VydmljZV9hdXRvc2NhbGluZy91dGlscy91dGlsc19kb2NrZXIucHk=) | `100.0% <100.0%> (ø)` | |

... and [589 files with indirect coverage changes](https://app.codecov.io/gh/ITISFoundation/osparc-simcore/pull/5846/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ITISFoundation)

YuryHrytsuk commented 1 month ago

Thank you very much. Great job.

I would consider a longer timeout, only because I remember idle periods of 30 sec × 1/2/3 in the docker engine that we observed while fixing the slow starting time. So at least 30 sec would feel safer.

But this is all more of a guess. I don't have solid proof or a good number for that.

sonarcloud[bot] commented 1 month ago

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud