Architecture: Make services restartable

Originally discussed in https://github.com/ITISFoundation/osparc-simcore/discussions/5560

# Glossary

## Definitions

### High availability service

- never crashes (sic),
- in case of a service restart on a single machine (e.g. master or last machine standing), the replacement instance first starts until readiness, then the replaced instance shuts down (~= continuous deployment),
- redundant (# service instances > 1) and distributed on different machines (the failure of 1 instance does not fail the system),
- in case of unexpected downtime, OPs shall detect it before users do

### Scalable service

- able to run as multiple service instances without breaking functionality,
- ideally shares the load between the service instances

### Restartable service

- the new service can start without breaking itself or the service it replaces (it runs in parallel with the to-be-replaced service for a short amount of time)

### Resumable service

- the replacing service restarts all the tasks the replaced service was doing (this implies saving its current state before crashing or switching off),
- or the service client identifies that the service restarted and restarts the lost tasks (this implies identifying that the service was removed, and retrying)
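The first variant of a resumable service (state saved so that the replacing instance can pick up the work) could look roughly like the sketch below. It is only an illustration: `SharedStore` stands in for a database shared by all instances, and `Worker`, `accept`, `run_one` and `resume` are hypothetical names, not oSparc code.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """Stands in for a database shared by all instances (assumption)."""

    pending: dict = field(default_factory=dict)  # task_id -> payload
    done: dict = field(default_factory=dict)  # task_id -> result


class Worker:
    """Hypothetical service instance that keeps no task state in memory."""

    def __init__(self, name: str, store: SharedStore) -> None:
        self.name = name
        self.store = store

    def accept(self, task_id: str, payload: str) -> None:
        # Persist the task before working on it, so it survives a crash.
        self.store.pending[task_id] = payload

    def run_one(self, task_id: str) -> None:
        payload = self.store.pending[task_id]
        self.store.done[task_id] = payload.upper()  # placeholder for real work
        del self.store.pending[task_id]

    def resume(self) -> list:
        """Re-run every task a previous instance left unfinished."""
        resumed = sorted(self.store.pending)
        for task_id in resumed:
            self.run_one(task_id)
        return resumed
```

An instance that replaces a crashed one would call `resume()` once at startup. This only works if the tasks are safe to run twice, since a crash can happen after the work is done but before the task is removed from `pending`.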

## Communication among services in oSparc

### REST API requests

A service calls a REST API entrypoint of another service, which returns a direct response.

- a REST call has a timeout of X seconds, anything longer fails
- no option to get the request progress
- a request can fail due to network failure
- if the server is restarted or crashes, the request fails

### oSparc REST Long running task

A long running task is a task that typically takes a long time to complete (for example copying S3 files - project copy) and follows the procedure:

1. POST /tasks --> starts a long task, returns the task ID
2. GET /tasks/{id}/status --> gets the status of a task, returns whether the task is running and its progress
3. GET /tasks/{id}/result --> gets the result of a task

- all requests are short (few ms)
- returns the request progress
- a request can fail due to network failure
- if the server is restarted or crashes, the request fails
- **if the server has multiple instances, then the client must always talk to the same instance in order to get the status**

### RPC call through Message broker (for example RabbitMQ)

A service calls a function signature in the message broker, which delivers the message to another service, which will eventually respond.

- the caller is agnostic to the callee, it only needs to know the function signature
- the broker can be configured to retry distributing the task if the service is not available, restarted or crashed
- issue if the broker is overloaded
- no option to get the request progress

### RPC long running task (not yet implemented)

A long running task would be something along these lines:

1. RPC: create task --> starts a long running task, returns the task ID
2. RPC: get task status(ID) --> returns the task status and its progress
3. RPC: get task result(ID) --> returns the task result

(Note that this is not the real implementation, it could be a Python generator, a Celery task or anything else)

- the caller is agnostic to the callee, it only needs to know the function signature
- the broker can be configured to retry distributing the task if the service is not available, restarted or crashed
- issue if the broker is overloaded
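The start / status / result procedure of a long running task, and the reason a client is tied to one instance, can be illustrated with an in-memory sketch. Nothing here is the actual oSparc implementation: `TaskInstance` and `run_to_completion` are hypothetical names, the three methods only mirror the three endpoints listed above, and progress is simulated by advancing the task on every status call.

```python
import uuid


class TaskInstance:
    """One service instance keeping its long running tasks in memory."""

    def __init__(self) -> None:
        self._tasks = {}  # task_id -> {"done": int, "steps": int}

    def post_task(self, steps: int) -> str:
        # Mirrors POST /tasks: start a task and return its ID.
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = {"done": 0, "steps": steps}
        return task_id

    def get_status(self, task_id: str) -> dict:
        # Mirrors GET /tasks/{id}/status.
        task = self._tasks[task_id]  # KeyError if this instance never saw the task
        if task["done"] < task["steps"]:
            task["done"] += 1  # simulated progress (assumption)
        return {
            "running": task["done"] < task["steps"],
            "progress": task["done"] / task["steps"],
        }

    def get_result(self, task_id: str) -> str:
        # Mirrors GET /tasks/{id}/result.
        task = self._tasks[task_id]
        if task["done"] < task["steps"]:
            raise RuntimeError("task not finished")
        return "finished"


def run_to_completion(instance: TaskInstance, steps: int):
    """Client side: start a task, poll its status, then fetch the result."""
    task_id = instance.post_task(steps)
    progress = []
    while True:
        status = instance.get_status(task_id)
        progress.append(status["progress"])
        if not status["running"]:
            break
    return task_id, progress, instance.get_result(task_id)
```

Because each `TaskInstance` stores its tasks in its own dictionary, a task ID returned by one instance is unknown to any other instance, and to the same service after a restart. Moving the task table to shared storage would presumably remove that constraint.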

# Current oSparc issues

## non-scalable services

### storage

#### long running tasks:

- `POST /v0/simcore-s3/folders` - copy folders of a project

#### background task:

- multipart upload garbage collection (cleans DB and/or S3) - does not prevent restarting/scaling as it relies on the DB, but generates unwanted traffic

### director-v2

#### background tasks:

- dynamic scheduler - **prevents restarting**:
  - bugs (some of which are unknown) and no tests that guarantee it can restart
  - the list of steps to start/stop a service is: 1) too big and the steps do too much 2) the steps were not designed to be restarted
  - the internal state of a service is "saved" only when all steps are done processing (which makes it useless in most cases) -> **director-v2 can be restarted ONLY ONCE no more services are starting or stopping!**
  - the current code that starts and stops services is written so poorly that changing anything is very time consuming and very likely to fail -> requires a rewrite to become usable.
- computational scheduler - **unsure, might work but will generate unwanted additional traffic**
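One possible direction for the dynamic scheduler issue is to save the state after every step instead of once all steps are done, so that a restarted instance continues from the last completed step. The sketch below only illustrates that idea and is not taken from director-v2: `run_steps`, `SimulatedCrash`, the `checkpoint` dictionary (standing in for persistent storage) and the step names are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple


class SimulatedCrash(Exception):
    """Raised to imitate the service being killed between two steps."""


def run_steps(
    steps: List[Tuple[str, Callable[[], None]]],
    checkpoint: dict,
    crash_before: Optional[str] = None,
) -> List[str]:
    """Run named steps in order, skipping those already recorded as completed.

    `checkpoint` stands in for persistent storage (assumption); it is updated
    after every single step rather than once at the very end.
    """
    completed = checkpoint.setdefault("completed", [])
    executed = []
    for name, step in steps:
        if name in completed:
            continue  # done by a previous run, do not repeat it
        if name == crash_before:
            raise SimulatedCrash(name)
        step()
        completed.append(name)  # save state right after the step
        executed.append(name)
    return executed
```

A crash between `step()` and the checkpoint update would cause that step to run again after the restart, so this approach assumes that every step is small and idempotent.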

### Tasks
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5621
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5634
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4524
- [ ] REST client shall identify when a service disappeared and restart tasks
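
The last task (a client that notices the service disappeared and restarts the lost task) might be sketched as follows. This is a simplified illustration under stated assumptions: `FakeService` imitates a service that loses its in-memory tasks on restart, `TaskLost` is a hypothetical error for an unknown task ID, and `run_with_restart` with its `before_poll` hook is an invented helper, not an existing oSparc client.

```python
from typing import Callable, Optional, Tuple


class TaskLost(Exception):
    """Hypothetical error: the service does not know the given task ID."""


class FakeService:
    """Imitates a service whose tasks live in memory and vanish on restart."""

    def __init__(self, polls_needed: int = 3) -> None:
        self._polls_needed = polls_needed
        self._tasks = {}  # task_id -> [payload, polls_seen]
        self._next_id = 0

    def start(self, payload: str) -> int:
        self._next_id += 1
        self._tasks[self._next_id] = [payload, 0]
        return self._next_id

    def poll(self, task_id: int) -> bool:
        """Return True once the task is finished."""
        if task_id not in self._tasks:
            raise TaskLost(task_id)
        self._tasks[task_id][1] += 1
        return self._tasks[task_id][1] >= self._polls_needed

    def result(self, task_id: int) -> str:
        return "done: " + self._tasks[task_id][0]

    def simulate_restart(self) -> None:
        self._tasks.clear()  # all in-memory state is lost


def run_with_restart(
    service: FakeService,
    payload: str,
    max_restarts: int = 3,
    before_poll: Optional[Callable[[], None]] = None,
) -> Tuple[str, int]:
    """Poll a task until it finishes, re-submitting it if the service lost it."""
    restarts = 0
    task_id = service.start(payload)
    while True:
        if before_poll is not None:
            before_poll()  # test hook, e.g. to trigger a simulated restart
        try:
            if service.poll(task_id):
                return service.result(task_id), restarts
        except TaskLost:
            if restarts == max_restarts:
                raise
            restarts += 1
            task_id = service.start(payload)  # restart the lost task
```

Re-submitting is only safe if the task itself is idempotent, for example a copy that overwrites its destination; otherwise the client would need a way to find out how far the lost task got.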

