# Definitions
### High availability service
- never crashes (sic),
- when the service is restarted on a single machine (e.g. master or last machine standing), the replacement instance first starts and reaches readiness, and only then does the replaced instance shut down (~= continuous deployment),
- redundant (# service instances > 1) and distributed on different machines (the failure of 1 instance does not fail the system),
- in case of unexpected downtime, OPs shall detect it before users do
### Scalable service
- able to run in multiple service instances without breaking functionality,
- ideally shares the load between the service instances
### Restartable service
- the new service can start without breaking itself or the service it replaces (it is running in parallel with the to-be-replaced service for a short amount of time)
### Resumable service
- the replacing service restarts all the tasks the replaced service was doing (this implies saving its current state before crashing or switching off),
- or the service client identifies that the service restarted and restarts the lost tasks (this implies detecting that the service was removed, and retrying)
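
The second variant (the client notices the restart and resubmits) can be sketched as follows. This is a minimal illustration, not oSparc code: `Service`, `TaskLost`, `run_with_resubmit` and the `restart_on_poll` knob are hypothetical names, and the restart is simulated by wiping the in-memory task table.

```python
class TaskLost(Exception):
    """Raised when the service no longer knows the task (it was restarted)."""


class Service:
    """Hypothetical service that keeps its tasks in memory only."""

    def __init__(self, restart_on_poll=None):
        self._remaining = {}          # task_id -> work units still to do
        self._next_id = 0
        self._polls = 0
        self._restart_on_poll = restart_on_poll

    def submit(self, work_units: int) -> int:
        self._next_id += 1
        self._remaining[self._next_id] = work_units
        return self._next_id

    def poll(self, task_id: int) -> bool:
        """Do one unit of work; return True once the task is complete."""
        self._polls += 1
        if self._polls == self._restart_on_poll:
            self._remaining.clear()   # simulated restart: in-memory tasks are gone
        if task_id not in self._remaining:
            raise TaskLost(task_id)
        self._remaining[task_id] -= 1
        return self._remaining[task_id] <= 0


def run_with_resubmit(service: Service, work_units: int, max_submissions: int = 3) -> int:
    """Run a task to completion, resubmitting it when the service lost it.

    Returns the number of submissions that were needed.
    """
    for submission in range(1, max_submissions + 1):
        task_id = service.submit(work_units)
        try:
            while not service.poll(task_id):
                pass
            return submission
        except TaskLost:
            continue                  # service was restarted: start over
    raise RuntimeError("task kept getting lost, giving up")


submissions_without_restart = run_with_resubmit(Service(), work_units=3)
submissions_with_restart = run_with_resubmit(Service(restart_on_poll=2), work_units=3)
```

Note that the task is rerun from scratch after a restart, so this variant presumably only suits tasks that are safe to execute more than once.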
# Communication among services in oSparc
### REST API requests
A service calls a REST API entrypoint of another service, which returns a direct response:
- a REST call has a timeout of X seconds, anything longer fails
- there is no way to get the progress of a request
- a request can fail due to network failure
- if the server is restarted or crashes, the request fails
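
The timeout behaviour can be illustrated without any network, assuming the endpoint is simulated by a coroutine. `slow_endpoint` and `call` are hypothetical names, and the durations are arbitrary values chosen for the sketch.

```python
import asyncio


async def slow_endpoint(duration: float) -> str:
    """Hypothetical server-side handler that needs `duration` seconds to answer."""
    await asyncio.sleep(duration)
    return "response"


async def call(duration: float, timeout: float) -> str:
    """Call the endpoint, giving up once the client-side timeout has elapsed."""
    try:
        return await asyncio.wait_for(slow_endpoint(duration), timeout=timeout)
    except asyncio.TimeoutError:
        # No partial result and no progress information is available here.
        return "failed: timeout"


async def main() -> list:
    fast = await call(duration=0.01, timeout=0.5)   # answers within the timeout
    slow = await call(duration=0.5, timeout=0.05)   # takes longer than the timeout
    return [fast, slow]


outcomes = asyncio.run(main())
```

The caller of the slow request learns only that it failed; it cannot tell how far the server got, which is what motivates the long running task pattern.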
### oSparc REST Long running task
A long running task is a task that typically takes a long time to complete (for example copying S3 files - project copy) and follows the procedure:
1. POST /tasks --> starts a long task, returns the task ID
2. GET /tasks/{id}/status --> gets the status of a task, returns whether the task is running and its progress.
3. GET /tasks/{id}/result --> gets the result of a task
- all requests are short (a few ms)
- returns request progress
- a request can fail due to network failure
- if the server is restarted or crashes, the request fails
- **if the server has multiple instances, the client must always talk to the same instance in order to get the status**
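
The procedure above, and the reason the client is tied to one instance, can be sketched in memory. This assumes each instance keeps its tasks in local memory; `ServerInstance`, `tick` and `poll_until_done` are hypothetical names, and a `KeyError` stands in for an HTTP 404.

```python
import itertools


class ServerInstance:
    """Hypothetical service instance keeping its tasks in local memory only."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._tasks = {}
        self._ids = itertools.count(1)

    def post_task(self, work_units: int) -> str:       # POST /tasks
        task_id = f"{self.name}-{next(self._ids)}"
        self._tasks[task_id] = {"done": 0, "total": work_units}
        return task_id

    def tick(self) -> None:
        """Advance every task by one unit (stands in for background work)."""
        for task in self._tasks.values():
            task["done"] = min(task["done"] + 1, task["total"])

    def get_status(self, task_id: str) -> dict:        # GET /tasks/{id}/status
        task = self._tasks[task_id]                    # KeyError ~ 404 Not Found
        return {
            "running": task["done"] < task["total"],
            "progress": task["done"] / task["total"],
        }

    def get_result(self, task_id: str) -> str:         # GET /tasks/{id}/result
        task = self._tasks[task_id]
        if task["done"] < task["total"]:
            raise RuntimeError("task still running")
        return f"{task_id} finished"


def poll_until_done(server: ServerInstance, task_id: str, max_polls: int = 100):
    """Poll the status until the task stops running, then fetch the result."""
    progress_seen = []
    for _ in range(max_polls):
        status = server.get_status(task_id)
        progress_seen.append(status["progress"])
        if not status["running"]:
            return server.get_result(task_id), progress_seen
        server.tick()                                  # time passes on the server
    raise TimeoutError("task did not finish in time")


instance_a = ServerInstance("a")
instance_b = ServerInstance("b")
task_id = instance_a.post_task(work_units=4)
result, progress = poll_until_done(instance_a, task_id)

try:
    instance_b.get_status(task_id)
    other_instance_knows_task = True
except KeyError:
    other_instance_knows_task = False                  # the task lives only on instance a
```

If a load balancer routed the status request to `instance_b`, the client would get a not-found answer even though the task is running fine on `instance_a`.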
### RPC call through Message broker (for example RabbitMQ)
A service calls a function by its signature through the message broker, which delivers the message to another service that will eventually respond:
- the caller is agnostic to the callee, it only needs to know the function signature
- the broker can be configured to retry delivering the task if the service is unavailable, restarted or crashed
- it becomes an issue if the broker is overloaded
- there is no way to get the progress of a request
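
A toy model of this pattern, assuming a broker that redelivers a request when the callee fails to handle it. `InMemoryBroker` is a hypothetical stand-in written for this sketch and does not reflect the API of RabbitMQ or of any client library.

```python
import itertools
from collections import deque


class InMemoryBroker:
    """Toy stand-in for a message broker; not a real client API."""

    def __init__(self, max_deliveries: int = 3) -> None:
        self._queue = deque()
        self._handlers = {}
        self._ids = itertools.count(1)
        self.replies = {}
        self.max_deliveries = max_deliveries

    def register(self, name: str, handler) -> None:
        """Callee side: expose a function under a name."""
        self._handlers[name] = handler

    def request(self, name: str, *args) -> int:
        """Caller side: only the function name and arguments are needed."""
        request_id = next(self._ids)
        self._queue.append((request_id, name, args, 1))
        return request_id

    def deliver_all(self) -> None:
        """Deliver queued requests, requeueing those the callee could not handle."""
        while self._queue:
            request_id, name, args, attempt = self._queue.popleft()
            handler = self._handlers.get(name)
            try:
                if handler is None:
                    raise ConnectionError("no service registered for " + name)
                self.replies[request_id] = handler(*args)
            except ConnectionError:
                if attempt < self.max_deliveries:
                    self._queue.append((request_id, name, args, attempt + 1))
                else:
                    self.replies[request_id] = None    # gave up after max_deliveries


calls = {"count": 0}


def flaky_add(a: int, b: int) -> int:
    """Callee that is unavailable on the first delivery (simulated restart)."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("callee restarted")
    return a + b


broker = InMemoryBroker()
broker.register("add", flaky_add)
request_id = broker.request("add", 2, 3)
broker.deliver_all()
reply = broker.replies[request_id]
```

The caller never learns that the first delivery failed; it only sees the eventual reply, and it still has no view of progress in between.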
### RPC long running task (not yet implemented)
A long running task would be something along these lines:
1. RPC: create task --> starts a long running task, returns the task ID
2. RPC: get task status(ID) --> returns the task status and its progress
3. RPC: get task result(ID) --> returns the task result
(Note that this is not the real implementation; it could be a Python generator, a Celery task or anything else)
- the caller is agnostic to the callee, it only needs to know the function signature
- the broker can be configured to retry delivering the task if the service is unavailable, restarted or crashed
- it becomes an issue if the broker is overloaded
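
Since nothing is implemented yet, the following is only one possible shape, picking the Python generator option mentioned in the note. All names (`copy_files`, `TaskService`, `work`) are hypothetical, and the broker is left out so that the three calls are plain method calls.

```python
from typing import Generator


def copy_files(files: list) -> Generator[float, None, str]:
    """Hypothetical long task: yields progress in (0, 1], returns the result."""
    for index, _name in enumerate(files, start=1):
        # ... the real work for one file would happen here ...
        yield index / len(files)
    return f"copied {len(files)} files"


class TaskService:
    """Callee side; the first three methods mirror the three RPC calls."""

    def __init__(self) -> None:
        self._tasks = {}

    def create_task(self, files: list) -> str:
        task_id = f"task-{len(self._tasks) + 1}"
        self._tasks[task_id] = {"gen": copy_files(files), "progress": 0.0, "result": None}
        return task_id

    def get_task_status(self, task_id: str) -> dict:
        task = self._tasks[task_id]
        return {"done": task["result"] is not None, "progress": task["progress"]}

    def get_task_result(self, task_id: str) -> str:
        result = self._tasks[task_id]["result"]
        if result is None:
            raise RuntimeError("task not finished yet")
        return result

    def work(self, task_id: str) -> None:
        """Advance the task by one step (a background worker would call this)."""
        task = self._tasks[task_id]
        if task["result"] is not None:
            return                                   # already finished
        try:
            task["progress"] = next(task["gen"])
        except StopIteration as finished:
            task["result"] = finished.value          # the generator's return value


service = TaskService()
task_id = service.create_task(["a.dat", "b.dat"])
progress_log = []
while not service.get_task_status(task_id)["done"]:
    service.work(task_id)
    progress_log.append(service.get_task_status(task_id)["progress"])
result = service.get_task_result(task_id)
```

As written, the task state still lives in one process, so this sketch shares the same-instance limitation of the REST variant unless the state is moved to shared storage.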
# Current oSparc issues
## non-scalable services
### storage
#### long running tasks:
- ```POST /v0/simcore-s3/folders``` - copy folders of a project
#### background task:
- multipart upload garbage collection (cleans DB and/or S3) - does not prevent restarting/scaling as it relies on DB but generates unwanted traffic
### director-v2
#### background tasks:
- dynamic scheduler - **prevents restarting**:
  - bugs (some of which are unknown) and no tests that guarantee it can restart
  - the list of steps to start/stop a service is: 1) too long, and the steps do too much 2) not designed to be restarted
  - the internal state of a service is "saved" only when all steps are done processing (which makes it useless in most cases) -> **director-v2 can be restarted ONLY ONCE no more services are starting or stopping!**
  - the current code that starts and stops services is written so poorly that changing anything is very time consuming and very likely to fail -> requires a rewrite to become usable
- computational scheduler - **unsure, might work but will generate unwanted additional traffic**
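
The cost of saving state only once all steps are done can be made concrete with a small simulation. The step names below are invented for illustration and are not the actual director-v2 steps; `saved` stands in for whatever persistent storage would be used.

```python
# Hypothetical step names, for illustration only.
STEPS = ["create_network", "start_sidecar", "pull_inputs", "start_user_service"]


def run_steps(saved: dict, executed: list, save_each_step: bool, crash_before=None) -> bool:
    """Run the steps from the last saved position.

    Returns False if the instance "crashed" before finishing.
    """
    start = saved.get("next_step", 0)
    for index in range(start, len(STEPS)):
        if index == crash_before:
            return False                      # simulated restart mid-sequence
        executed.append(STEPS[index])
        if save_each_step:
            saved["next_step"] = index + 1    # persist progress after every step
    saved["next_step"] = len(STEPS)           # state saved once everything is done
    return True


def simulate(save_each_step: bool) -> list:
    """Crash once before the last step, then let a new instance take over."""
    saved, executed = {}, []
    run_steps(saved, executed, save_each_step, crash_before=3)
    run_steps(saved, executed, save_each_step)
    return executed


executed_when_saving_at_end = simulate(save_each_step=False)
executed_when_saving_each_step = simulate(save_each_step=True)
```

With state saved only at the end, the replacement instance reruns the three steps that had already completed, which is only harmless if every step is idempotent; with per-step saving it runs just the missing one.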
### Tasks
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5621
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5634
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4524
- [ ] REST client shall identify when a service disappeared and restart tasks
Originally discussed in https://github.com/ITISFoundation/osparc-simcore/discussions/5560