mguidon commented 1 year ago

Description

With the latest version of sim4life.io, we are introducing an improved computational backend that ensures reliable and efficient job scheduling. Moving forward, all solver jobs will be scheduled via this backend, enabling users to choose the hardware on which their jobs run and to inspect and operate on the job queue (subject to sufficient permissions).

This robust backend will be capable of handling hundreds of concurrent jobs, so that even the busiest periods do not disrupt service.

Furthermore, the backend functionality will also be made available through the API, allowing for integration with external systems (e.g. the sim4life desktop application) and further expanding the possibilities for users.

## Tasks
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4643
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4530
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4525
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/921
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/982
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/617
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/3999
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5094
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4524
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5073
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5074
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1196
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5000
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5293
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5436
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4880
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5336
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4526

### Enchanted Odyssey
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5493

### Schoggilebe
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5497
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5437
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5294
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5339
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5290
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1277
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5403

### This is Sparta!
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5218
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4727
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1218
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1219
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5251
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5203
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5237
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5261
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5264
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5149
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5252
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5287

### Kobayashi Maru
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5087
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1181
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5071
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5024
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5101
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5129
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5146
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5108
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5120
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5141
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5147
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5155
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5162
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5164
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5163
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5165
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5167
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5195
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5204
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5201
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5076

### 7Peaks
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4159
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4958
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4781
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/621
- [x] Preferences: add preferences for max number of concurrent jobs
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1180
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4999
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/4975
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5008
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5010
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5013
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5042
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5054
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5018
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5025
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5031
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5026
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5032
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/5066

### Microhistory
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1034
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/4915
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/4930
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/3209

### Quilmes
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4517
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4756
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1126
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4376

### Sundae
- [ ] https://github.com/ITISFoundation/osparc-simcore/pull/4429
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4153
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4523

### Baklava
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/4637

sanderegg commented 1 year ago

Goal for sprint Pastel de Nata

progress on AppTeam Std Simulations, ideally run CF use-case
refactoring on computational backend, progress on separating PublicAPI calls from webserver load, return solver progress
progress on Public API missing entrypoints, and bug fixes
if possible progress on personalized resource limits

sanderegg commented 1 year ago

Update for sprint Pastel de Nata

Done:

Before on prod (05/31/2023) - all instances of webserver are loaded the same because they are coupled through logs/progress/db handling:

After on master (05/31/2023) - each instance is less coupled, still waiting for the ongoing PR to completely separate them:

mguidon commented 1 year ago

Update Watermelon

Done:

Personalized resource limits: users can now select the desired resource requirements to run computational and dynamic services
https://github.com/ITISFoundation/osparc-simcore/issues/4271
https://github.com/ITISFoundation/osparc-simcore/issues/4350

Ongoing:

Robustness improvements/refactoring

sanderegg commented 1 year ago

Update Sundae

Done:

bugfixes #4153
connection of computational backend to resource usage tracking service #4523
new clusters keeper service to automatically create computational clusters in AWS #4591

Ongoing:

connect clusters-keeper service to oSparc and create computational clusters on the fly
https://github.com/ITISFoundation/osparc-issues/issues/621
https://github.com/ITISFoundation/osparc-issues/issues/885
Robustness improvements/Bugfixes from usage

sanderegg commented 1 year ago

Update Baklava

Done:

https://github.com/ITISFoundation/osparc-simcore/issues/4637

Ongoing:

https://github.com/ITISFoundation/osparc-simcore/issues/4159
- https://github.com/ITISFoundation/osparc-simcore/issues/4521
- https://github.com/ITISFoundation/osparc-simcore/issues/4522

sanderegg commented 1 year ago

The schema below shows the overall architecture for the on-demand clusters. Some important points here are:

the computational clusters are created per user/wallet
in case of maintenance in simcore, these clusters shall be able to continue running independently
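
The "one cluster per user/wallet" rule can be pictured as a lookup keyed on that pair. The following is a minimal sketch only: `ClusterKey`, `ClusterRegistry` and the generated cluster names are hypothetical and are not taken from the clusters-keeper code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ClusterKey:
    """Hypothetical key: one cluster per (user, wallet) combination."""
    user_id: int
    wallet_id: int


@dataclass
class ClusterRegistry:
    """Hypothetical in-memory registry, for illustration only."""
    _clusters: dict = field(default_factory=dict)

    def get_or_create(self, user_id: int, wallet_id: int) -> str:
        key = ClusterKey(user_id, wallet_id)
        if key not in self._clusters:
            # A real service would start a primary machine here;
            # this sketch only records a made-up cluster name.
            self._clusters[key] = f"cluster-u{user_id}-w{wallet_id}"
        return self._clusters[key]


registry = ClusterRegistry()
first = registry.get_or_create(user_id=1, wallet_id=10)
again = registry.get_or_create(user_id=1, wallet_id=10)
other = registry.get_or_create(user_id=1, wallet_id=20)
```

Here `first` and `again` refer to the same cluster, while `other` is a separate one because the wallet differs.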

sanderegg commented 11 months ago

Update Quilmes

Done:

Ongoing:

https://github.com/ITISFoundation/osparc-simcore/issues/4522

sanderegg commented 10 months ago

Update Microhistory

Done and working

A separate cluster is created on demand in AWS for each user/wallet combination.
Each cluster has a primary machine (t2.micro), on which a stack containing the dask-scheduler, autoscaling, redis and dask-sidecar services is started; dask-sidecar only runs on worker machines.
The autoscaling service creates 1 worker machine (g4dn.xlarge).
Only computational services that use a pricing unit defined as a g4dn.xlarge machine can run.
A computational service uses all the resources provided by the machine (a bit less than 16 GB / 4 CPUs).

--> Running computational services should work for one service at a time, provided they are set up to use the g4dn.xlarge machine type. There is no upscaling of the machines, so parallel jobs have to wait in line (if multiple isolve jobs are sent, they are executed one after the other).
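
To make the "wait in line" behaviour concrete, here is a small simulation, assuming first-come-first-served dispatch to whichever worker becomes free first. The `schedule` function and the job durations are hypothetical; this is not how dask or the autoscaling service is implemented.

```python
import heapq


def schedule(durations, workers=1):
    """Return (start, end) times per job, dispatching each job in
    submission order to the worker that becomes free first."""
    free_at = [0.0] * workers  # time at which each worker is next free
    heapq.heapify(free_at)
    timeline = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + duration
        heapq.heappush(free_at, end)
        timeline.append((start, end))
    return timeline


# Three hypothetical 30-minute jobs on a single worker machine.
single_worker = schedule([30, 30, 30], workers=1)
# The same jobs if three worker machines were available.
three_workers = schedule([30, 30, 30], workers=3)
```

With one worker the jobs occupy 0 to 30, 30 to 60 and 60 to 90 minutes; with three workers all of them would run from 0 to 30.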

should work in 3 weeks

Cluster shall create correct machine based on plan (not just g4dn.xlarge), so potentially better machine fit/performance,
identify computational child jobs (for example started from s4l) and show them in UI
maybe upscaling of separate cluster (needs discussions on how to do it, it has influence on costs, etc)

should not be available in 3 weeks

upscaling?
optimisations

sanderegg commented 9 months ago

Update 7peaks

Summary

It is now possible to run computational services on their required AWS instance types. Also, the logs of child computational jobs now show up in the logs of the parent service (e.g. sim4life/jupyterlab starting a computational job). Upscaling is still not implemented.

Done

Ongoing

bugfixing
improvements on user feedback (cluster status, number of machines, etc...)

sanderegg commented 8 months ago

Update Kobayashi Maru

Summary

bugfixes:
- handling of on-demand computational clusters (timeouts, reported states)
- concurrent computing of tasks
monitoring & manual interventions:
- CLI tool to monitor on-demand computational clusters and dynamic service machines
- partially clear jobs in a specific cluster
- allow tracing of created machines via tags on EC2 instances

Done ✅

various fixes for GPU-based computational services on multi-GPU machines
migration of sleepers test to Playwright framework to have more reliable and more flexible E2E testing and compatibility with on-demand computational clusters
various fixes regarding invalid state reported by the computational clusters
added a timeout for clusters that do not respond for more than 10 minutes
improvement of response time when retrieving the computational clusters state via Public API
new CLI-based monitoring tool to check current state of auto-scaled EC2 instances and their running states
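
The 10-minute timeout for non-responding clusters mentioned in this list could be checked roughly as follows. This is a sketch under assumptions: the heartbeat concept, `CLUSTER_TIMEOUT` and `is_unresponsive` are hypothetical names, and only the 10-minute value comes from the update.

```python
from datetime import datetime, timedelta, timezone

# Value taken from the update; the constant name is hypothetical.
CLUSTER_TIMEOUT = timedelta(minutes=10)


def is_unresponsive(last_heartbeat: datetime, now: datetime,
                    timeout: timedelta = CLUSTER_TIMEOUT) -> bool:
    """True if the cluster has been silent for strictly longer than timeout."""
    return now - last_heartbeat > timeout


last_seen = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
still_ok = is_unresponsive(last_seen, last_seen + timedelta(minutes=10))
timed_out = is_unresponsive(last_seen, last_seen + timedelta(minutes=11))
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to test without waiting for real time to pass.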

Problematic issues (being worked on) 🚧

Tasks not optimally distributed to all machines
Machines not properly shut down - unnecessary costs:
- https://github.com/ITISFoundation/osparc-issues/issues/1219
- https://github.com/ITISFoundation/osparc-issues/issues/1218

Open Features 🚧

sanderegg commented 7 months ago

Update This is Sparta!

Summary

Multi-processing is now working as expected (10 machines will take 10 jobs)
Added facilities for tracing jobs/logs
Secure transmission of data in the computational backend
Stability improvements

Done ✅

Tasks not optimally distributed to all machines
Machines not properly shut down - unnecessary costs:
- https://github.com/ITISFoundation/osparc-issues/issues/1219
- https://github.com/ITISFoundation/osparc-issues/issues/1218
https://github.com/ITISFoundation/osparc-simcore/issues/4727

Ongoing 🚧

Open Features 🚧

sanderegg commented 5 months ago

Update Schoggilebe

Summary

Using Ansible to create AMIs for AWS (reproducibility of machine images) for both dynamic and computational autoscaled machines
Authorize deployment of computational cluster in different AWS regions
Improve labeling of machines, network and volumes for better cost management

ITISFoundation / osparc-issues

sim4life.io - WP4: Computational backend #950
