Architecture and integration design for OpenM++ on dev cluster

chuckbelisle commented 1 year ago

Design a system that provisions the OpenM++ framework as a cloud-based deployment, either on VMs or in containers.

Diagram
Infrastructure
- Namespace
- Node pool

chuckbelisle commented 1 year ago

@jacek-dudek, please update this ticket with any research or elaboration that was made.

jacek-dudek commented 1 year ago

Carried out some learning activities on Terraform, Docker, Kubernetes, and the OpenM++ web service.

Created a Dockerfile for a basic containerized deployment of OpenM++ and for running its web service on start up.

Uploaded Dockerfile to Docker Hub registry.

Created a basic Kubernetes cluster deployment on Azure using Terraform.

Created manifest files for the OpenM++ container and a load balancer to publish the application.

Confirmed that the basic setup runs successfully.
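For readers following along, the two manifests described above (a Deployment for the OpenM++ container and a LoadBalancer Service in front of it) could be sketched roughly as below. This is only an illustration, not the actual files from the repo: the image reference, the `openmpp` label, and port 4040 are assumptions that do not come from this thread. Python is used to build the objects and emit them as JSON, which `kubectl` accepts in place of YAML.

```python
import json

# Assumptions (not taken from this thread): image name, label value, port.
IMAGE = "example/openmpp:latest"  # hypothetical image reference
APP = "openmpp"                   # hypothetical app label / object name
PORT = 4040                       # assumed port of the web service


def build_deployment(image, app, port, replicas=1):
    """Deployment that runs a single OpenM++ container."""
    labels = {"app": app}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app, "labels": dict(labels)},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [
                        {
                            "name": app,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


def build_load_balancer(app, port):
    """Service of type LoadBalancer that forwards port 80 to the container."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{app}-lb"},
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": app},
            "ports": [{"protocol": "TCP", "port": 80, "targetPort": port}],
        },
    }


def build_manifest_list(image=IMAGE, app=APP, port=PORT):
    """Bundle both objects into one List so they can be applied together."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            build_deployment(image, app, port),
            build_load_balancer(app, port),
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_manifest_list(), indent=2))
```

If saved as a script (the file name is up to you), the output can be piped into `kubectl apply -f -`. The Service selector has to match the pod template labels, which is why both are derived from the same `app` value.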

chuckbelisle commented 1 year ago

Next steps

[x] Create a new repo named openmpp under the StatCan org on GitHub.com and clone jacek-dudek/openmpp-on-k8s as a baseline
[x] Define MVP along with architecture diagram (review Steve Gribble's suggestion and documentation)
[ ] Create openmpp infrastructure on AAW dev cluster (namespace or use system NS, netpol, storage, node pool, etc)
[ ] Work on deployment of the PoC

jacek-dudek commented 1 year ago

Clarifying project direction and deliverables: We decided to work towards implementing a cloud offering that has feature parity with the existing microsimulation web service operated by the OpenM++ team on GCP.

Progress made: Did some more background reading of the Kubernetes documentation. Identified Kubernetes objects that will be needed in subsequent iterations of the service. Located a GitHub project that appears to be an implementation of OpenMPI on Kubernetes. URL for project: https://github.com/everpeace/kube-openmpi

Souheil-Yazji commented 1 year ago

Iteration 0 Scope Definition

[ ] Compare and contrast the work done by @vexingly in aaw-contrib-containers to enable microsimulations via OpenM++ with Jacek's work. Extend/Merge/Replace accordingly.
[ ] Make the container image developed by @jacek-dudek available to ACR dev
[ ] Create required Manifests (namespace, deployment, service)

This should enable us to host the OM++ web service on aaw-dev as a starting iteration.

Iteration 0.1 Scope Definition

To be further elaborated over the duration of Iteration 0

[ ] If needed, expose OM++ to the appropriate consuming client in AAW notebook
[ ] Design scheduling process for simulation run job (to a specific nodepool)

vexingly commented 1 year ago

You can find my notes on Kubeflow's integrated MPI training operator, which I used for my POC, here: https://github.com/StatCan/aaw-private/issues/95. The everpeace/kube-openmpi project was evaluated, but it was created 5 years ago and has not been maintained, whereas the Kubeflow training operators are in active development.

Regarding provisioning a separate node pool, are there any project requirements that would need this yet? With the MPI training operator it would be as simple as labeling the manifest with the node type to use, but I think this should come as a special request from specific projects only after they have hit a limitation with our existing nodes.

- Our current (unclassified) default nodes are Standard_D64as_v5 (64 CPU, 256GB RAM), which is pretty flexible for various types of job sizes and different models.
- Our existing nodes already have a lot of unused capacity; we can easily fit a number of jobs without spinning up an additional node at all (i.e. no cost).
- MPI-specific nodes would require a client to wait 15-30m for a new node to be provisioned (no matter how large or small the job) or extra costs to keep an unused node available at all times.
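To illustrate what "labeling the manifest with the node type" could look like if a project does end up needing a dedicated pool, here is a minimal sketch that adds a `nodeSelector` to a job's pod template. It uses a plain `batch/v1` Job as a stand-in; a job managed by the MPI training operator would need the same field set on each of its own pod templates. The label key, the pool name `openmpp-pool`, and the image are assumptions for the example, not values from our cluster.

```python
import copy

# Assumptions (hypothetical, not from this thread): label key/value and image.
POOL_LABEL_KEY = "kubernetes.azure.com/agentpool"
POOL_LABEL_VALUE = "openmpp-pool"


def pin_to_node_pool(job, label_key, label_value):
    """Return a copy of the job whose pods only schedule onto labeled nodes."""
    pinned = copy.deepcopy(job)  # leave the caller's manifest untouched
    pod_spec = pinned["spec"]["template"]["spec"]
    pod_spec.setdefault("nodeSelector", {})[label_key] = label_value
    return pinned


# Stand-in simulation job; name and image are placeholders.
example_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "model-run"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {"name": "model", "image": "example/openmpp:latest"}
                ],
            }
        }
    },
}

pinned_job = pin_to_node_pool(example_job, POOL_LABEL_KEY, POOL_LABEL_VALUE)
```

Leaving `nodeSelector` off keeps the job on the existing default nodes, which matches the points above about unused capacity and provisioning delay.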

chuckbelisle commented 1 year ago

Continuing this work in https://github.com/StatCan/openmpp/issues/3

Architecture and integration design for OpenM++ on dev cluster #1
