[Epic] OpenM++ and model serving as a platform

chuckbelisle commented 1 year ago

Let's think about the big picture of how we want to serve machine learning models and the OpenM++ features. There have also been discussions about converting existing models from Modgen to work with OpenM++. This issue will be the overarching place for information and task definitions throughout the architecting and scoping of the project.

Note that for the time being it will live in the AAW GitHub project, but assume that it will get its own project in the very short term.

I think it makes sense to reiterate some of the stuff that I put down in the performance agreement and separate that into smaller actionable tasks. Here are these components again:

[x] StatCan/openmpp#1
[x] StatCan/openmpp#3
[x] StatCan/aaw#1839
[ ] https://github.com/StatCan/openmpp/issues/8
[x] #9
[x] #12
[ ] Setting up the existing OpenM++ browser based UI so that it runs as a web service on a public IP upon deployment.
[x] #18
[x] #19
[ ] Enable running the parallelized version of OpenM++, which uses the MPI standard for parallel computing.
[ ] I have some experience deploying and provisioning VMs directly, so that would be the approach that I'd like to use for the initial deployment. I can use my personal cloud account for initial prototyping to minimize organizational overhead.

From my reading of the OpenM++ wiki so far, the biggest functionality gap appears to be setting up OpenM++ as a service, and facilitating the uploading and transpiling of Modgen-style models into OpenM++-style models. So in my opinion that would be a priority in terms of providing a well-rounded service.

Also, it might be good for me to engage in some learning tasks about Kubernetes and AAW-based deployments.

chuckbelisle commented 1 year ago

More related information that was sent by Steve Gribble on July 22nd 2023

Subject: OpenM Inc, StatCan and developing internal support team - Follow up

Hi all,

This email follows up on suggested concrete actions I mentioned at our meeting last Wednesday 2023-07-19.

I mentioned mechanical prerequisite StatCan tasks which StatCan can advance now, with no knowledge required of the substance or architecture of ompp, or underlying tech.

These tasks are required for StatCan to make internal contributions to the OpenM++ project (ompp).

Mostly, they involve spec’ing and provisioning development system(s) for each target OS, obtaining and installing required development software, building ompp and all its components and utilities on each target OS, and building and testing the models in the ompp test suite.

These tasks are largely mechanical and are described in our wiki, sometimes with exact command lines or screenshots. Using that information, StatCan should be able to accomplish these prerequisites autonomously.

Of course, feel free to reach out if you encounter issues, or notice errors or omissions in our wiki instructions.

Our ability to help with issues peculiar to the StatCan environment, e.g. permissions for files, devices, OS, and software, is obviously limited.

This email also describes the first substantive task I suggested at the meeting. It then lists some learning resources. It concludes with a thought about hands-on learning for a prospective ompp developer.

Regards to all,

Steve

PREREQUISITE TASKS:

Here’s more detail on the prerequisite tasks for ompp development. This list is organized so you can use it as a checklist to evaluate progress to goal if desired.

Complete and email us the “OpenM++ contributor agreement” for each StatCan developer who will be working on the project.

This is not required for any prerequisite task but will be required to push any software modifications to the ompp git repositories.

Spec and provision one or more StatCan “ompp dev systems” (ODSs) in one or more StatCan security environments.

To build, develop, debug, and test all ompp components.

An ODS is needed to support both Windows and Linux development, either using VMs or separate systems.

Ompp needs to be built and tested in MacOS, so a MacOS ODS is required for that.

An ODS needs to access the various security environments of all existing StatCan models (DemoSim, OncoSim, Pohem, CRISM) to reproduce and troubleshoot issues encountered by StatCan model developers using their models (both release and development versions) in their various security environments.

A StatCan model might exist (and require support) in more than one security environment, e.g. CRISM may have a public version and a confidential version.

Create StatCan “ompp git repos” (OGRs)

Each OGR must be an exact clone of the corresponding ompp git repository on GitHub.

Each OGR needs to be sync’d from time to time with the corresponding ompp git repo (on GitHub). That can be done by copying it as a zip archive to a lower security environment and syncing from there.

Ompp has multiple git repositories, organized on tech lines (see ompp wiki), and the ompp wiki has its own distinct repo of markdown files and images.

Each ODS needs to have access to an OGR from its security environment (by repo duplication if necessary).
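The zip-based sync described above could be sketched roughly as follows. This is only a minimal sketch: the function names and directory layout are hypothetical, the actual transfer between security environments is left out, and it assumes the copy in the lower security environment has the GitHub repo configured as its remote.

```python
import shutil
from pathlib import Path
from typing import List


def archive_repo(repo_dir: Path, archive_base: Path) -> Path:
    """Zip a local repo directory so it can be copied to another environment."""
    # make_archive appends ".zip" to archive_base and returns the final path.
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=repo_dir))


def restore_repo(zip_path: Path, dest_dir: Path) -> Path:
    """Unpack a repo archive produced by archive_repo into dest_dir."""
    shutil.unpack_archive(str(zip_path), str(dest_dir), "zip")
    return dest_dir


def sync_command(repo_dir: Path) -> List[str]:
    """Return the git command that refreshes the restored copy from its remotes."""
    # Assumes the remote (the GitHub repo) is already configured in repo_dir.
    return ["git", "-C", str(repo_dir), "remote", "update", "--prune"]
```

The command is returned as an argument list rather than executed, so it can be reviewed or passed to `subprocess.run(..., check=True)` in whichever environment has network access to GitHub.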

Install required development software on all ODSs

Install all software and tools required to build all ompp components, including stand-alone utilities, on each ODS, in particular on Windows and on Linux.

The ompp wiki contains instructions about the software needed and where to obtain it.

Required software includes C++, Bison, Flex, Go, Perl, Python, R, Node.js, MPI.

Some of this software requires additional configuration after installation to obtain and install secondary components, e.g. Go, Perl, Python, R, Node.js.
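A quick way to check an ODS against the software list above is to look for each tool on the PATH. The sketch below is an assumption-laden example: the executable names are guesses that vary by OS and toolchain (for instance, the C++ compiler on Windows will not be `g++`), and finding an executable says nothing about its version or secondary components.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Executable names are assumptions; adjust them for each target OS.
REQUIRED_TOOLS: Dict[str, str] = {
    "C++": "g++",
    "Bison": "bison",
    "Flex": "flex",
    "Go": "go",
    "Perl": "perl",
    "Python": "python3",
    "R": "Rscript",
    "Node.js": "node",
    "MPI": "mpiexec",
}


def missing_tools(
    tools: Dict[str, str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the labels of tools whose executable is not found on the PATH."""
    return sorted(label for label, exe in tools.items() if which(exe) is None)


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    print("Missing:", ", ".join(missing) if missing else "none")
```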

Build and test all ompp components on each ODS.

Instructions are in our wiki.

Ompp components include: ompp runtime libraries, ompp compiler, Perl standalone tools (e.g. test_models), Go standalone tools and components (dbcopy, oms), the browser-based user interface, the R and Python examples and packages.

Build/run/test all models in the ompp model suite (in the git repo) using test_models.

Build/launch/test the browser-based UI using one or more test models.

Reproduce the R and Python examples in the ompp wiki.

In Windows, use VS to build and trace a Debug version of a model.

In Linux, use Visual Studio Code to build and trace a Debug version of a model.

Install and test Modgen (Windows ODS only)

After Modgen installation, build/run/test all models in the ompp model suite.

Build a model in Modgen in debug mode, and use VS to trace execution using breakpoints.

Create copies of major StatCan models for testing.

Copy (git clone) each major StatCan model in the ODS.

Build/run/test each StatCan model using test_models.

Perform a run of OncoSim at scale using multi-threading and a large population (32M).
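For the at-scale run, the command line could be assembled along these lines. This is a hedged sketch, not the documented OncoSim invocation: the executable path, the thread and sub-value counts, and the parameter name `SimulationCases` are assumptions, and the `-OpenM.Threads`, `-OpenM.SubValues`, and `-Parameter.<name>` options reflect my reading of the ompp wiki and should be verified there.

```python
from typing import List


def build_run_command(model_exe: str, cases: int, threads: int, sub_values: int) -> List[str]:
    """Build the argument list for one multi-threaded model run."""
    if threads < 1 or sub_values < 1:
        raise ValueError("threads and sub_values must be positive")
    return [
        model_exe,
        "-OpenM.Threads", str(threads),
        "-OpenM.SubValues", str(sub_values),
        # Assumed parameter name for the population size; check the model.
        "-Parameter.SimulationCases", str(cases),
    ]


if __name__ == "__main__":
    # 32M cases as in the task above; 16 threads is a hypothetical choice.
    cmd = build_run_command("./OncoSim", cases=32_000_000, threads=16, sub_values=16)
    print(" ".join(cmd))
```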

Ok, that’s it for the mechanical steps to create an environment for ompp development.

SUBSTANTIVE TASK:

I suggested this task because it requires little understanding of ompp and should be doable by following the recipe on our wiki. It is a previously identified outstanding task which will enable StatCan model devs and users to do large numbers of runs remotely inside the StatCan perimeter, and at reasonable cost.

Create an on-demand ompp cloud cluster (OCC) for running StatCan models by setting up a front-end server and some (e.g. 16) “Cloud Main” on-demand back-end servers (16-core, 64 GB each).

Instructions are on our wiki but may need tweaking for the StatCan cloud environment.

We have tested OCCs in Google Cloud and in Microsoft Azure, and examples are on our wiki.

A working example, if required, is the CPAC instance in Google Cloud, which supports StatCan’s OncoSim and associated models, and a community of OncoSim users.
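To put the example sizing in perspective, the aggregate capacity of such a cluster is simple arithmetic. The sketch below uses only the figures quoted above (16 back-end servers, 16 cores and 64 GB each); the class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    servers: int
    cores_per_server: int
    memory_gb_per_server: int

    @property
    def total_cores(self) -> int:
        return self.servers * self.cores_per_server

    @property
    def total_memory_gb(self) -> int:
        return self.servers * self.memory_gb_per_server

    @property
    def memory_gb_per_core(self) -> float:
        return self.memory_gb_per_server / self.cores_per_server


spec = ClusterSpec(servers=16, cores_per_server=16, memory_gb_per_server=64)
print(spec.total_cores)         # 256
print(spec.total_memory_gb)     # 1024
print(spec.memory_gb_per_core)  # 4.0
```

Because the back-end servers are on-demand, these totals are an upper bound on what is running at any one time, not a standing cost.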

Install and test selected StatCan models on the StatCan OCC.

Candidate models are OncoSim, POHEM, CRISM, and DemoSim.

LEARNING MATERIALS:

Here are some notes on learning materials for prospective StatCan ompp developers.

There exist good, easily identified (use Google) web-based resources for learning C++, C++ STL collections, Go, etc.

The manuals for Flex and Bison contain useful introductions to those compiler-compiler tools. That said, compiler-compiler tech and dev is usually considered an advanced subject. There are courses on the subject.

Documentation on Modgen was mentioned at the meeting. Most of that is on the StatCan external (or internal mirror) web site (search for Microsimulation). Don’t miss a self-extracting zip package which contains a set of animated PowerPoint presentations which were used to give a course on Modgen for model devs (the decks were designed with animated callouts for self-learning). The “Modgen Developer’s Guide” is a prime reference on the language used to specify models in Modgen and in ompp.

Navigate to and read all topics in the ompp wiki.

Probe a model using event trace to understand what it’s doing.

HANDS-ON LEARNING:

It might be helpful for a prospective programmer on the ompp project to spend some time working directly on a StatCan model and directly with members of a StatCan model team, as a (temporary) team member doing model development. That would help the programmer to understand the language, the environment, and how both are used at StatCan. And of course, that would help develop a working relationship with StatCan model developers.

Ompp is an advanced technology, which includes a language and environment, formally a bit like R or Python. Both R and Python, as it turns out, are written in the C language, and R/Python packages are often written in C++. So, a programmer working on the R or Python projects needs to know C and C++. But that programmer should also be familiar with R/Python itself, and how it is used. An effective way for a programmer to become familiar with R/Python is to use it to solve problems, like an analyst would. The same is true for ompp.

jacek-dudek commented 1 year ago

Tasks going forward:

For deploying the basic service:
- Assign a domain name.
- Upload container image to the standard location for AAW. (optional)
- Replace the LoadBalancer object with appropriate routing rules in the existing AAW Ingress object. (optional)
- Incorporate into the CI/CD setup used by AAW. (optional)

For service implemented using a StatefulSet workload:
- Package deployment into a Helm chart.
- Research how to specify PersistentVolumes and how to map to appropriate instances of the web service.
- Research how to authenticate different users and how to route their session to appropriate instances of the web service.

For the OpenMPI backend service:
- Work through the GitHub project and try to get any MPI job running on a Kubernetes cluster.
- Study the implementation to determine how to deploy it as a backend service that can be used by the OpenM++ web service instances.

vexingly commented 1 year ago

A few questions / details about the web service that will help to determine the architecture:

Assumptions:

The web service is a combination of a front-end UI (Node.js), which is accessed via browser, and a back-end API (Go) that implements the functionality
Models are compiled into C++ executables and executed by the API as a direct command line execution on the local machine or a remote one (in the case of an MPI cluster)
The web UI is used to configure the parameters for the model runs as well as execution details, and to view and export model run results
The parameters, run details & results are all stored in an SQLite database specific to each model
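If the last assumption holds, a model's database can be inspected with any SQLite client, independent of the web service. Here is a minimal sketch that lists the tables in such a file; the file name in the usage comment is hypothetical, and nothing is assumed about the OpenM++ schema itself. The file is opened read-only so that inspection cannot modify run data.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import List


def list_tables(db_path: Path) -> List[str]:
    """Return the names of all tables in an SQLite file, opened read-only."""
    # mode=ro makes SQLite refuse writes and fail if the file does not exist.
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Usage (hypothetical file name):
# print(list_tables(Path("models/bin/MyModel.sqlite")))
```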

If a centralized web service is used for multi-tenancy some questions would be:

[ ] How is user authentication handled? Our existing applications either use full trust within a namespace or implement Azure SSO.
[ ] Can users (or projects) have model / database storage that is separate from other users?
[ ] Will all projects be using the same version of OpenM++? New versions could require a model's database to be manually updated, which may or may not be backwards compatible.
[ ] Is the solution mature enough to share a web service instance between unclassified / protected-b and/or internal / external clients and/or development / production?

I think these questions would determine whether we're able to share a web instance between projects, whether the web instance should be isolated per namespace, or even whether multiple instances are required per project based on version / data sensitivity / dev / prod.

As far as I can tell the benefits of sharing a web instance would be:

Not needing an AAW notebook at all
- This would only be a long term goal since a security review would be required which will take some time and cost
Team members sharing the database for a model so they can share all of the run parameters/results
- This would be accomplished by sharing the database file storage, assuming no issues with concurrency
Reduce resource usage from not duplicating the web service
- Web service overhead appears to be very low as it is a very efficient architecture, once actual processing is taken to a backend
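The concurrency caveat on sharing the database file can be probed directly. The sketch below is an illustration under assumptions, not a statement about how the OpenM++ web service accesses its database: it simply checks whether a write transaction can be started on an SQLite file within a short timeout, which fails while another connection holds the write lock.

```python
import sqlite3


def can_write(db_path: str, timeout_s: float = 0.1) -> bool:
    """Return True if a write transaction can be started within timeout_s."""
    # isolation_level=None lets us issue BEGIN/ROLLBACK explicitly.
    conn = sqlite3.connect(db_path, timeout=timeout_s, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")  # takes the write lock or raises
        conn.execute("ROLLBACK")
        return True
    except sqlite3.OperationalError:
        # Typically "database is locked": another connection is writing.
        return False
    finally:
        conn.close()
```

SQLite allows only one writer at a time per database file, and its documentation cautions that file locking can be unreliable on some network filesystems, so the type of shared storage would matter here.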

vexingly commented 1 year ago

Dropped a mini-PoC you can find here to install / launch the OpenM++ web interface from the JupyterLab notebook as a shortcut. It is a much better experience than using the remote desktop image and simpler than a centralized interface.

StatCan / openmpp

[Epic] OpenM++ and model serving as a platform #2