Refactor Manifest building solution to accommodate openmpp#30 - Githubissues

StatCan / openmpp

Implementing the OpenM++ microsimulation framework as a Kubernetes service on the StatCan cloud.


Refactor Manifest building solution to accommodate openmpp#30 #32

Closed Souheil-Yazji closed 11 months ago

Souheil-Yazji commented 11 months ago

Child issue of https://github.com/StatCan/openmpp/issues/30

When running MPI models from the UI, we need to set the ulimits in the same shell instance, immediately before the model executes.

Suggested Solution

We can build the mpirun command into a shell script that sets the ulimits first, and then always call that shell script, in a similar fashion to what I do in #openmpp-30.
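A minimal sketch of how the manifest builder could generate such a wrapper script. It follows the reading in which the script holds the ulimit call plus the model invocation and is itself handed to mpirun. The function name `build_wrapper_script`, the `./model_mpi` executable, its flag, and the choice of `ulimit -s unlimited` are all assumptions for illustration; the issue does not say which limit is being set.

```python
import shlex


def build_wrapper_script(model_command, ulimit_args=("-s", "unlimited")):
    """Return the text of a bash script that sets a ulimit, then runs the model.

    model_command: list of strings (executable plus arguments), hypothetical here.
    ulimit_args:   arguments for the shell builtin `ulimit`; the stack-size
                   setting used as the default is an assumption, not from the issue.
    """
    # Quote every token so spaces or shell metacharacters cannot break the script.
    ulimit_line = "ulimit " + " ".join(shlex.quote(a) for a in ulimit_args)
    # `exec` replaces the shell with the model, so the limit applies to that process.
    model_line = "exec " + " ".join(shlex.quote(a) for a in model_command)
    return "\n".join(["#!/bin/bash", "set -e", ulimit_line, model_line, ""])


# Hypothetical model executable and flag, for illustration only.
script = build_wrapper_script(["./model_mpi", "--example-flag", "value"])
print(script)
```

Because the limit and the model run in the same script, both happen in the same shell instance on whichever node executes it.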

Note: we run mpirun ... .sh on the launcher via ["bin/bash", "-c", "mpirun ..."]. Because the shell script is passed as the mpirun executable argument, it runs on each worker.
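The launcher side of this can be sketched as a small helper that produces the three-element command list. The helper name `build_launcher_command`, the script path, and the process count are hypothetical; the sketch also assumes the absolute path `/bin/bash` is intended and that the installed mpirun accepts `-n` for the number of processes.

```python
import shlex


def build_launcher_command(script_path, num_processes):
    """Return the launcher command as [shell, "-c", "<mpirun command line>"].

    script_path must be readable from every worker, since mpirun starts it there.
    """
    # The wrapper script takes the place of the executable argument to mpirun.
    mpirun_line = f"mpirun -n {int(num_processes)} {shlex.quote(script_path)}"
    return ["/bin/bash", "-c", mpirun_line]


# Hypothetical mount point for the shared script.
command = build_launcher_command("/mnt/blob/run_model.sh", 4)
print(command)
```

Only the launcher runs the `-c` string; each worker then starts the script itself, which is why the script has to live somewhere all workers can read.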

Success Criteria

[x] MPI jobs are successfully submitted to the cluster
[x] The shell script is kept on blob storage (it must be accessible by the workers)
~The shell script will need to be unique -> handles the race condition where two employees submit different MPI models at the same time~

jacek-dudek commented 11 months ago

Linked a pull request to this issue. The changes in the pull request appear to resolve the segmentation fault that was thrown when oncosim was previously run as an mpijob. However, I am getting database corruption issues on the oncosim databases, which prevent me from confirming that we have a successful oncosim mpijob run via the UI. I plan to rebuild oncosim from source to get a fresh database and execute an mpijob run to confirm one way or the other.