
RP virtual environment flexibility #90

Open eirrgang opened 3 years ago

eirrgang commented 3 years ago

This is an umbrella issue for several features (and some possible bugs) related to virtual environment preparation directives:

Additional use cases:

Possible buggy areas:

See also https://github.com/radical-cybertools/radical.pilot/pull/2312

Update, 19 July 2021

For performance and control, the canonical use case should be a fully static venv configuration for the Pilot agent (and bootstrapping) interpreter, the remote RCT stack, and the executed Tasks. However, the default behavior, in which a venv is created in the radical sandbox on the first connection (and reused if it already exists) and the RCT stack is updated within the Pilot sandbox for each session, should work for most users.

In the case of non-RCT Python dependencies, the Pilot has an evolving prepare_env feature that can be used, together with a Task dependency declaration (named_env), to provide a dynamically created venv with a list of requested packages.

Locally prepared package distribution archives can also be used, for example by staging them with Pilot.stage_in before calling prepare_env.
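
A rough sketch of how these pieces might fit together, assuming the current (and still evolving) RP API; the resource label, file paths, package list, and the exact prepare_env/stage_in signatures are illustrative assumptions rather than a definitive recipe:

```python
import radical.pilot as rp

session = rp.Session()
pmgr = rp.PilotManager(session=session)
pilot = pmgr.submit_pilots(rp.PilotDescription({'resource': 'local.localhost',
                                                'cores': 4,
                                                'runtime': 30}))

# Optionally stage a locally built distribution archive into the pilot sandbox
# (paths and directive form are illustrative).
pilot.stage_in([{'source': 'client:///tmp/scalems-0.0.1.tar.gz',
                 'target': 'pilot:///scalems-0.0.1.tar.gz',
                 'action': rp.TRANSFER}])

# Ask the agent to create a venv with the requested packages.
pilot.prepare_env(env_name='scalems_env',
                  env_spec={'type': 'virtualenv',
                            'setup': ['numpy', 'scalems']})

tmgr = rp.TaskManager(session=session)
tmgr.add_pilots(pilot)

# Run a Task in the dynamically prepared venv via named_env.
td = rp.TaskDescription()
td.executable = 'python3'
td.arguments = ['-c', 'import scalems; print(scalems.__version__)']
td.named_env = 'scalems_env'
task = tmgr.submit_tasks(td)
tmgr.wait_tasks(uids=[task.uid])

session.close()
```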

There is some work underway to place full venvs at run time, specifically to handle use cases in which it is important to run the Python stack from a local filesystem on an HPC compute node. So far, this is limited to use of conda freeze.

Upcoming RP features will provide a mechanism for environment caching so that module load, source $VENV/bin/activate, etc. do not need to be repeated for every task. However, the current mechanisms for optimal (static) venv usage are

  1. use virtenv_mode=use, virtenv=/path/to/venv, and rp_version=installed in the RP resource definition, and
  2. activate alternative Task venvs using pre_exec (see the sketch below).
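
For the second mechanism, a minimal sketch assuming the RP TaskDescription API (the venv path and workload are placeholders):

```python
import radical.pilot as rp

td = rp.TaskDescription()
# Activate a user-maintained Task venv on top of the static RP installation.
td.pre_exec = ['. /path/to/task_venv/bin/activate']
td.executable = 'python3'
td.arguments = ['my_workflow_script.py']   # placeholder workload
```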

The user (or client) is then responsible for maintaining venv(s) with the correct RCT stack (matching the API used by the client-side RCT stack), the scalems package, and any dependencies of the workflow.

Validation of the target venv dependencies is probably a long way off. In the near term, we probably want to pay special attention to ImportError and similar exceptions from scalems-managed Tasks. We might even want to consider a tutorial or similar document that walks through virtual environment preparation, validation, and troubleshooting (with minimal waste of HPC resources, e.g. by launching small trial jobs or executing locally). The priority of this depends on the relative importance of extended Python-based workflows versus workflows that rely only on the scalems Python package and otherwise rely exclusively on command line executables.
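
As one example of the kind of cheap pre-flight check such a walk-through could recommend, a sketch assuming the RP API (the helper function and venv path are hypothetical):

```python
import radical.pilot as rp

def scalems_importable(tmgr: rp.TaskManager, venv_path: str) -> bool:
    """Submit a trivial trial Task that fails fast if scalems cannot be
    imported in the target venv. (Hypothetical helper, for illustration.)"""
    td = rp.TaskDescription()
    td.pre_exec = [f'. {venv_path}/bin/activate']
    td.executable = 'python3'
    td.arguments = ['-c', 'import scalems; print(scalems.__version__)']
    task = tmgr.submit_tasks(td)
    tmgr.wait_tasks(uids=[task.uid])
    return task.state == rp.DONE
```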

Dynamic evaluation of workflow requirements with reactive venv re-provisioning is probably well out of project scope. However, some amount of workflow dependency checking should be possible through some combination of currently available third party tools or frameworks. Pipenv allows a virtual env to be hashed and checked. Spack is frequently used to define highly reproducible environments with easily reused recipes. Conda may provide more portable or verifiable installations than, say, pip freeze, but that is unclear to this author at this time. Docker and Singularity provide ways to prepare isolated and reusable environments that are easily portable within an HPC cluster, but not quite as portable from arbitrary client systems to arbitrary execution environments.

Note that some amount of import dependency checking is inherent in static analysis tools (linters, flake8, mypy) and, say, doctest.

andre-merzky commented 3 years ago

Most items will be addressed (or at least touched) in our June 2021 release.

> virtual environment activation for ssh connections

Can you please remind me what the problem is in the context of ssh?

eirrgang commented 3 years ago

> Can you please remind me what the problem is in the context of ssh?

At this point, the most mysterious errors seem to be resolved. The remaining issues are related to failure modes. Successful Worker tasks can hang indefinitely (presumably waiting for Master.result_cb()) when the Master task fails.

The remediation seems to be to

        # Wait until the master task has at least started executing, or has reached a final state.
        master.wait(state=[rp.states.AGENT_EXECUTING] + rp.FINAL)
        # If the master was canceled or failed, do not wait on (potentially hanging) worker tasks.
        assert master.state not in {rp.CANCELED, rp.FAILED}

before waiting on worker tasks.

One example of such a failure is when an appropriate scalems package is not properly installed in the agent environment (such as when the caller neglected to provide an appropriate pre_exec, or when prepare_env failed), resulting either in an import error or in a failure to find the scalems_rp_master console script on the PATH. (This example also illustrates the need for some compatibility/version checking.)

eirrgang commented 3 years ago

Updates

I have added a local.github resource to the resource_local.json file we install in Docker and on GitHub Actions to allow our test workflow to use the static venv at all points.

I added a "local" access scheme to the local.docker resource definition so that we don't have to override local.localhost in order to use the static venv we prepare for Docker-based testing.

I am currently looking at how best to force users to specify valid venvs and target resources. We will need to add some documentation.

We should reference RP documentation where possible, but I could use some help here.

eirrgang commented 3 years ago

Note: pre_exec has traditionally run at the beginning of task launch from the Pilot agent, rather than after the process launch through (e.g.) mpiexec. This could result in unexpected behavior, particularly if one assumes that certain environment variables will be available in the task as a result of pre_exec, or if the MPI launch system differs between environments.

A new RP update splits the task launch script from the task execution wrapper script, and correspondingly splits pre_exec into pre_launch (run at the point where pre_exec traditionally ran) and pre_exec, which now runs at the beginning of the task wrapper script as part of the actual task runtime. Additionally, entries from a pre_rank dictionary are composed into the script after pre_exec, where the current rank matches a key in pre_rank.
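
A minimal sketch of how these fields might appear on a TaskDescription, assuming the updated RP API (the workload, module name, venv path, and rank count are illustrative, and attribute names may vary with RP version):

```python
import radical.pilot as rp

td = rp.TaskDescription()
td.executable = 'python3'
td.arguments = ['my_mpi_script.py']        # placeholder workload
td.ranks = 4                               # attribute name may differ across RP versions

# Runs in the launch script, before the launch command (where pre_exec used to run).
td.pre_launch = ['module load openmpi']    # illustrative module name

# Runs at the start of the task wrapper script, i.e. in the actual task runtime.
td.pre_exec = ['. /path/to/venv/bin/activate']

# Per-rank commands, keyed by rank, composed into the script after pre_exec.
td.pre_rank = {0: ['echo "extra setup for rank 0"']}
```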

eirrgang commented 2 years ago

Update: https://github.com/radical-cybertools/radical.pilot/issues/2589 has been closed, but I haven't checked what sort of feedback is given by failing prepare_env calls yet. We can resume testing of test_rp_venv.py::test_prepare_venv at some point soon, but we should confirm that failures are programmatically detectable before beginning to migrate back from pre_exec to named_env.