
RP virtual environment flexibility #90

Open eirrgang opened 3 years ago

eirrgang commented 3 years ago

This is an umbrella issue for several features (and some possible bugs) related to virtual environment preparation directives:

Additional use cases:

Possible buggy areas:

See also https://github.com/radical-cybertools/radical.pilot/pull/2312

Update, 19 July 2021

For performance and control, the canonical use case should be a fully static venv configuration for the Pilot agent (and bootstrapping) interpreter, the remote RCT stack, and the executed Tasks. However, the default behavior, in which a venv is created in the radical sandbox on the first connection (and reused if it already exists) and the RCT stack is updated within the Pilot sandbox for each session, should work for most users.

In the case of non-RCT Python dependencies, the Pilot has an evolving prepare_env feature that can be used, together with a Task dependency declaration (named_env), to provide a dynamically created venv with a list of requested packages.

Locally prepared package distribution archives can also be used, for example by staging them with Pilot.stage_in before calling prepare_env.
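
A rough sketch of how these pieces might fit together, assuming the current (and still evolving) RP API; the resource label, file paths, package list, and the exact prepare_env/stage_in signatures are illustrative assumptions rather than a definitive recipe:

```python
import radical.pilot as rp

session = rp.Session()
pmgr = rp.PilotManager(session=session)
pilot = pmgr.submit_pilots(rp.PilotDescription({'resource': 'local.localhost',
                                                'cores': 4,
                                                'runtime': 30}))

# Optionally stage a locally built distribution archive into the pilot sandbox
# (paths and directive form are illustrative).
pilot.stage_in([{'source': 'client:///tmp/scalems-0.0.1.tar.gz',
                 'target': 'pilot:///scalems-0.0.1.tar.gz',
                 'action': rp.TRANSFER}])

# Ask the agent to create a venv with the requested packages.
pilot.prepare_env(env_name='scalems_env',
                  env_spec={'type': 'virtualenv',
                            'setup': ['numpy', 'scalems']})

tmgr = rp.TaskManager(session=session)
tmgr.add_pilots(pilot)

# Run a Task in the dynamically prepared venv via named_env.
td = rp.TaskDescription()
td.executable = 'python3'
td.arguments = ['-c', 'import scalems; print(scalems.__version__)']
td.named_env = 'scalems_env'
task = tmgr.submit_tasks(td)
tmgr.wait_tasks(uids=[task.uid])

session.close()
```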

There is some work underway to place full venvs at run time, specifically to handle use cases in which it is important to run the Python stack from a local filesystem on an HPC compute node. So far, this is limited to use of conda freeze.

Upcoming RP features will provide a mechanism for environment caching so that module load, source $VENV/bin/activate, etc. do not need to be repeated for every task. However, the current mechanisms for optimal (static) venv usage are

  1. use virtenv_mode=use, virtenv=/path/to/venv, and rp_version=installed in the RP resource definition, and
  2. activate alternative Task venvs using pre_exec (see the sketch below).
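
For the second mechanism, a minimal sketch assuming the RP TaskDescription API (the venv path and workload are placeholders):

```python
import radical.pilot as rp

td = rp.TaskDescription()
# Activate a user-maintained Task venv on top of the static RP installation.
td.pre_exec = ['. /path/to/task_venv/bin/activate']
td.executable = 'python3'
td.arguments = ['my_workflow_script.py']   # placeholder workload
```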

The user (or client) is then responsible for maintaining venv(s) with the correct RCT stack (matching the API used by the client-side RCT stack), the scalems package, and any dependencies of the workflow.

Validation of the target venv dependencies is probably a long way off. In the near term, we probably want to pay special attention to ImportError and similar exceptions from scalems-managed Tasks. We might even want to consider a tutorial or similar document that walks through virtual environment preparation, validation, and troubleshooting (with minimal waste of HPC resources, e.g. by launching small trial jobs or executing locally). The priority of this depends on the relative importance of extended Python-based workflows versus workflows that rely only on the scalems Python package and otherwise rely exclusively on command line executables.
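
As one example of the kind of cheap pre-flight check such a walk-through could recommend, a sketch assuming the RP API (the helper function and venv path are hypothetical):

```python
import radical.pilot as rp

def scalems_importable(tmgr: rp.TaskManager, venv_path: str) -> bool:
    """Submit a trivial trial Task that fails fast if scalems cannot be
    imported in the target venv. (Hypothetical helper, for illustration.)"""
    td = rp.TaskDescription()
    td.pre_exec = [f'. {venv_path}/bin/activate']
    td.executable = 'python3'
    td.arguments = ['-c', 'import scalems; print(scalems.__version__)']
    task = tmgr.submit_tasks(td)
    tmgr.wait_tasks(uids=[task.uid])
    return task.state == rp.DONE
```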

Dynamic evaluation of workflow requirements with reactive venv re-provisioning is probably well out of project scope. However, some amount of workflow dependency checking should be possible through some combination of currently available third party tools or frameworks. Pipenv allows a virtual env to be hashed and checked. Spack is frequently used to define highly reproducible environments with easily reused recipes. Conda may provide more portable or verifiable installations than, say, pip freeze, but that is unclear to this author at this time. Docker and Singularity provide ways to prepare isolated and reusable environments that are easily portable within an HPC cluster, but not quite as portable from arbitrary client systems to arbitrary execution environments.

Note that some amount of import dependency checking is inherent in static analysis tools (linters, flake8, mypy) and, say, doctest.

andre-merzky commented 3 years ago

Most items will be addressed (or at least touched) in our June 2021 release.

> virtual environment activation for ssh connections

Can you please remind me what the problem is in the context of ssh?

eirrgang commented 3 years ago

> Can you please remind me what the problem is in the context of ssh?

At this point, the most mysterious errors seem to be resolved. The remaining issues are related to failure modes. Successful Worker tasks can hang indefinitely (presumably waiting for Master.result_cb()) when the Master task fails.

The remediation seems to be to

        # Wait until the master task has at least started executing, or has reached a final state.
        master.wait(state=[rp.states.AGENT_EXECUTING] + rp.FINAL)
        # If the master was canceled or failed, do not wait on (potentially hanging) worker tasks.
        assert master.state not in {rp.CANCELED, rp.FAILED}

before waiting on worker tasks.

One example of such a failure is when an appropriate scalems package is not properly installed in the agent environment (such as when the caller neglected to provide an appropriate pre_exec, or when prepare_env failed), resulting either in an import error or in a failure to find the scalems_rp_master console script on the PATH. (This example also illustrates the need for some compatibility/version checking.)

eirrgang commented 3 years ago

Updates

I have added a local.github resource to the resource_local.json file we install in Docker and on GitHub Actions to allow our test workflow to use the static venv at all points.

I added a "local" access scheme to the local.docker resource definition so that we don't have to override local.localhost in order to use the static venv we prepare for Docker-based testing.

I am currently looking at how best to force users to specify valid venvs and target resources. We will need to add some documentation.

We should reference RP documentation where possible, but I could use some help here.

eirrgang commented 3 years ago

Note: pre_exec has traditionally run at the beginning of task launch from the Pilot agent, rather than after the process launch through (e.g.) mpiexec. This could result in unexpected behavior, particularly if one assumes that certain environment variables will be available in the task as a result of pre_exec, or if the MPI launch system differs between environments.

A new RP update splits the task launch script from the task execution wrapper script, and correspondingly splits pre_exec into pre_launch (run at the point where pre_exec traditionally ran) and pre_exec, which now runs at the beginning of the task wrapper script as part of the actual task runtime. Additionally, entries from a pre_rank dictionary are composed into the script after pre_exec, where the current rank matches a key in pre_rank.
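
A minimal sketch of how these fields might appear on a TaskDescription, assuming the updated RP API (the workload, module name, venv path, and rank count are illustrative, and attribute names may vary with RP version):

```python
import radical.pilot as rp

td = rp.TaskDescription()
td.executable = 'python3'
td.arguments = ['my_mpi_script.py']        # placeholder workload
td.ranks = 4                               # attribute name may differ across RP versions

# Runs in the launch script, before the launch command (where pre_exec used to run).
td.pre_launch = ['module load openmpi']    # illustrative module name

# Runs at the start of the task wrapper script, i.e. in the actual task runtime.
td.pre_exec = ['. /path/to/venv/bin/activate']

# Per-rank commands, keyed by rank, composed into the script after pre_exec.
td.pre_rank = {0: ['echo "extra setup for rank 0"']}
```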

eirrgang commented 2 years ago

Update: https://github.com/radical-cybertools/radical.pilot/issues/2589 has been closed, but I haven't checked what sort of feedback is given by failing prepare_env calls yet. We can resume testing of test_rp_venv.py::test_prepare_venv at some point soon, but we should confirm that failures are programmatically detectable before beginning to migrate back from pre_exec to named_env.