eirrgang opened 3 years ago
Most items will be addressed (or at least touched) in our June 2021 release.
> virtual environment activation for ssh connections
Can you please remind me what the problem is in the context of ssh?
At this point, the most mysterious errors seem to be resolved. The remaining issues are related to failure modes. Successful Worker tasks can hang indefinitely (presumably waiting for `Master.result_cb()`) when the Master task fails.
The remediation seems to be to

```python
master.wait(state=[rp.states.AGENT_EXECUTING] + rp.FINAL)
assert master.state not in {rp.CANCELED, rp.FAILED}
```

before waiting on worker tasks.
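To make the failure mode concrete, here is a standalone sketch of the guard described above. The state names mirror the RP constants, but this simulation uses plain strings and a hypothetical helper (`check_master_usable` is not an existing scalems or RP function):

```python
# Hypothetical sketch of the guard described above: before blocking on
# worker tasks, confirm that the master reached a running state rather
# than a final failed/canceled state. Standalone simulation with plain
# strings standing in for rp.states constants.

AGENT_EXECUTING = "AGENT_EXECUTING"
FINAL = ["DONE", "FAILED", "CANCELED"]


def check_master_usable(master_state: str) -> None:
    """Raise if the master ended in a state the workers cannot survive."""
    if master_state in ("FAILED", "CANCELED"):
        raise RuntimeError(
            f"Master task reached {master_state}; do not wait on workers."
        )


check_master_usable("AGENT_EXECUTING")  # OK: master is running.
try:
    check_master_usable("FAILED")
except RuntimeError as e:
    print(f"caught: {e}")
```

The point of the assertion-before-wait pattern is that a failed master can never deliver `result_cb()` callbacks, so any subsequent blocking wait on worker tasks would hang indefinitely.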
One example of such a failure is when (an appropriate) scalems package is not properly installed in the agent environment (such as if the caller neglected to provide an appropriate `pre_exec`, or if the `prepare_env` failed), resulting either in an import error or a failure to find the `scalems_rp_master` console script on the PATH. (This example also illustrates the need to do some compatibility/version checking.)
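The sort of compatibility/version check mentioned above could be as simple as comparing the version reported by the agent-side package against the client's minimum. This is an illustrative sketch only (neither function exists in scalems):

```python
# Illustrative sketch (not an existing scalems API) of a minimal
# client/agent compatibility check: confirm the package found in the
# agent environment satisfies the client's minimum API version before
# dispatching work to it.

def parse_version(text: str) -> tuple:
    """Parse a dotted release string like '0.1.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(agent_version: str, required: str = "0.1") -> bool:
    """True if the agent-side package meets the client's minimum version."""
    required_tuple = parse_version(required)
    return parse_version(agent_version)[: len(required_tuple)] >= required_tuple


print(is_compatible("0.1.2"))  # True
print(is_compatible("0.0.9"))  # False
```

A real check would also need to handle pre-release tags and the case where the agent-side package is missing entirely (the import-error scenario above).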
I have added a `local.github` resource to the `resource_local.json` file we install in Docker and on GitHub Actions to allow our test workflow to use the static venv at all points.
I added a "local" access scheme to the `local.docker` resource definition so that we don't have to override `local.localhost` in order to use the static venv we prepare for Docker-based testing.
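For orientation, such a resource entry might look roughly like the following. The `virtenv_mode`, `virtenv`, and `rp_version` options are the static-venv settings discussed in this thread, but the surrounding structure and other field names are illustrative; consult RP's resource configuration files for the actual schema:

```json
{
    "docker": {
        "schemas": ["local", "ssh"],
        "virtenv_mode": "use",
        "virtenv": "/path/to/static/venv",
        "rp_version": "installed"
    }
}
```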
I am currently looking at how best to force users to specify valid venvs and target resources. We will need to add some documentation.
We should reference RP documentation where possible, but I could use some help here.
`Pilot.prepare_env()` (and `TaskDescription.named_env` vs. `TaskDescription.pre_exec`)

Note: `pre_exec` has traditionally run at the beginning of task launch from the Pilot agent, rather than after process launch through (e.g.) `mpiexec`. This could result in unexpected behavior, particularly when assuming that certain environment variables would be available as a result of `pre_exec`, or if there are differences in the MPI launch system in different environments.
A new RP update splits the task launch script from the task execution wrapper script, also splitting `pre_exec` into `pre_launch` (run at the point where `pre_exec` traditionally ran) and `pre_exec`, which now runs at the beginning of the task wrapper script as part of the actual task runtime. Additionally, entries from a `pre_rank` dictionary are composed into the script after `pre_exec` where the current rank matches a key in `pre_rank`.
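The ordering described above can be sketched as a standalone simulation. This is not RP's actual wrapper generation; it just demonstrates the composition rule (all `pre_exec` entries, then any `pre_rank` entries whose key matches the current rank):

```python
# Standalone simulation of the described ordering (not RP's actual
# wrapper generation): pre_launch belongs to the launch script, while
# pre_exec and any matching pre_rank entries are composed into the
# per-task execution wrapper, in that order.

def compose_wrapper(pre_exec, pre_rank, rank):
    """Return the wrapper-script lines for one rank."""
    lines = list(pre_exec)
    # pre_rank entries apply only where the current rank matches a key.
    lines += pre_rank.get(rank, [])
    return lines


pre_exec = ["source /path/to/venv/bin/activate"]
pre_rank = {0: ["export ROLE=master"], 1: ["export ROLE=worker"]}

print(compose_wrapper(pre_exec, pre_rank, 0))
# ['source /path/to/venv/bin/activate', 'export ROLE=master']
```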
Update: https://github.com/radical-cybertools/radical.pilot/issues/2589 has been closed, but I haven't checked what sort of feedback is given by failing `prepare_env` calls yet. We can resume testing of `test_rp_venv.py::test_prepare_venv` at some point soon, but we should confirm that failures are programmatically detectable before beginning to migrate back from `pre_exec` to `named_env`.
This is an umbrella issue for several features (and some possible bugs) related to virtual environment preparation directives:

- `Pilot.prepare_env`
- Additional use cases:
- Possible buggy areas:
  - `local` versus `ssh` access methods. See also https://github.com/radical-cybertools/radical.pilot/pull/2312
**Update, 19 July 2021**
For performance and control, the canonical use case should be a fully static venv configuration for the Pilot agent (and bootstrapping) interpreter, the remote RCT stack, and the executed Tasks. However, the default behavior should work for most users: a venv is created in the radical sandbox on the first connection (and reused if it already exists), and the RCT stack is updated within the Pilot sandbox for each session.
- In the case of non-RCT Python dependencies, the Pilot has an evolving `prepare_env` feature that can be used for a Task dependency (`named_env`) to provide a dynamically created venv with a list of requested packages.
- Locally prepared package distribution archives can be used, such as by staging with `Pilot.stage_in` before doing `prepare_env`.
- There is some work underway to place full venvs at run time, specifically to handle use cases in which it is important to run the Python stack from a local filesystem on an HPC compute node. So far, this is limited to use of `conda freeze`.
- Upcoming RP features will provide a mechanism for environment caching so that `module load`, `source $VENV/bin/activate`, etc. do not need to be repeated for every task. However, the current mechanisms for optimal (static) venv usage are `virtenv_mode=use, virtenv=/path/to/venv, rp_version=installed` in the RP resource definition, and `pre_exec`.
- The user (or client) is then responsible for maintaining venv(s) with the correct RCT stack (matching the API used by the client-side RCT stack), the scalems package, and any dependencies of the workflow.
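With the static-venv approach, the client-side bookkeeping reduces to deriving the activation command for each task's `pre_exec`. A minimal sketch, assuming a hypothetical helper and a placeholder module name (neither is part of scalems or RP):

```python
# Hypothetical helper (not part of scalems or RP): given the path of a
# statically prepared venv on the target resource, produce the pre_exec
# entries a task description would need to run inside it.

def venv_pre_exec(venv_path: str) -> list:
    """Activation command(s) for a task's pre_exec list."""
    return [f". {venv_path}/bin/activate"]


task_description = {
    "executable": "python3",
    "arguments": ["-m", "scalems_workflow"],  # placeholder module name
    "pre_exec": venv_pre_exec("/scratch/me/rp-venv"),
}
print(task_description["pre_exec"])
# ['. /scratch/me/rp-venv/bin/activate']
```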
Validation of the target venv dependencies is probably a long way off. In the near term, we probably want to pay special attention to ImportErrors and similar exceptions from scalems-managed Tasks. We might even want to consider a tutorial or something that includes a walk-through and troubleshooting guide for virtual environment preparation and validation (minimizing waste of HPC resources through appropriate use of trial jobs or local execution). The priority of this depends on the relative importance of extended Python-based workflows versus workflows that rely only on the scalems Python package and otherwise rely exclusively on command line executables.
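The near-term triage suggested above could be a simple heuristic over a failed task's stderr. This is a hypothetical helper, not an existing scalems facility:

```python
# Sketch of the suggested triage (hypothetical helper, not an existing
# scalems facility): scan a failed task's stderr for import failures,
# which usually indicate a venv problem rather than a workflow bug.

ENV_ERROR_MARKERS = ("ImportError", "ModuleNotFoundError")


def looks_like_env_failure(stderr_text: str) -> bool:
    """Heuristic: does this stderr suggest a broken task environment?"""
    return any(marker in stderr_text for marker in ENV_ERROR_MARKERS)


print(looks_like_env_failure(
    "Traceback (most recent call last):\n"
    "ModuleNotFoundError: No module named 'scalems'\n"
))  # True
```

Flagging such failures distinctly would let the client report "fix your venv" instead of a generic task failure, before more HPC time is spent.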
Dynamic evaluation of workflow requirements with reactive venv re-provisioning is probably well out of project scope. However, some amount of workflow dependency checking should be possible through some combination of currently available third-party tools or frameworks. Pipenv allows a virtual env to be hashed and checked. Spack is frequently used to define highly reproducible environments with easily reused recipes. Conda may provide more portable or verifiable installations than, say, `pip freeze`, but that is unclear to this author at this time. Docker and Singularity provide ways to prepare isolated and reusable environments that are easily portable within an HPC cluster, but not quite as portable from arbitrary client systems to arbitrary execution environments. Note that some amount of import dependency checking is inherent with static tools (linters, `flake8`, `mypy`) and, say, `doctest`.
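In the spirit of Pipenv's hash-and-check idea mentioned above, an environment fingerprint can be computed from the installed-distribution list alone. A standalone sketch using only the standard library (not how Pipenv actually computes its lock hash):

```python
# Sketch of the hash-and-check idea (standalone, stdlib-only; not
# Pipenv's actual lock hash): fingerprint an environment by hashing its
# sorted name==version list so two venvs can be compared cheaply.
import hashlib
from importlib import metadata


def environment_fingerprint() -> str:
    """SHA-256 over the sorted installed-distribution list."""
    entries = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    )
    return hashlib.sha256("\n".join(entries).encode()).hexdigest()


print(environment_fingerprint())
```

Comparing such fingerprints between the client venv and the agent venv would catch drift, though not ABI or platform mismatches.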