Open giordano opened 3 months ago
I'm not sure this would work and I haven't played with it, but I was wondering if replacing
# hope the python environment was activated
CMD() {
process_worker_pool.py ...
}
with something like
# no need to activate the environment
CMD() {
$(python found with sys.executable) $(path to script found with shutil.which("process_worker_pool.py"))...
}
would be a viable option
@giordano i've wondered about that before. in some situations it will work but not all of them - I think it would be easy enough to start using it yourself in your own setups as a hack in your own code, and see how that works.
Alternatively, process_worker_pool.py is executable and contains a hard-coded reference to the relevant Python, so remove the whole python prefix from that and run it as if its just some other unix executable...?
I tried
diff --git a/parsl/executors/high_throughput/executor.py b/parsl/executors/high_throughput/executor.py
index 6918336..48bff4b 100644
--- a/parsl/executors/high_throughput/executor.py
+++ b/parsl/executors/high_throughput/executor.py
@@ -1,7 +1,9 @@
import logging
import math
import pickle
+import shutil
import subprocess
+import sys
import threading
import typing
import warnings
@@ -36,7 +38,8 @@ from parsl.utils import RepresentationMixin
logger = logging.getLogger(__name__)
-DEFAULT_LAUNCH_CMD = ("process_worker_pool.py {debug} {max_workers_per_node} "
+DEFAULT_LAUNCH_CMD = (f"{sys.executable} "
+ f"{shutil.which('process_worker_pool.py')} {{debug}} {{max_workers_per_node}} "
"-a {addresses} "
"-p {prefetch_capacity} "
"-c {cores_per_worker} "
as a quick and dirty hack (I didn't want to change too much code just for testing this solution) and it seems to work for my use case: I can run the pipeline without necessarily activating the environment. Even if this doesn't solve all problems, can something like this be a decent option?
Tagging @rjmello because he's working on something related, in the context of Globus Compute
I think we need to be careful about the principles behind this rather than adding complexity without understanding why changes work.
The patch @giordano has above makes this change:
It still uses $PATH
to discover process worker pool.
What are the situations where this improves things?
Why would process worker pool be running with the wrong Python command? Why would sys.executable not be the Python that is already hard-coded at the top of process_worker_pool by pip install
?
(for example, is it because the wrong process worker pool has been found on the path? in which case shutil.which('process_worker_pool.py')
still runs the wrong process worker pool.
I guess @giordano has an environment in which this improves things so it would be interesting to understand these questions in the context of that environment.
I guess @giordano has an environment in which this improves things so it would be interesting to understand these questions in the context of that environment.
I simply created a local venv
environment on the login node of a cluster and wanted to run the parsl pipeline on the compute node (using a HighThroughputExecutor
, see https://github.com/Parsl/parsl/issues/3542#issue-2429819496), but process_worker_pool.py
won't be in PATH
on the compute node if I don't manually activate with worker_init
the environment I ran the parsl pipeline from. I find that having to activate the same exact environment twice on both the login node (to start the pipeline) and on the compute node (to actually run it) is error-prone and not very user friendly.
Is your feature request related to a problem? Please describe.
When you submit a pipeline with Parsl from a login node, to be executed on a compute node, in the
worker_init
attribute of the executor provider you likely need to activate the Python virtual environment in which Parsl needs to run. But this is not very user friendly, as the user just had to activate the environment to launch the pipeline in the first place, so this is asking the user to repeat twice the same operation. I was told on Slack there have been some thoughts about automating this in the past, but not real work.Describe the solution you'd like
Ideally Parsl would automate the Python environment activation. The challenge is that answering the questions "what's the current environment? How do I activate it?" is very complicated in the general case, because it'll also depend on the package manager used.
Describe alternatives you've considered
The alternative is doing nothing, and let the user figure it out, but I find this situation a bit frustrating, the reason why I'd like to use Parsl is to simplify things, not to complicate them.
Additional context
N/A