Open MilesCranmer opened 3 months ago
Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
---|---|---|---
pysr/julia_extensions.py | 16 | 18 | 88.89%
pysr/julia_helpers.py | 7 | 9 | 77.78%
Total: | 26 | 30 | 86.67%

Totals | |
---|---|
Change from base Build 10346074402: | -0.02%
Covered Lines: | 1149
Relevant Lines: | 1226
Probably also want to allow specifying MPI options like the hosts to run on.
I'll give this a try today. Looks interesting. I assume `PySRRegressor` also supports `cluster_manager="slurm"`. Both of these approaches are interesting/valuable. Note that MPI will be quite sensitive to the network topology.
No MPI joy. I built the branch on a Cluster Toolkit-deployed Slurm cluster. Those clusters use Open MPI and/or Intel MPI (which offers better stability/performance). It looks like PySR is using MPICH. The error is here.

Just doing Slurm did not succeed either. I got this error after doing a 0.19.3 install via `pip`.
The MPI one I'm not sure about, but I think the Slurm one should* work, as I've usually been able to get it working on my cluster.

Things to keep in mind: PySR will run `srun` for you, so you just need to call the script a single time on the head node from within a Slurm allocation. It will internally dispatch using `srun` and set up the network of workers. I.e., it's a bit different from how MPI works, where you would manually launch the workers yourself.
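To make that launch pattern concrete, here is a minimal sketch (the script name, data, and resource numbers are placeholders, not from this PR):

```python
# train.py -- run ONCE, on the head node, from inside an existing Slurm
# allocation, e.g. (example numbers only):
#   salloc --nodes=2 --ntasks=64
#   python train.py
# PySR then calls `srun` itself to spawn the Julia workers across the
# allocation; you do not launch one copy per task as you would with mpirun.
import numpy as np
from pysr import PySRRegressor

X = np.random.randn(200, 5)
y = 2.0 * np.cos(X[:, 3]) + X[:, 0] ** 2

model = PySRRegressor(
    procs=64,                 # roughly match the number of tasks in the allocation
    cluster_manager="slurm",  # dispatch workers via srun
    niterations=40,
)
model.fit(X, y)
```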
Then, the error message you are seeing:
```
ArgumentError: Package SymbolicRegression not found in current path.
- Run `import Pkg; Pkg.add("SymbolicRegression")` to install the SymbolicRegression package.
```
This is strange because it means the workers are not activating the right environment. Do you know if the workers all have access to the same folder, across nodes? Or is it a different file system?
> I think the Slurm one should* work, as I've usually been able to get it working on my cluster.

The trick is to set `JULIA_PROJECT` appropriately; otherwise, as you mentioned, the right environment is not activated.
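For example, a minimal sketch of one way to set it (the path is a placeholder, and it also has to be visible to the spawned workers, e.g., by exporting it in the job script):

```python
import os

# Placeholder path -- it should live on a filesystem all worker nodes can see.
# Set it before importing pysr so the Julia session picks up the same project
# environment; export it in the Slurm job script as well so the srun-spawned
# workers inherit it.
os.environ["JULIA_PROJECT"] = "/shared/path/to/julia_project"

from pysr import PySRRegressor  # imported only after the environment is configured
```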
There are a few more things that maybe I should create issues for:

- passing `srun` switches, e.g., `--ntasks-per-node`, `--cpus-per-task`, etc.
- when running `pysr` from a container on a cluster (with the appropriate `--bind`s), `srun` needs to invoke julia via the container rather than directly from the file system (since it won't be there)
According to the docs for `Distributed.addprocs`, the `exename` keyword argument specifies the name of the Julia executable. To be able to fully containerize PySR, the julia launched by the container's runscript, e.g.,

```
%runscript:
    /opt/julia-1.10.4/bin/julia
```

would need to be what the `exename` keyword argument resolves to.
So far, the distributed support of PySR has relied on ClusterManagers.jl. This PR adds MPIClusterManagers.jl (and MPI.jl), which should make PySR more compatible across clusters, since MPI is standardized.
@wkharold I'd be interested to hear if this works for your cluster. You can use it with:
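(Sketch below; I'm assuming the new manager is exposed through the existing `cluster_manager` option under a value like `"mpi"`; check the PR diff for the actual name.)

```python
from pysr import PySRRegressor

model = PySRRegressor(
    procs=32,                # total number of MPI worker processes
    cluster_manager="mpi",   # assumed option value added by this PR
    niterations=40,
)
```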
Note the command runs `mpirun` internally, so you only need to launch the job on the head node of a Slurm allocation, and it will "spread out" over the job.