MilesCranmer / PySR

High-Performance Symbolic Regression in Python and Julia
https://ai.damtp.cam.ac.uk/pysr
Apache License 2.0

MPI support #698

Open MilesCranmer opened 3 months ago

MilesCranmer commented 3 months ago

So far, PySR's distributed support has relied on ClusterManagers.jl. This PR adds MPIClusterManagers.jl (and MPI.jl), which should make PySR more portable across clusters, since MPI is a widely supported standard.

@wkharold I'd be interested to hear if this works for your cluster. You can use it with:

model = PySRRegressor(multithreading=False, procs=num_nodes*num_cores, cluster_manager="mpi")

Note that the command runs mpirun internally, so you only need to launch the job once on the head node of a Slurm allocation, and it will "spread out" over the allocated nodes.
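For concreteness, a minimal sketch of a driver script built around the call above; the SLURM_* environment variables and the toy data are illustrative assumptions, not part of this PR:

import os
import numpy as np
from pysr import PySRRegressor

# Assumed: derive the worker count from standard Slurm environment variables.
num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
num_cores = int(os.environ.get("SLURM_CPUS_ON_NODE", "1"))

X = np.random.randn(100, 5)            # toy data, for illustration only
y = 2.0 * np.cos(X[:, 3]) + X[:, 0] ** 2

model = PySRRegressor(
    multithreading=False,              # use distributed processes, not threads
    procs=num_nodes * num_cores,       # total worker processes across the allocation
    cluster_manager="mpi",             # the new option added in this PR
)
model.fit(X, y)                        # launched once on the head node; mpirun handles the rest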

coveralls commented 3 months ago

Pull Request Test Coverage Report for Build 10395158160

Changes missing coverage:
  File                       Covered Lines   Changed/Added Lines   %
  pysr/julia_extensions.py   16              18                    88.89%
  pysr/julia_helpers.py      7               9                     77.78%
  Total                      26              30                    86.67%

Totals coverage status:
  Change from base Build 10346074402: -0.02%
  Covered Lines: 1149
  Relevant Lines: 1226

💛 - Coveralls
MilesCranmer commented 3 months ago

Probably also want to allow specifying MPI options like the hosts to run on.

wkharold commented 3 months ago

I'll give this a try today. Looks interesting. I assume PySRRegressor also supports cluster_manager="slurm". Both of these approaches are interesting/valuable. Note that MPI will be quite sensitive to the network topology.

wkharold commented 3 months ago

No MPI joy. I built the branch on a Cluster Toolkit deployed Slurm cluster. Those clusters use Open MPI and/or Intel MPI (which offers better stability/performance). It looks like PySR is using MPICH. The error is here
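One thing that might be worth trying is pointing MPI.jl at the system MPI instead of the bundled MPICH, via MPI.jl's documented MPIPreferences workflow. A hedged sketch, assuming PySR ≥ 0.17 with the juliacall backend; whether this branch actually picks the preference up is untested:

import pysr                       # initializes the juliacall-managed Julia project
from juliacall import Main as jl

# Assumed workaround: add MPIPreferences to the active Julia project and switch
# MPI.jl from the bundled MPICH to the system-provided Open MPI / Intel MPI.
jl.seval('import Pkg')
jl.seval('Pkg.add("MPIPreferences")')
jl.seval('using MPIPreferences')
jl.seval('MPIPreferences.use_system_binary()')
# Restart the Python process afterwards so MPI.jl reloads with the new preference.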

wkharold commented 3 months ago

Just doing Slurm did not succeed either. I got this error after doing a 0.19.3 install via pip.

MilesCranmer commented 3 months ago

I'm not sure about the MPI one, but I think the Slurm one should work, as I've usually been able to get it working on my cluster.

Things to keep in mind: PySR will run srun for you, so you only need to call the script a single time on the head node from within a Slurm allocation. It will internally dispatch using srun and set up the network of workers. In other words, it's a bit different from how MPI usually works, where you would launch the workers yourself.
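As a concrete sketch of that workflow (the SLURM_NTASKS fallback is an assumption; run the script once on the head node rather than wrapping it in srun yourself):

import os
from pysr import PySRRegressor

# Run once inside a Slurm allocation; PySR dispatches the workers via srun itself.
model = PySRRegressor(
    multithreading=False,
    procs=int(os.environ.get("SLURM_NTASKS", "1")),  # assumed: one worker per Slurm task
    cluster_manager="slurm",
)
# model.fit(X, y)  # then fit as usual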

Then, the error message you are seeing:

ArgumentError: Package SymbolicRegression not found in current path.
- Run `import Pkg; Pkg.add("SymbolicRegression")` to install the SymbolicRegression package.

This is strange because it means the workers are not activating the right environment. Do you know if the workers all have access to the same folder, across nodes? Or is it a different file system?

wkharold commented 3 months ago

I think the Slurm one should work, as I've usually been able to get it working on my cluster.

The trick is to set JULIA_PROJECT appropriately; otherwise, as you mentioned, the right environment is not activated.
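For example, a minimal sketch of that workaround; the project path here is hypothetical and must live on a filesystem visible from every node:

import os

# Assumed workaround: point every worker at the same Julia project before PySR starts.
os.environ["JULIA_PROJECT"] = "/shared/home/me/.julia/environments/pysr"  # hypothetical path

from pysr import PySRRegressor  # import only after setting the variable

Exporting JULIA_PROJECT in the job script before launching Python should have the same effect.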

There are a few more things that maybe I should create issues for:

wkharold commented 3 months ago

When running PySR from a container on a cluster (with the appropriate --bind mounts), srun needs to invoke julia via the container rather than directly from the host file system (since it won't be there).

According to the docs for Distributed.addprocs, the exename keyword argument specifies the name of the Julia executable. To be able to fully containerize PySR, there would need to be a way to pass exename through so that the workers launch julia from inside the container.