amusecode / amuse

Astrophysical Multipurpose Software Environment. This is the main repository for AMUSE
http://www.amusecode.org
Apache License 2.0

Running AMUSE without a (stable) network #1074

Open rieder opened 6 days ago

rieder commented 6 days ago

(see also #128)

AMUSE requires a stable network connection for MPI to communicate with its workers. When no network is available, the network is nonstandard (e.g. connecting via a VPN), or the network is unstable, AMUSE doesn't run well. This should probably be addressed in a better way than a command-line workaround (e.g. mpirun --mca btl_tcp_if_include lo0 -n 1 python test.py). Maybe we could choose the network to use from within AMUSE in some way, and show the available network options beforehand?
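To illustrate the "show the available network options" idea, here is a minimal sketch using only the Python standard library. The helper names `list_interfaces` and `pick_loopback` are hypothetical and not part of AMUSE, and the loopback names checked (`lo`, `lo0`) are an assumption about common Linux and macOS/BSD setups.

```python
import socket

# Names commonly used for the loopback interface (an assumption:
# "lo" on Linux, "lo0" on macOS and the BSDs).
LOOPBACK_NAMES = ("lo", "lo0")


def list_interfaces():
    """Return the names of the network interfaces known to the OS."""
    return [name for _index, name in socket.if_nameindex()]


def pick_loopback(names):
    """Return the first name that looks like a loopback interface, or None."""
    for name in names:
        if name in LOOPBACK_NAMES:
            return name
    return None


if __name__ == "__main__":
    names = list_interfaces()
    print("Available interfaces:", ", ".join(names))
    print("Loopback candidate:", pick_loopback(names))
```

A name returned by `pick_loopback` is the kind of value that would go after `btl_tcp_if_include` in the workaround above.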

LourensVeen commented 6 days ago

Just to clarify, the problem here is that AMUSE is running all on one machine, and that machine has a network connection to the outside world that is unstable, and even though AMUSE doesn't use that connection it still causes problems?

I can see how that could happen with OpenMPI trying to use every network connection it can find in parallel, including some that don't allow connecting back to the local host, and then that MCA parameter would help by telling it to ignore everything but the loopback interface. It's actually possible to have AMUSE add that option automatically, which would also solve the same problem I'm having with my somewhat exotic networking setup on Ubuntu. Of course, enabling that option by default would also make it impossible to run on multiple nodes of a cluster :smile:.
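As a rough sketch of what "adding that option automatically" could look like: Open MPI also reads MCA parameters from environment variables named `OMPI_MCA_<parameter>`, so the same setting can be expressed without touching the mpirun command line. The function `loopback_only_env` is hypothetical, not an AMUSE API, and the default interface name `lo` is an assumption (macOS uses `lo0`).

```python
import os

# Environment-variable form of "--mca btl_tcp_if_include <interface>".
MCA_VAR = "OMPI_MCA_btl_tcp_if_include"


def loopback_only_env(base_env, interface="lo"):
    """Return a copy of base_env that restricts Open MPI's TCP transport
    to the given interface, unless the user already set the variable."""
    env = dict(base_env)
    # setdefault keeps an explicit user choice intact.
    env.setdefault(MCA_VAR, interface)
    return env


if __name__ == "__main__":
    env = loopback_only_env(os.environ, interface="lo")
    print(MCA_VAR, "=", env[MCA_VAR])
```

For this to have any effect, the variable would need to be in the environment before MPI is initialised, for example via `os.environ.update(loopback_only_env(os.environ))` early in the script. Where exactly that would belong in AMUSE is not something this sketch settles.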

An option might indeed be to have AMUSE pop up some kind of network configuration tool, but then you don't want it to do that when running on a cluster either. We could inspect the environment and, if we don't find any evidence of running inside a SLURM job, assume we're on a single machine and configure MPI accordingly automatically. That would only break on a non-SLURM cluster; those are rare these days but may become more common again if Flux starts gaining ground...
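The environment check could be sketched roughly as follows. SLURM sets `SLURM_JOB_ID` inside a job, so its absence is taken here as a hint that we are on a single machine. The function names and the `SCHEDULER_MARKERS` tuple are hypothetical; other schedulers would need their own markers added.

```python
import os

# Environment variables whose presence suggests we are inside a batch job.
# Only SLURM is covered here; extending this tuple is left as an assumption.
SCHEDULER_MARKERS = ("SLURM_JOB_ID",)


def in_scheduler_job(env=None):
    """Return True if any known scheduler marker is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(marker) for marker in SCHEDULER_MARKERS)


def assume_single_machine(env=None):
    """Heuristic from the discussion: no scheduler evidence means local run."""
    return not in_scheduler_job(env)


if __name__ == "__main__":
    if assume_single_machine():
        print("No scheduler detected; restricting MPI to loopback is plausible.")
    else:
        print("Scheduler job detected; leaving the MPI network settings alone.")
```

Passing the environment in as an argument keeps the heuristic easy to test and makes it simple to add an explicit user override later.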