INM-6 / multi-area-model

A large-scale spiking model of the vision-related areas of macaque cortex.

simulation environment of multi node #18

Open jiaduxie opened 3 years ago

jiaduxie commented 3 years ago

Do you have a recommended tutorial for configuring a multi-node NEST simulation environment? Even brief setup steps and a list of the required packages would help.

jarsi commented 3 years ago

Hi jiaduxie,

could you provide a bit more detail about the problems you are running into?

The Readme.md of this repository already covers some points on how to set up the environment and run the model.

The exact steps you have to take depend on the cluster you are using. What kind of job scheduler does it use? Is Python 3 available? Does it have MPI support? Have you already installed NEST?

A rough sketch:

1) Manually compile NEST on your cluster and make sure Python and MPI are supported. Do not use the conda version (it has neither MPI nor OpenMP support). Use an official release (NEST master has features which are not implemented in this repository yet). Depending on your cluster you may need to load some packages (e.g. Python, MPI, ...).

2) Make sure to install all Python packages listed in requirements.txt. Run: pip3 install -r requirements.txt. If the cluster does not allow this, try: pip3 install --user -r requirements.txt.

3) Tell the job scheduler on your system how to run the job. You will need to copy the file config_template.py to config.py. Change base_path to the absolute path of the multi-area-model repository and data_path to the path where you want to store the output of the simulation. Adapt the jobscript_template to your system; if it is SLURM, there is already an example in place which you can uncomment and try to use. Make sure to also load all packages that a multi-node simulation requires (e.g. MPI). Change submit_cmd from None to the command your job scheduler uses; for SLURM it is sbatch. (A sketch of such a config.py is shown below.)

4) Try to run run_example_fullscale.py. You should now be able to run python run_example_fullscale.py. This will set up the simulation environment, do all the preprocessing and finally submit the job to the cluster. Depending on your cluster you might want to change num_processes and local_num_threads.

If all of this works, you should be ready to run your own experiments.
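To make step 3 more concrete, here is a hedged sketch of what a config.py derived from config_template.py could look like on a SLURM cluster. The variable names (base_path, data_path, submit_cmd, jobscript_template) are the ones used by this repository; the concrete paths, SLURM directives and the launch command are only illustrative placeholders, not the exact template shipped with the code:

# Sketch of a config.py for a SLURM cluster; paths and SLURM options are placeholders.
base_path = '/home/user/multi-area-model'      # absolute path of this repository
data_path = '/scratch/user/multi-area-output'  # where simulation output is written

# Command the scheduler uses to submit jobs; for SLURM this is sbatch.
submit_cmd = 'sbatch'

# Job script template; adapt resource requests and module loads to your cluster.
# config_template.py ships a commented SLURM example to start from.
jobscript_template = """#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
# load the modules a multi-node run needs (e.g. an MPI module), then launch the
# repository's simulation script here (placeholder command):
srun python run_simulation.py
"""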

I hope this helps! Best, Jari

jiaduxie commented 3 years ago

Oh, thanks Jari. I have been running the model with NEST installed in a conda environment, and I did install the conda build of NEST with MPI support; does that still not work? If I try to install NEST from source, could the MPI I build manually conflict with the MPI already installed on the server? I'll come back to you if I have more questions.

jarsi commented 3 years ago

I am only aware of a conda version of NEST which does not have MPI support, but maybe an MPI-enabled build exists.

To check whether your NEST version supports MPI and OpenMP, could you run the following command in your environment and post the output:

python -c "import nest; nest.Simulate(1.)"

My conda-installed NEST reports in the SimulationManager::start_updating_ information that neither MPI nor OpenMP is available:

Sep 10 16:07:01 SimulationManager::start_updating_ [Info]:
    Number of local nodes: 0
    Simulation time (ms): 1
    Not using OpenMP
    Not using MPI

Concerning manual compilation: how did you try to compile NEST? Could you post the steps you have tried so far?

jiaduxie commented 3 years ago

I haven't started trying to compile manually yet. I ran the following in my conda environment, and the output is as follows:

$python -c "import nest; nest.Simulate(1.)"

Creating default RNGs
Creating new default global RNG

              -- N E S T --
  Copyright (C) 2004 The NEST Initiative

 Version: nest-2.18.0
 Built: Jan 27 2020 12:49:17

This program is provided AS IS and comes with NO WARRANTY. See the file LICENSE for details.

Problems or suggestions? Visit https://www.nest-simulator.org Type 'nest.help()' to find out more about NEST.

Sep 10 22:20:26 NodeManager::prepare_nodes [Info]: Preparing 0 nodes for simulation.

Sep 10 22:20:26 SimulationManager::start_updating_ [Info]:
    Number of local nodes: 0
    Simulation time (ms): 1
    Number of OpenMP threads: 1
    Number of MPI processes: 1

Sep 10 22:20:26 SimulationManager::run [Info]: Simulation finished.

jarsi commented 3 years ago

It seems alright. Have you installed the packages from requirements.txt? Have you tried running a simulation?

jiaduxie commented 3 years ago

Yes, I have installed the packages from requirements.txt. Can you check whether this is the right command to run a multi-node simulation?

mpirun -hostfile hostfile python run_example_downscaled.py

The hostfile contains the following:

work0 slots = 2
work1 slots = 2

jarsi commented 3 years ago

I have no experience with hostfiles, but it looks reasonable to me. Have you adjusted num_processes and local_num_threads in the sim_dict? Have you tried running it? Did it work?

The run_example_downscaled.py is meant to be run on a local machine, for example a laptop. If you would like to experiment on a compute cluster you should exchange M.simulation.simulate() with start_job(M.simulation.label, submit_cmd, jobscript_template) (see run_example_fullscale.py) and additionally import:

from start_jobs import start_job
from config import submit_cmd, jobscript_template

In this case you need to invoke the script serially:

python run_example.py

The parallelized part is then specified in the jobscript_template in config.py.
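To put the pieces together, a minimal sketch of such a modified run script could look like the following. It assumes the model object is created the same way as in run_example_fullscale.py; the scaling factors and simulation parameters are illustrative values, not a recommended configuration:

# Sketch of a cluster run script; see run_example_fullscale.py for the exact
# network and simulation parameters used in this repository.
from multiarea_model import MultiAreaModel
from start_jobs import start_job
from config import submit_cmd, jobscript_template

network_params = {'N_scaling': 0.01,    # downscaled network, illustrative values
                  'K_scaling': 0.01}
sim_params = {'t_sim': 1000.,
              'num_processes': 2,       # one MPI process per node
              'local_num_threads': 4}   # OpenMP threads per MPI process

M = MultiAreaModel(network_params, simulation=True, sim_spec=sim_params)

# Instead of M.simulation.simulate(), hand the job to the scheduler:
start_job(M.simulation.label, submit_cmd, jobscript_template)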

jiaduxie commented 3 years ago

Hi jarsi, if I run the complete model on a cluster of two servers, roughly how much memory does each machine need?

jarsi commented 3 years ago

The model consumes approximately 1 TB of memory. So with two servers each server would need to provide 500 GB.

jiaduxie commented 3 years ago

Okay, thank you. And when you run the entire model, how many servers do you use and how much memory does each one have?

jiaduxie commented 3 years ago

Hi jarsi, on what kind of system do you run multiple nodes in parallel? My system is Ubuntu, and I have not managed to configure SLURM properly. Do you have any guidance on configuring the environment?

jarsi commented 3 years ago

Hi, we do not set up the systems ourselves. We use, for example, JURECA at Forschungszentrum Juelich; it has everything we need already installed. What kind of system are you using?

jiaduxie commented 3 years ago

It is a Linux server; the distribution is Ubuntu. Besides running on JURECA, have you also run the model on an ordinary server of your own?

jiaduxie commented 3 years ago

Hi jarsi, I am now simulating a small network for testing on two machines and run it with the following command. It seems that the two machines each run on their own, without any interaction.

`mpirun.mpich -np 2 -host work0,work1 python ./multi_test.py`

In addition, have you run the multi-area model in your own cluster environment?

jarsi commented 3 years ago

This is weird. Have you adjusted the num_processes or local_num_threads variable in the sim_params dictionary? An example of how to do this is shown in the run_example_fullscale.py file. In your case you should set num_processes=2. These variables are needed in order to inform NEST about distributed computing.
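For a two-node run this could look like the following in sim_params (the thread count is only an example):

# sim_params entries relevant for a distributed run (illustrative thread count)
sim_params = {'num_processes': 2,       # matches mpirun -np 2, one process per node
              'local_num_threads': 4}   # OpenMP threads within each MPI process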

Maybe you could also post what is in your multi_test.py file?

I have run the model on a local cluster. I usually just need to modify the run_example_fullscale.py and config.py to my own needs.

jiaduxie commented 3 years ago

multi_test.py:

from nest import *

# use 4 virtual processes in total (MPI processes x threads per process)
SetKernelStatus({"total_num_virtual_procs": 4})

# Poisson input driving a small chain of four neurons, spikes written to file
pg = Create("poisson_generator", params={"rate": 50000.0})
n = Create("iaf_psc_alpha", 4)
sd = Create("spike_detector", params={"to_file": True})

Connect(pg, [n[0]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[0]], [n[1]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[1]], [n[2]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[2]], [n[3]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect(n, sd)

Simulate(100.0)

jarsi commented 3 years ago

This is difficult for me to debug. On my machine I can run this without errors, both with the conda-installed NEST (conda create --name nest_conda -c conda-forge 'nest-simulator=*=mpi_openmpi*' python) and with NEST compiled from source. I suspect there might be a problem with the host file. Unfortunately I do not know a lot about those; usually the system administrators take care of this.

On your machine, are you using any resource manager, e.g. SLURM, PBS/Torque, LSF, etc.? Or are you responsible for defining everything correctly via hostfiles? What kind of system are you using?

jiaduxie commented 3 years ago

The cluster environment I use is composed of nine ordinary server machines. The system is Linux, and the distribution is Debian. You run this model on a supercomputer, right? Have you ever run it in your own environment? Is it necessary to install the SLURM resource scheduling system? I ran into a lot of problems while installing SLURM, so I have not installed it.

jarsi commented 3 years ago

It is not necessary to install SLURM. But I have most experience with it as all clusters I have used so far had SLURM installed. Installing a resource manager is not trivial and should be the job of a system admin, not the user. Do you have a system administrator you could ask for help? How do other people run distributed jobs on this cluster?

Could you also try the following commands and report whether anything changes:

mpiexec -np 2 -host work0,work1 python ./multi_test.py

mpirun -np 2 -host work0,work1 python ./multi_test.py

jiaduxie commented 3 years ago

Because my cluster here is composed of ordinary servers, there is no resource scheduling system such as SLURM installed. The commands you suggested do not seem to complete the simulation either.

(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ mpiexec -np 2 -host work0,work1 python multi_test.py
bash: orted: command not found

ORTE was unable to reliably start one or more daemons. This usually is caused by:

jarsi commented 3 years ago

Just to make sure, you are using nest installed via conda, right?

What do the following commands give you:

conda list
which mpirun
which mpiexec
which mpirun.mpich

jiaduxie commented 3 years ago

Yes, I installed NEST under conda. I seem to have it installed:

(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ conda list
llvm-meta                 7.0.0                         0    conda-forge
matplotlib                3.3.0                    pypi_0    pypi
mpi                       1.0                     openmpi    conda-forge
mpi4py                    3.0.3            py38h246a051_1    conda-forge
ncurses                   6.2                  he1b5a44_1    conda-forge
nest-simulator            2.18.0    mpi_openmpi_py38h72811e1_7    conda-forge
nested-dict               1.61                     pypi_0    pypi
numpy                     1.19.1           py38h8854b6b_0    conda-forge
openmp                    7.0.0                h2d50403_0    conda-forge
openmpi                   4.0.4                hdf1f1ad_0    conda-forge
openssh                   8.3p1                h5957347_0    conda-forge
openssl                   1.1.1g               h516909a_1    conda-forge
pandas                    1.1.0            py38h950e882_0    conda-forge

(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ which mpirun
/home/work/anaconda3/envs/pynest_mpi/bin/mpirun
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ which mpiexec
/home/work/anaconda3/envs/pynest_mpi/bin/mpiexec
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ which mpirun.mpich

jarsi commented 3 years ago

Ok thanks, the output of the last command is missing.

Using conda list you can see that NEST is linked against Open MPI. This is one of many MPI libraries. The command mpirun.mpich, to my understanding, launches processes with the MPICH implementation of MPI. This is different from the Open MPI implementation that NEST is linked against, and the two are not compatible, as we can also see when you use mpirun.mpich. Both mpiexec and mpirun are installed inside your conda environment and should be compatible with NEST. I don't understand why you get the error message when using these.
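One way to check whether the launcher and the MPI library actually match, independently of NEST, is a tiny mpi4py test (mpi4py is listed in your conda environment above); this is just a diagnostic suggestion, not part of the model:

# Save as mpi_check.py (hypothetical file name) and launch it with, e.g.:
#   mpirun -np 2 -host work0,work1 python mpi_check.py
# If launcher and library match, both ranks report size 2; if every rank reports
# size 1, the processes run independently, just like the NEST instances above.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank", comm.Get_rank(), "of", comm.Get_size(),
      "-", MPI.Get_library_version().splitlines()[0])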

jarsi commented 3 years ago

Maybe you could also check the output of:

mpirun --version
mpirun.mpich --version

jiaduxie commented 3 years ago

(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ which mpirun.mpich
/usr/bin/mpirun.mpich
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ mpirun --version
mpirun (Open MPI) 4.0.4
Report bugs to http://www.open-mpi.org/community/help/
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ mpirun.mpich --version
HYDRA build details:
    Version:                                 3.3a2
    Release Date:                            Sun Nov 13 09:12:11 MST 2016
    CC:                                      gcc -Wl,-Bsymbolic-functions -Wl,-z,relro
    CXX:                                     g++ -Wl,-Bsymbolic-functions -Wl,-z,relro
    F77:                                     gfortran -Wl,-Bsymbolic-functions -Wl,-z,relro
    F90:                                     gfortran -Wl,-Bsymbolic-functions -Wl,-z,relro
    Configure options:                       '--disable-option-checking' '--prefix=/usr' '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--with-libfabric' '--enable-shared' '--enable-fortran=all' '--disable-rpath' '--disable-wrapper-rpath' '--sysconfdir=/etc/mpich' '--libdir=/usr/lib/x86_64-linux-gnu' '--includedir=/usr/include/mpich' '--docdir=/usr/share/doc/mpich' '--with-hwloc-prefix=system' '--enable-checkpointing' '--with-hydra-ckpointlib=blcr' 'CPPFLAGS= -Wdate-time -D_FORTIFY_SOURCE=2 -I/build/mpich-O9at2o/mpich-3.3~a2/src/mpl/include -I/build/mpich-O9at2o/mpich-3.3~a2/src/mpl/include -I/build/mpich-O9at2o/mpich-3.3~a2/src/openpa/src -I/build/mpich-O9at2o/mpich-3.3~a2/src/openpa/src -D_REENTRANT -I/build/mpich-O9at2o/mpich-3.3~a2/src/mpi/romio/include' 'CFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -Wformat -Werror=format-security -O2' 'CXXFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -Wformat -Werror=format-security -O2' 'FFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -O2' 'FCFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -O2' 'build_alias=x86_64-linux-gnu' 'MPICHLIB_CFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -Wformat -Werror=format-security' 'MPICHLIB_CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'MPICHLIB_CXXFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -Wformat -Werror=format-security' 'MPICHLIB_FFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong' 'MPICHLIB_FCFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'FC=gfortran' 'F77=gfortran' 'MPILIBNAME=mpich' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'LIBS=' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:       blcr
    Demux engines available:                 poll select

jarsi commented 3 years ago

I think the problem is that once the job starts to run on a node, the MPI library cannot be found, because PATH and LD_LIBRARY_PATH are not exported. Could you try the following:

mpirun --prefix /home/work/anaconda3/envs/pynest_mpi/bin -np 2 -host work0,work1 python ./multi_test.py

jarsi commented 3 years ago

Hi, have you made progress?

I think the problems you are seeing are related to your MPI libraries. As the conda NEST is compiled against Open MPI, you must also use Open MPI and not MPICH. This means mpirun should be the command you use. But we are seeing that this does not work. My guess is that once NEST starts to run on the nodes it does not find the correct MPI library, gets confused, and the NEST instances run independently because they do not know how to use MPI. According to the Open MPI FAQ you can try several things.

1) Specify which mpi library to use via --prefix. I think in my previous message there might have been an error in the prefix.

2) Specify which MPI library to use by invoking mpirun via its complete Open MPI path

3) Add the following to ~/.profile

export PATH=/home/work/anaconda3/envs/pynest_mpi/bin:$PATH
export LD_LIBRARY_PATH=/home/work/anaconda3/envs/pynest_mpi/:$LD_LIBRARY_PATH

Does any of these approaches work or change the error message?

jiaduxie commented 3 years ago

I've tried these and it still does not work. Did you install NEST with conda or compile it from source?

jarsi commented 3 years ago

I tried it with both and both worked on my machine. What is the output of the different approaches posted above?

jiaduxie commented 3 years ago

It seems that the job ran on work0, but work1 (Ubuntu 16) terminated it. Do you also use mpirun to run it? I am running under conda; my build from source is broken, many kernel functions are undefined, and it seems to be an old version.

(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ /home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 2 -host work0,work1 python ./multi_test.py

[INFO] [2020.10.30 21:26:1 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.10.30 21:26:1 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
[INFO] [2020.10.30 21:26:1 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.10.30 21:26:1 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[ubuntu16:18103] *** Process received signal ***
[ubuntu16:18103] Signal: Aborted (6)
[ubuntu16:18103] Signal code:  (-6)
[ubuntu16:18103] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fe0987a2890]
[ubuntu16:18103] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fe0983dde97]
[ubuntu16:18103] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fe0983df801]
[ubuntu16:18103] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x3039a)[0x7fe0983cf39a]
[ubuntu16:18103] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30412)[0x7fe0983cf412]
[ubuntu16:18103] [ 5] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7fe08ad39eb9]
[ubuntu16:18103] [ 6] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7fe08ad2c229]
[ubuntu16:18103] [ 7] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7fe08ad63666]
[ubuntu16:18103] [ 8] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7fe08ad22193]
[ubuntu16:18103] [ 9] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7fe08ad26a32]
[ubuntu16:18103] [10] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7fe08ad26e57]
[ubuntu16:18103] [11] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7fe08b779a40]
[ubuntu16:18103] [12] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7fe08bb764dc]
[ubuntu16:18103] [13] python(+0x1b2f24)[0x56225ad72f24]
[ubuntu16:18103] [14] python(_PyEval_EvalFrameDefault+0x4bd)[0x56225ad9c83d]
[ubuntu16:18103] [15] python(_PyFunction_Vectorcall+0x1b7)[0x56225ad89197]
[ubuntu16:18103] [16] python(_PyEval_EvalFrameDefault+0x71b)[0x56225ad9ca9b]
[ubuntu16:18103] [17] python(_PyEval_EvalCodeWithName+0x260)[0x56225ad87ff0]
[ubuntu16:18103] [18] python(+0x1f68ca)[0x56225adb68ca]
[ubuntu16:18103] [19] python(+0x139ffd)[0x56225acf9ffd]
[ubuntu16:18103] [20] python(PyVectorcall_Call+0x6e)[0x56225ad1ddee]
[ubuntu16:18103] [21] python(_PyEval_EvalFrameDefault+0x60fd)[0x56225ada247d]
[ubuntu16:18103] [22] python(_PyEval_EvalCodeWithName+0x260)[0x56225ad87ff0]
[ubuntu16:18103] [23] python(_PyFunction_Vectorcall+0x594)[0x56225ad89574]
[ubuntu16:18103] [24] python(_PyEval_EvalFrameDefault+0x4ea3)[0x56225ada1223]
[ubuntu16:18103] [25] python(_PyFunction_Vectorcall+0x1b7)[0x56225ad89197]
[ubuntu16:18103] [26] python(_PyEval_EvalFrameDefault+0x4bd)[0x56225ad9c83d]
[ubuntu16:18103] [27] python(_PyFunction_Vectorcall+0x1b7)[0x56225ad89197]
[ubuntu16:18103] [28] python(_PyEval_EvalFrameDefault+0x71b)[0x56225ad9ca9b]
[ubuntu16:18103] [29] python(_PyFunction_Vectorcall+0x1b7)[0x56225ad89197]
[ubuntu16:18103] *** End of error message ***

          -- N E S T --

Copyright (C) 2004 The NEST Initiative

Version: nest-2.18.0 Built: Jan 27 2020 12:49:17

This program is provided AS IS and comes with NO WARRANTY. See the file LICENSE for details.

Problems or suggestions? Visit https://www.nest-simulator.org

Type 'nest.help()' to find out more about NEST.

Oct 30 21:26:01 ModelManager::clear_models_ [Info]: Models will be cleared and parameters reset.

Oct 30 21:26:01 Network::create_rngs_ [Info]: Deleting existing random number generators

Oct 30 21:26:01 Network::create_rngs_ [Info]: Creating default RNGs

Oct 30 21:26:01 Network::create_grng_ [Info]: Creating new default global RNG

Oct 30 21:26:01 RecordingDevice::set_status [Info]: Data will be recorded to file and to memory.
[lyjteam-server][[20644,1],0][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(17) failed: Connection reset by peer (104)

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 1 with PID 18103 on node work1 exited on signal 6 (Aborted).

jarsi commented 3 years ago

Thanks for posting the output.

Just to be sure: you are using the conda NEST, and not a version that you compiled yourself? You should not mix those. And you do not run source nest_vars.sh, also not in your .bashrc, right? This is important.

Furthermore, I saw in your output that you are running NEST 2.18. There is a newer version, 2.20. Could you update the conda NEST and run the commands again? It might be a broken version, as you suggested.

jiaduxie commented 3 years ago

How do I update NEST under conda? And from which directory do I need to run nest_vars.sh?

(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ sudo find / -name nest_vars.sh

/usr/bin/nest_vars.sh
/home/work/anaconda3/envs/pynest_mpi/bin/nest_vars.sh
/home/work/anaconda3/envs/newnest/bin/nest_vars.sh
/home/work/anaconda3/envs/nest/bin/nest_vars.sh
/home/work/anaconda3/pkgs/nest-simulator-2.18.0-nompi_py37hf650cc7_107/bin/nest_vars.sh
/home/work/anaconda3/pkgs/nest-simulator-2.18.0-mpi_openmpi_py38h72811e1_7/bin/nest_vars.sh
/root/anaconda3/envs/pynest/bin/nest_vars.sh
/root/anaconda3/envs/PYNEST/bin/nest_vars.sh
/root/anaconda3/pkgs/nest-simulator-2.16.0-mpi_openmpi_py37h0bdc58b_1003/bin/nest_vars.sh
/root/anaconda3/pkgs/nest-simulator-2.18.0-mpi_openmpi_py38h72811e1_7/bin/nest_vars.sh
/var/lib/docker/overlay2/828de0c0d1756b1f9357a6838ef12af765ec4b2473455595c84455be4ce94df7/diff/opt/nest/bin/nest_vars.sh
/var/lib/docker/overlay2/6b0dca5c25f9f1e73db9727dcf091c47cf241af1a186d281c8dcfd96637dd261/diff/opt/nest/bin/nest_vars.sh
/var/lib/docker/overlay2/166e90235e8f160fc32faf5b5e98d21770b4c7addbed5ebb3f6186904393c35e/diff/opt/nest/bin/nest_vars.sh
jiaduxie commented 3 years ago

I can run the program with the following command, but it seems that work0 and work1 do not exchange any information; they run independently of each other.

(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ mpirun.mpich -np 2 -host work0,work1 python3 multi_test.py

[INFO] [2020.11.2 10:55:31 /build/nest-phUkz7/nest-2.20.0/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.2 10:55:31 /build/nest-phUkz7/nest-2.20.0/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG

          -- N E S T --

Copyright (C) 2004 The NEST Initiative

Version: nest-2.20.0 Built: Sep 19 2020 07:15:24

This program is provided AS IS and comes with NO WARRANTY. See the file LICENSE for details.

Problems or suggestions? Visit https://www.nest-simulator.org

Type 'nest.help()' to find out more about NEST.

Nov 02 10:55:31 ModelManager::clear_models_ [Info]: Models will be cleared and parameters reset.

Nov 02 10:55:31 Network::create_rngs_ [Info]: Deleting existing random number generators

Nov 02 10:55:31 Network::create_rngs_ [Info]: Creating default RNGs

Nov 02 10:55:31 Network::create_grng_ [Info]: Creating new default global RNG

Nov 02 10:55:31 RecordingDevice::set_status [Info]: Data will be recorded to file and to memory.

Nov 02 10:55:31 NodeManager::prepare_nodes [Info]: Preparing 8 nodes for simulation.

Nov 02 10:55:31 SimulationManager::start_updating_ [Info]:
    Number of local nodes: 8
    Simulation time (ms): 100
    Number of OpenMP threads: 2
    Number of MPI processes: 1

Nov 02 10:55:31 SimulationManager::run [Info]: Simulation finished.

:219: RuntimeWarning: compiletime version 3.6 of module 'pynestkernel' does not match runtime version 3.8
:219: RuntimeWarning: builtins.type size changed, may indicate binary incompatibility. Expected 880, got 864
[INFO] [2020.11.2 10:55:31 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.2 10:55:31 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG

              -- N E S T --
  Copyright (C) 2004 The NEST Initiative

 Version: nest-2.18.0
 Built: Jan 27 2020 12:49:17

 This program is provided AS IS and comes with
 NO WARRANTY. See the file LICENSE for details.

 Problems or suggestions?
   Visit https://www.nest-simulator.org

 Type 'nest.help()' to find out more about NEST.

Nov 02 10:55:31 ModelManager::clear_models_ [Info]: Models will be cleared and parameters reset.

Nov 02 10:55:31 Network::create_rngs_ [Info]: Deleting existing random number generators

Nov 02 10:55:31 Network::create_rngs_ [Info]: Creating default RNGs

Nov 02 10:55:31 Network::create_grng_ [Info]: Creating new default global RNG

Nov 02 10:55:31 RecordingDevice::set_status [Info]: Data will be recorded to file and to memory.

Nov 02 10:55:31 NodeManager::prepare_nodes [Info]: Preparing 8 nodes for simulation.

Nov 02 10:55:31 SimulationManager::start_updating_ [Info]:
    Number of local nodes: 8
    Simulation time (ms): 100
    Number of OpenMP threads: 2
    Number of MPI processes: 1

Nov 02 10:55:31 SimulationManager::run [Info]: Simulation finished.
jiaduxie commented 3 years ago

Do the files need to be shared between the two nodes (work0, work1)? Or should I create the same script under the same path on both nodes?

jarsi commented 3 years ago

The nest_vars.sh is only important if you compile NEST yourself. If you use the conda NEST you should not use or source nest_vars.sh. But I think you do not source it, so it is not important.

Do not use mpirun.mpich. It will always run independent copies of NEST, because NEST is built against Open MPI and does not know how to interact with MPICH. Try to get NEST running with mpirun. You only need one executable; the file sharing between the two nodes is done by Open MPI.

Does the output of

/home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 2 -host work0,work1 python ./multi_test.py

change?

jiaduxie commented 3 years ago

For now I have only placed multi_test.py in the directory on the work0 node. Below is the output:

(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ /home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 2 -host work0,work1 python /home/work/xjd/nest_multi_test/multi_test.py

python: can't open file '/home/work/xjd/nest_multi_test/multi_test.py': [Errno 2] No such file or directory

Primary job terminated normally, but 1 process returned

a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[58669,1],1]

Exit code: 2

jiaduxie commented 3 years ago

I want to reinstall conda and NEST. Which versions of conda and NEST are you using? Is the installation procedure the following?

With OpenMPI:

conda create --name ENVNAME -c conda-forge nest-simulator=*=mpi_openmpi*

The syntax for this install follows the pattern: nest-simulator=<version>=<build_string>

e.g.: conda create --name ENVNAME -c conda-forge nest-simulator=2.20=mpi_openmpi*

jarsi commented 3 years ago

I installed it this way, it should automatically install the most recent nest version. conda create --name nest_tmp -c conda-forge "nest-simulator=*=mpi_openmpi*"

I think the file multi_test.py cannot be found once the MPI processes on the nodes try to run it. With the following commands you can test whether MPI can access the multi_test.py file from both nodes:

/home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 1 -host work0 ls /home/work/xjd/nest_multi_test/multi_test.py
/home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 1 -host work1 ls /home/work/xjd/nest_multi_test/multi_test.py

Could you post the output of both commands?

jiaduxie commented 3 years ago

I re-installed conda and NEST (2.18.0) in another cluster environment; the MPI version is the latest. But it still does not seem to work. With the commands you gave last time, the multi_test.py file can be found from both nodes:

work01:
(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 1 -host work01 ls /home/work/xiejiadu/nest_multi_test/multi_test.py 
/home/work/xiejiadu/nest_multi_test/multi_test.py
(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 1 -host work02 ls /home/work/xiejiadu/nest_multi_test/multi_test.py 
/home/work/xiejiadu/nest_multi_test/multi_test.py
work02:
(pynest) work@work02:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 1 -host work01 ls /home/work/xiejiadu/nest_multi_test/multi_test.py
/home/work/xiejiadu/nest_multi_test/multi_test.py
(pynest) work@work02:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 1 -host work02 ls /home/work/xiejiadu/nest_multi_test/multi_test.py
/home/work/xiejiadu/nest_multi_test/multi_test.py
jiaduxie commented 3 years ago

Could you provide better test code for checking NEST multi-node distributed operation? It would also help if you could write down the steps for installing NEST on a cluster in a text file, so that I can finally solve this problem.

jarsi commented 3 years ago

The path in the output you just posted differs from the one you posted before, when the file could not be found. Maybe the path was simply wrong? Perhaps it works now with the updated path, i.e.:

/home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 2 -host work0,work1 python /home/work/xiejiadu/nest_multi_test/multi_test.py

I think the test code is fine. The problems are not NEST related; the problem seems to be the cluster. If you can talk to a system administrator of your cluster, you should do so. A system administrator knows the cluster better, has access to it and has more experience. It is difficult to debug this remotely.

jiaduxie commented 3 years ago

This cluster environment is just a small network composed of 10 servers in my team, so there is no administrator and no resource scheduling system installed.

(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01,work02 python /home/work/xiejiadu/nest_multi_test/multi_test.py

[INFO] [2020.11.3 3:25:11 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 3:25:11 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 3:25:11 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
[INFO] [2020.11.3 3:25:11 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[work02:46171] *** Process received signal ***
[work02:46171] Signal: Aborted (6)
[work02:46171] Signal code:  (-6)
[work02:46171] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7fd99d368730]
[work02:46171] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7fd99d1ca7bb]
[work02:46171] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7fd99d1b5535]
[work02:46171] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2240f)[0x7fd99d1b540f]
[work02:46171] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30102)[0x7fd99d1c3102]
[work02:46171] [ 5] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7fd99009ceb9]
[work02:46171] [ 6] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7fd99008f229]
[work02:46171] [ 7] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7fd9900c6666]
[work02:46171] [ 8] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7fd990085193]
[work02:46171] [ 9] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7fd990089a32]
[work02:46171] [10] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7fd990089e57]
[work02:46171] [11] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7fd990adba40]
[work02:46171] [12] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7fd990ed74dc]
[work02:46171] [13] python(+0x1b4924)[0x55b901f0b924]
[work02:46171] [14] python(_PyEval_EvalFrameDefault+0x4bf)[0x55b901f33bcf]
[work02:46171] [15] python(_PyFunction_Vectorcall+0x1b7)[0x55b901f20637]
[work02:46171] [16] python(_PyEval_EvalFrameDefault+0x71a)[0x55b901f33e2a]
[work02:46171] [17] python(_PyEval_EvalCodeWithName+0x260)[0x55b901f1f490]
[work02:46171] [18] python(+0x1f6bb9)[0x55b901f4dbb9]
[work02:46171] [19] python(+0x13a23d)[0x55b901e9123d]
[work02:46171] [20] python(PyVectorcall_Call+0x6f)[0x55b901eb4f2f]
[work02:46171] [21] python(_PyEval_EvalFrameDefault+0x5fc1)[0x55b901f396d1]
[work02:46171] [22] python(_PyEval_EvalCodeWithName+0x260)[0x55b901f1f490]
[work02:46171] [23] python(_PyFunction_Vectorcall+0x594)[0x55b901f20a14]
[work02:46171] [24] python(_PyEval_EvalFrameDefault+0x4e73)[0x55b901f38583]
[work02:46171] [25] python(_PyFunction_Vectorcall+0x1b7)[0x55b901f20637]
[work02:46171] [26] python(_PyEval_EvalFrameDefault+0x4bf)[0x55b901f33bcf]
[work02:46171] [27] python(_PyFunction_Vectorcall+0x1b7)[0x55b901f20637]
[work02:46171] [28] python(_PyEval_EvalFrameDefault+0x71a)[0x55b901f33e2a]
[work02:46171] [29] python(_PyFunction_Vectorcall+0x1b7)[0x55b901f20637]
[work02:46171] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: work01
  PID:        44557
  Message:    connect() to 192.168.204.122:1024 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
[work01:44552] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193

              -- N E S T --
  Copyright (C) 2004 The NEST Initiative

 Version: nest-2.18.0
 Built: Jan 27 2020 12:49:17

 This program is provided AS IS and comes with
 NO WARRANTY. See the file LICENSE for details.

 Problems or suggestions?
   Visit https://www.nest-simulator.org

 Type 'nest.help()' to find out more about NEST.

Nov 03 03:25:11 ModelManager::clear_models_ [Info]: 
    Models will be cleared and parameters reset.

Nov 03 03:25:11 Network::create_rngs_ [Info]: 
    Deleting existing random number generators

Nov 03 03:25:11 Network::create_rngs_ [Info]: 
    Creating default RNGs

Nov 03 03:25:11 Network::create_grng_ [Info]: 
    Creating new default global RNG

Nov 03 03:25:11 RecordingDevice::set_status [Info]: 
    Data will be recorded to file and to memory.
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 46171 on node work02 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
jiaduxie commented 3 years ago

I just used the mpirun installed on the system to run it. It seems to work, but the two processes run independently and there is no information exchange.

(pynest) work@work01:~/xiejiadu/nest_multi_test$ /usr/bin/mpirun -np 2 -host work01,work02 python /home/work/xiejiadu/nest_multi_test/multi_test.py

[INFO] [2020.11.3 3:41:32 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 3:41:32 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG

              -- N E S T --
  Copyright (C) 2004 The NEST Initiative

 Version: nest-2.18.0
 Built: Jan 27 2020 12:49:17

 This program is provided AS IS and comes with
 NO WARRANTY. See the file LICENSE for details.

 Problems or suggestions?
   Visit https://www.nest-simulator.org

 Type 'nest.help()' to find out more about NEST.

Nov 03 03:41:32 ModelManager::clear_models_ [Info]: 
    Models will be cleared and parameters reset.

Nov 03 03:41:32 Network::create_rngs_ [Info]: 
    Deleting existing random number generators

Nov 03 03:41:32 Network::create_rngs_ [Info]: 
    Creating default RNGs

Nov 03 03:41:32 Network::create_grng_ [Info]: 
    Creating new default global RNG

Nov 03 03:41:32 RecordingDevice::set_status [Info]: 
    Data will be recorded to file and to memory.

Nov 03 03:41:32 NodeManager::prepare_nodes [Info]: 
    Preparing 12 nodes for simulation.

Nov 03 03:41:32 SimulationManager::start_updating_ [Info]: 
    Number of local nodes: 12
    Simulation time (ms): 100
    Number of OpenMP threads: 4
    Number of MPI processes: 1

Nov 03 03:41:32 SimulationManager::run [Info]: 
    Simulation finished.
[INFO] [2020.11.3 3:41:32 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 3:41:32 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG

              -- N E S T --
  Copyright (C) 2004 The NEST Initiative

 Version: nest-2.18.0
 Built: Jan 27 2020 12:49:17

 This program is provided AS IS and comes with
 NO WARRANTY. See the file LICENSE for details.

 Problems or suggestions?
   Visit https://www.nest-simulator.org

 Type 'nest.help()' to find out more about NEST.

Nov 03 03:41:32 ModelManager::clear_models_ [Info]: 
    Models will be cleared and parameters reset.

Nov 03 03:41:32 Network::create_rngs_ [Info]: 
    Deleting existing random number generators

Nov 03 03:41:32 Network::create_rngs_ [Info]: 
    Creating default RNGs

Nov 03 03:41:32 Network::create_grng_ [Info]: 
    Creating new default global RNG

Nov 03 03:41:32 RecordingDevice::set_status [Info]: 
    Data will be recorded to file and to memory.

Nov 03 03:41:32 NodeManager::prepare_nodes [Info]: 
    Preparing 12 nodes for simulation.

Nov 03 03:41:32 SimulationManager::start_updating_ [Info]: 
    Number of local nodes: 12
    Simulation time (ms): 100
    Number of OpenMP threads: 4
    Number of MPI processes: 1

Nov 03 03:41:32 SimulationManager::run [Info]: 
    Simulation finished.
jiaduxie commented 3 years ago

I now have the multi_test.py file in the same directory on both nodes.

jarsi commented 3 years ago

I think this is an important piece of information:

WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: work01
  PID:        44557
  Message:    connect() to 192.168.204.122:1024 failed
  Error:      Operation now in progress (115)

It seems as if the nodes don't know how to communicate with each other. Maybe we can find a way to tell them. Could you run ip addr? This could hopefully give hints about which ways of communication between the nodes exist.

Additionally, can you check the output of ip addr on the nodes:

ssh work01
ip addr

and

ssh work02
ip addr
jiaduxie commented 3 years ago

Ah, have you ever run the multi-area model in your own environment? My current cluster environment is composed of 9 servers, each with 4 CPUs and 176 cores. I think TCP communication is fine; the machines are in the same LAN, and I have also configured password-free login.

(pynest) work@work01:~/xiejiadu/nest_multi_test$ ssh work02
Linux work02 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Nov  3 02:01:52 2020 from 192.168.112.31
work@work02:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp39s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b4:05:5d:50:9c:d0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.204.122/24 brd 192.168.204.255 scope global enp39s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::b605:5dff:fe50:9cd0/64 scope link 
       valid_lft forever preferred_lft forever
3: enp39s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b4:05:5d:50:9c:d1 brd ff:ff:ff:ff:ff:ff
4: enp39s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b4:05:5d:50:9c:d2 brd ff:ff:ff:ff:ff:ff
5: enp39s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b4:05:5d:50:9c:d3 brd ff:ff:ff:ff:ff:ff
(pynest) work@work02:~/xiejiadu/nest_multi_test$ ssh work01
Linux work01 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Nov  3 02:01:50 2020 from 192.168.112.31
work@work01:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp39s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b4:05:5d:48:4c:42 brd ff:ff:ff:ff:ff:ff
    inet 192.168.204.121/24 brd 192.168.204.255 scope global enp39s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::b605:5dff:fe48:4c42/64 scope link 
       valid_lft forever preferred_lft forever
3: enp39s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether b4:05:5d:48:4c:43 brd ff:ff:ff:ff:ff:ff
4: enp39s0f2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether b4:05:5d:48:4c:44 brd ff:ff:ff:ff:ff:ff
5: enp39s0f3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether b4:05:5d:48:4c:45 brd ff:ff:ff:ff:ff:ff
jarsi commented 3 years ago

Yes, the multi-area model runs without any problems.

I have never had such problems. The systems I use are ready and we do not need to worry about mpi communication and such.

Could you try:

/home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01,work02 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py

Here I added -mca btl_tcp_if_include enp39s0f0. I think this should make TCP use only the enp39s0f0 interface for communication; ip addr revealed the name of the interface.

jiaduxie commented 3 years ago

It doesn't seem to help; it is still the same error as before.

(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01,work02 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py

[INFO] [2020.11.3 6:19:35 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 6:19:35 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
[INFO] [2020.11.3 6:19:35 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 6:19:35 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[work02:47789] *** Process received signal ***
[work02:47789] Signal: Aborted (6)
[work02:47789] Signal code:  (-6)
[work02:47789] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f3297335730]
[work02:47789] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f32971977bb]
[work02:47789] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f3297182535]
[work02:47789] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2240f)[0x7f329718240f]
[work02:47789] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30102)[0x7f3297190102]
[work02:47789] [ 5] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7f328a069eb9]
[work02:47789] [ 6] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7f328a05c229]
[work02:47789] [ 7] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7f328a093666]
[work02:47789] [ 8] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7f328a052193]
[work02:47789] [ 9] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7f328a056a32]
[work02:47789] [10] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7f328a056e57]
[work02:47789] [11] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7f328aaa8a40]
[work02:47789] [12] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7f328aea44dc]
[work02:47789] [13] python(+0x1b4924)[0x561faf0fc924]
[work02:47789] [14] python(_PyEval_EvalFrameDefault+0x4bf)[0x561faf124bcf]
[work02:47789] [15] python(_PyFunction_Vectorcall+0x1b7)[0x561faf111637]
[work02:47789] [16] python(_PyEval_EvalFrameDefault+0x71a)[0x561faf124e2a]
[work02:47789] [17] python(_PyEval_EvalCodeWithName+0x260)[0x561faf110490]
[work02:47789] [18] python(+0x1f6bb9)[0x561faf13ebb9]
[work02:47789] [19] python(+0x13a23d)[0x561faf08223d]
[work02:47789] [20] python(PyVectorcall_Call+0x6f)[0x561faf0a5f2f]
[work02:47789] [21] python(_PyEval_EvalFrameDefault+0x5fc1)[0x561faf12a6d1]
[work02:47789] [22] python(_PyEval_EvalCodeWithName+0x260)[0x561faf110490]
[work02:47789] [23] python(_PyFunction_Vectorcall+0x594)[0x561faf111a14]
[work02:47789] [24] python(_PyEval_EvalFrameDefault+0x4e73)[0x561faf129583]
[work02:47789] [25] python(_PyFunction_Vectorcall+0x1b7)[0x561faf111637]
[work02:47789] [26] python(_PyEval_EvalFrameDefault+0x4bf)[0x561faf124bcf]
[work02:47789] [27] python(_PyFunction_Vectorcall+0x1b7)[0x561faf111637]
[work02:47789] [28] python(_PyEval_EvalFrameDefault+0x71a)[0x561faf124e2a]
[work02:47789] [29] python(_PyFunction_Vectorcall+0x1b7)[0x561faf111637]
[work02:47789] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: work01
  PID:        46244
  Message:    connect() to 192.168.204.122:1024 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
[work01:46239] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193

              -- N E S T --
  Copyright (C) 2004 The NEST Initiative

 Version: nest-2.18.0
 Built: Jan 27 2020 12:49:17

 This program is provided AS IS and comes with
 NO WARRANTY. See the file LICENSE for details.

 Problems or suggestions?
   Visit https://www.nest-simulator.org

 Type 'nest.help()' to find out more about NEST.

Nov 03 06:19:35 ModelManager::clear_models_ [Info]: 
    Models will be cleared and parameters reset.

Nov 03 06:19:35 Network::create_rngs_ [Info]: 
    Deleting existing random number generators

Nov 03 06:19:35 Network::create_rngs_ [Info]: 
    Creating default RNGs

Nov 03 06:19:35 Network::create_grng_ [Info]: 
    Creating new default global RNG

Nov 03 06:19:35 RecordingDevice::set_status [Info]: 
    Data will be recorded to file and to memory.
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 47789 on node work02 exited on signal 6 (Aborted).
jiaduxie commented 3 years ago

The above error is about Open MPI. In my understanding, Open MPI can only run multi-threaded on a single node and cannot be used across multiple nodes. Is distributed multi-node execution really done with the mpirun command?

jarsi commented 3 years ago

Open MPI distributes MPI processes. Here we distribute 2 MPI processes (-np 2) over two nodes. If you specify 4 virtual processes in NEST, NEST will understand that 2 of those are MPI processes and thus spawn 2 threads on each node.
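A small check along the lines of multi_test.py above makes that split visible (NEST 2.x kernel status keys; the script name is hypothetical):

# Save as vp_check.py and launch with: mpirun -np 2 -host work01,work02 python vp_check.py
# 4 virtual processes split over 2 MPI ranks -> 2 OpenMP threads per rank.
import nest

nest.SetKernelStatus({"total_num_virtual_procs": 4})
print("MPI processes:", nest.GetKernelStatus("num_processes"))
print("threads per process:", nest.GetKernelStatus("local_num_threads"))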