jiaduxie opened this issue 3 years ago
Hi jiaduxie,
could you provide a little bit more details about the problems you are running in?
The Readme.md of this repository already contains some pointers on how to set up the environment and run the model.
The exact steps you have to take depend on the cluster you are using. What kind of job scheduler does it use? Is Python 3 available? Does it have MPI support? Have you already installed NEST?
A rough sketch:
1) Manually compile NEST on your cluster; make sure Python and MPI are supported. Do not use the conda version (no MPI nor OpenMP support). Use an official release (NEST master has features which are not implemented in this repository yet). Depending on your cluster you may need to load some packages (e.g., Python, MPI...).
2) Make sure to install all python packages listed in requirements.txt.
Run:
pip3 install -r requirements.txt
If the cluster does not allow this, try:
pip3 install --user -r requirements.txt
3) Configure the job scheduler for your system
You will need to copy the file config_template.py to config.py. Change `base_path` to the absolute path of the multi-area-model repository and `data_path` to the path where you want to store the output of the simulation. Adapt the `jobscript_template` to your system; if it is SLURM, there is already an example in place which you can uncomment and try to use. Make sure to also load all packages that a multi-node simulation requires (e.g., MPI). Change `submit_cmd` from None to the command your job scheduler uses. For SLURM it is `sbatch`.
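To make step 3 concrete, here is a hypothetical sketch of what a filled-in config.py could look like on a SLURM cluster. The variable names (`base_path`, `data_path`, `jobscript_template`, `submit_cmd`) are the ones described above; the paths, the template body, and its placeholder fields are invented for illustration, not the repository's actual template.

```python
# Hypothetical config.py sketch for a SLURM cluster.
# All paths and the template body are made-up placeholders.
base_path = '/home/user/multi-area-model'  # absolute path of the repository
data_path = '/scratch/user/mam_output'     # where simulation output is written

# Illustrative jobscript template; the repository's config_template.py
# defines the real one. {num_processes} etc. are example placeholders.
jobscript_template = """#!/bin/bash
#SBATCH --nodes={num_processes}
#SBATCH --cpus-per-task={local_num_threads}
srun python {base_path}/run_simulation.py {label}
"""

submit_cmd = 'sbatch'  # SLURM's job submission command

# Filling in the placeholders for one hypothetical job:
script = jobscript_template.format(num_processes=2, local_num_threads=4,
                                   base_path=base_path, label='test_run')
```

The point is that config.py is plain Python: the preprocessing fills the template per job and hands the resulting script to `submit_cmd`.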
4) Try to run run_example_fullscale.py
Now you should be able to run `python run_example_fullscale.py`. This will set up the simulation environment, do all the preprocessing and finally submit the job to the cluster. Depending on your cluster you might want to change `num_processes` and `local_num_threads`.
If all of this works, you should be ready to run your own experiments.
I hope this helps! Best, Jari
Oh, thanks Jari. I run the model with NEST installed in a conda environment. But I installed the conda version of NEST with MPI support; does that also not work? If I try to install NEST from source, could the MPI I install manually conflict with the server's local MPI? I'll come back to you if I have more questions.
I am only aware of a conda version of NEST which does not have MPI support. But maybe it exists.
To check whether your NEST version supports MPI and OpenMP, could you run in your environment the following command and post the output:
python -c "import nest; nest.Simulate(1.)"
My conda-installed NEST reports in the startup information that neither MPI nor OpenMP is available:
Sep 10 16:07:01 SimulationManager::startupdating [Info]:
    Number of local nodes: 0
    Simulation time (ms): 1
    Not using OpenMP
    Not using MPI
Concerning manual compilation. How did you try to compile NEST? Could you post what steps you have tried so far?
I haven't started trying to compile manually yet. I ran the following in my conda environment, and the output is as follows:
$python -c "import nest; nest.Simulate(1.)"
Creating default RNGs
Creating new default global RNG

-- N E S T --
Copyright (C) 2004 The NEST Initiative
Version: nest-2.18.0 Built: Jan 27 2020 12:49:17
This program is provided AS IS and comes with NO WARRANTY. See the file LICENSE for details.
Problems or suggestions? Visit https://www.nest-simulator.org Type 'nest.help()' to find out more about NEST.
Sep 10 22:20:26 NodeManager::prepare_nodes [Info]: Preparing 0 nodes for simulation.
Sep 10 22:20:26 SimulationManager::startupdating [Info]:
    Number of local nodes: 0
    Simulation time (ms): 1
    Number of OpenMP threads: 1
    Number of MPI processes: 1
Sep 10 22:20:26 SimulationManager::run [Info]: Simulation finished.
It seems alright. Have you installed the packages from requirements.txt? Have you tried running a simulation?
Yes, I have installed the packages from requirements.txt. Can you help me check whether the command to execute a multi-node simulation should look like this?
mpirun -hostfile hostfile python run_example_downscaled.py
The hostfile is the following:
work0 slots=2
work1 slots=2
I have no experience with hostfiles, but it looks reasonable to me. Have you adjusted `num_processes` and `local_num_threads` in the sim_dict? Have you tried running it? Did it work?
The run_example_downscaled.py is meant to be run on a local machine, for example a laptop. If you would like to experiment on a compute cluster you should exchange `M.simulation.simulate()` with `start_job(M.simulation.label, submit_cmd, jobscript_template)` (see run_example_fullscale.py) and additionally import:
from start_jobs import start_job
from config import submit_cmd, jobscript_template
In this case you need to invoke the script serially:
python run_example.py
The parallelized part is then specified in the jobscript_template in config.py.
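For orientation, the `start_job` pattern boils down to rendering the jobscript template for a given simulation label and handing the result to the scheduler. The function below is a simplified stand-in written for this thread, not the repository's actual `start_jobs.start_job`; the real one also pulls paths and parameters from config.py and actually invokes the scheduler.

```python
import os
import tempfile

def start_job(label, submit_cmd, jobscript_template):
    """Render the jobscript for `label` and return the submit command line.

    Simplified illustration only: the real start_job submits the job
    (e.g. via subprocess) instead of returning the command string.
    """
    script = jobscript_template.format(label=label)
    path = os.path.join(tempfile.mkdtemp(), label + '.sh')
    with open(path, 'w') as f:
        f.write(script)
    return '{} {}'.format(submit_cmd, path)

# Hypothetical usage with a toy template:
cmd = start_job('fullscale_test', 'sbatch',
                '#!/bin/bash\n#SBATCH --job-name={label}\npython run.py {label}\n')
```

This is why the run script itself is invoked serially: the Python process only prepares and submits the job, and MPI parallelism starts inside the jobscript.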
Hi Jari, if we run the complete model on a cluster of two servers, roughly how much memory does each machine need?
The model consumes approximately 1 TB of memory. So with two servers each server would need to provide 500 GB.
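That arithmetic generalizes to other cluster sizes; assuming the network is spread roughly evenly over the machines, the total footprint divides evenly. A trivial sketch (the 1 TB figure is the approximation stated above, not a measured value):

```python
TOTAL_MEMORY_GB = 1000  # ~1 TB for the full model, as estimated above

def memory_per_node_gb(num_nodes):
    """Rough per-machine requirement, assuming an even spread across nodes."""
    return TOTAL_MEMORY_GB / num_nodes

# e.g. the two-server case discussed above, plus a few larger clusters
requirements = {n: memory_per_node_gb(n) for n in (2, 4, 8)}
```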
Okay, thank you. Then, when you run the entire model, how many servers do you use and how much memory does each one have?
Hi Jari, on what system are you running multiple nodes in parallel? My system is Ubuntu, and the SLURM configuration is not working well. Do you have any guidance on configuring the environment?
Hi, we do not set up the systems ourselves. We use, for example, JURECA at Forschungszentrum Juelich; it has everything we need already installed. What kind of system are you using?
It is a server under a Linux system; the release is Ubuntu. Besides running on JURECA, have you ever run it on an ordinary server of your own?
Hi Jari, I am now simulating a small network for testing on two machines and run it with the following command. It seems that the two machines each run by themselves, without interaction.
`mpirun.mpich -np 2 -host work0,work1 python ./multi_test.py`
Also, have you run the multi-area model in your own cluster environment?
This is weird. Have you adjusted the `num_processes` or `local_num_threads` variable in the `sim_params` dictionary? An example of how to do this is shown in the run_example_fullscale.py file. In your case you should set `num_processes=2`. These variables are needed in order to inform NEST about distributed computing.
Maybe you could also post what is in your multi_test.py file?
I have run the model on a local cluster. I usually just need to modify run_example_fullscale.py and config.py to my own needs.
multi_test.py:
from nest import *
SetKernelStatus({"total_num_virtual_procs": 4})
pg = Create("poisson_generator", params={"rate": 50000.0})
n = Create("iaf_psc_alpha", 4)
sd = Create("spike_detector", params={"to_file": True})
Connect(pg, [n[0]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[0]], [n[1]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[1]], [n[2]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[2]], [n[3]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect(n, sd)
Simulate(100.0)
This is difficult for me to debug. On my machine I can run this without running into errors. It works with the conda-installed NEST (`conda create --name nest_conda -c conda-forge 'nest-simulator=*=mpi_openmpi*' python`) and with NEST compiled from source. I suspect there might be a problem with the host file. Unfortunately I do not know a lot about those; usually the system administrators take care of this.
On your machine, are you using any resource manager, e.g., SLURM, PBS/Torque, LSF, etc.? Or are you responsible for defining everything correctly using hostfiles? What kind of system are you using?
The cluster environment I use is composed of nine ordinary server machines. The system is Linux, and the release is Debian. You run this model on a supercomputer, right? Have you ever run it in your own environment? Is it necessary to install the SLURM resource scheduling system? I ran into a lot of problems while installing SLURM, so I have not installed it.
It is not necessary to install SLURM. But I have most experience with it as all clusters I have used so far had SLURM installed. Installing a resource manager is not trivial and should be the job of a system admin, not the user. Do you have a system administrator you could ask for help? How do other people run distributed jobs on this cluster?
Could you also try the following commands and report whether something changes:
mpiexec -np 2 -host work0,work1 python ./multi_test.py
mpirun -np 2 -host work0,work1 python ./multi_test.py
Because my cluster environment here is composed of ordinary servers, with no resource scheduling system such as SLURM installed, the commands you suggested do not seem to complete the simulation:
ORTE was unable to reliably start one or more daemons. This usually is caused by:
* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.
Just to make sure, you are using nest installed via conda, right?
What do the following commands give you:
conda list
which mpirun
which mpiexec
which mpirun.mpich
Yes, I installed NEST under conda, and it seems to be installed with Open MPI:
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ conda list
llvm-meta                 7.0.0                         0    conda-forge
matplotlib                3.3.0                    pypi_0    pypi
mpi                       1.0                     openmpi    conda-forge
mpi4py                    3.0.3            py38h246a051_1    conda-forge
ncurses                   6.2                  he1b5a44_1    conda-forge
nest-simulator            2.18.0    mpi_openmpi_py38h72811e1_7    conda-forge
nested-dict               1.61                     pypi_0    pypi
numpy                     1.19.1           py38h8854b6b_0    conda-forge
openmp                    7.0.0                h2d50403_0    conda-forge
openmpi                   4.0.4                hdf1f1ad_0    conda-forge
openssh                   8.3p1                h5957347_0    conda-forge
openssl                   1.1.1g               h516909a_1    conda-forge
pandas                    1.1.0            py38h950e882_0    conda-forge
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ which mpirun
/home/work/anaconda3/envs/pynest_mpi/bin/mpirun
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ which mpiexec
/home/work/anaconda3/envs/pynest_mpi/bin/mpiexec
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ which mpirun.mpich
Ok thanks, the output of the last command is missing.
Using `conda list` you can see that NEST is linked against Open MPI. This is one of many MPI libraries. The command `mpirun.mpich`, to my understanding, instructs MPI to use the MPICH version of MPI. This is different from the Open MPI version that NEST is linked against. These two versions are not compatible, as we can also see when you use `mpirun.mpich`. Both `mpiexec` and `mpirun` are installed inside of your conda environment and should be compatible with NEST. I don't understand why you get the error message when using these.
Maybe you could also check the output of:
mpirun --version
mpirun.mpich --version
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ which mpirun.mpich
/usr/bin/mpirun.mpich
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ mpirun --version
mpirun (Open MPI) 4.0.4
Report bugs to http://www.open-mpi.org/community/help/
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ mpirun.mpich --version
HYDRA build details:
    Version: 3.3a2
    Release Date: Sun Nov 13 09:12:11 MST 2016
    CC: gcc -Wl,-Bsymbolic-functions -Wl,-z,relro
    CXX: g++ -Wl,-Bsymbolic-functions -Wl,-z,relro
    F77: gfortran -Wl,-Bsymbolic-functions -Wl,-z,relro
    F90: gfortran -Wl,-Bsymbolic-functions -Wl,-z,relro
    Configure options: '--disable-option-checking' '--prefix=/usr' '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--with-libfabric' '--enable-shared' '--enable-fortran=all' '--disable-rpath' '--disable-wrapper-rpath' '--sysconfdir=/etc/mpich' '--libdir=/usr/lib/x86_64-linux-gnu' '--includedir=/usr/include/mpich' '--docdir=/usr/share/doc/mpich' '--with-hwloc-prefix=system' '--enable-checkpointing' '--with-hydra-ckpointlib=blcr' 'CPPFLAGS= -Wdate-time -D_FORTIFY_SOURCE=2 -I/build/mpich-O9at2o/mpich-3.3~a2/src/mpl/include -I/build/mpich-O9at2o/mpich-3.3~a2/src/mpl/include -I/build/mpich-O9at2o/mpich-3.3~a2/src/openpa/src -I/build/mpich-O9at2o/mpich-3.3~a2/src/openpa/src -D_REENTRANT -I/build/mpich-O9at2o/mpich-3.3~a2/src/mpi/romio/include' 'CFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -Wformat -Werror=format-security -O2' 'CXXFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -Wformat -Werror=format-security -O2' 'FFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -O2' 'FCFLAGS= -g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -O2' 'build_alias=x86_64-linux-gnu' 'MPICHLIB_CFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -Wformat -Werror=format-security' 'MPICHLIB_CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'MPICHLIB_CXXFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong -Wformat -Werror=format-security' 'MPICHLIB_FFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong' 'MPICHLIB_FCFLAGS=-g -O2 -fdebug-prefix-map=/build/mpich-O9at2o/mpich-3.3~a2=. -fstack-protector-strong' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'FC=gfortran' 'F77=gfortran' 'MPILIBNAME=mpich' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'LIBS=' 'MPLLIBNAME=mpl'
    Process Manager: pmi
    Launchers available: ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available: hwloc
    Resource management kernels available: user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available: blcr
    Demux engines available: poll select
I think the problem is that once the jobs start to run on a node the MPI library cannot be found. This is because the `PATH` and `LD_LIBRARY_PATH` are not exported. Could you try the following:
mpirun --prefix /home/work/anaconda3/envs/pynest_mpi/bin -np 2 -host work0,work1 python ./multi_test.py
Hi, have you made progress?
I think the problems you are seeing are related to your MPI libraries. As the conda NEST is compiled against Open MPI, you must also use Open MPI and not MPICH. This means that `mpirun` should be the command to use. But we are seeing that this does not work. My guess is that once NEST starts to run on the nodes it does not find the correct MPI library, gets confused, and the NEST instances run independently because they do not know how to use MPI. According to the Open MPI FAQ you can try several things.
1) Specify which MPI library to use via `--prefix`. I think in my previous message there might have been an error in the prefix.
mpirun --prefix /home/work/anaconda3/envs/pynest_mpi -np 2 -host work0,work1 python ./multi_test.py
2) Specify which MPI library to use by giving the complete Open MPI path:
/home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 2 -host work0,work1 python ./multi_test.py
3) Add the following to ~/.profile:
export PATH=/home/work/anaconda3/envs/pynest_mpi/bin:$PATH
export LD_LIBRARY_PATH=/home/work/anaconda3/envs/pynest_mpi/lib:$LD_LIBRARY_PATH
Does any of these approaches work or change the error message?
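A quick, scheduler-independent way to check which launcher a job started from this environment will pick up is to resolve the launcher names against PATH exactly as the shell does. This is just a diagnostic sketch using the standard library; run it inside the activated conda environment (and, ideally, on each node) to see whether `mpirun` points into the conda environment or somewhere else:

```python
import shutil

# Resolve each launcher name against PATH, the same way the shell would.
# If 'mpirun' does not resolve to a path inside the conda environment,
# the PATH export above has not taken effect in this session.
launchers = ('mpirun', 'mpiexec', 'mpirun.mpich')
resolved = {name: shutil.which(name) for name in launchers}
for name, path in resolved.items():
    print(name, '->', path)
```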
I've tried it and it still does not work. Did you use conda to install NEST or compile it from source code?
I tried it with both and both worked on my machine. What is the output of the different approaches posted above?
It seems that I was running on work0, but work1 (Ubuntu 16) terminated the job. Do you also use mpirun to run it? I am running under conda; the version I compiled from source is broken. There are many kernel functions that are not defined, and it seems to be an old version.
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ /home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 2 -host work0,work1 python ./multi_test.py
[INFO] [2020.10.30 21:26:1 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::createrngs] : Creating default RNGs
[INFO] [2020.10.30 21:26:1 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::creategrng] : Creating new default global RNG
[INFO] [2020.10.30 21:26:1 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::createrngs] : Creating default RNGs
[INFO] [2020.10.30 21:26:1 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::creategrng] : Creating new default global RNG
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[ubuntu16:18103] *** Process received signal ***
[ubuntu16:18103] Signal: Aborted (6)
[ubuntu16:18103] Signal code: (-6)
[ubuntu16:18103] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fe0987a2890]
[ubuntu16:18103] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fe0983dde97]
[ubuntu16:18103] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fe0983df801]
[ubuntu16:18103] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x3039a)[0x7fe0983cf39a]
[ubuntu16:18103] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30412)[0x7fe0983cf412]
[ubuntu16:18103] [ 5] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7fe08ad39eb9]
[ubuntu16:18103] [ 6] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7fe08ad2c229]
[ubuntu16:18103] [ 7] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7fe08ad63666]
[ubuntu16:18103] [ 8] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7fe08ad22193]
[ubuntu16:18103] [ 9] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7fe08ad26a32]
[ubuntu16:18103] [10] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7fe08ad26e57]
[ubuntu16:18103] [11] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7fe08b779a40]
[ubuntu16:18103] [12] /home/work/anaconda3/envs/pynest_mpi/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7fe08bb764dc]
[ubuntu16:18103] [13] python(+0x1b2f24)[0x56225ad72f24]
[ubuntu16:18103] [14] python(_PyEval_EvalFrameDefault+0x4bd)[0x56225ad9c83d]
[ubuntu16:18103] [15] python(_PyFunction_Vectorcall+0x1b7)[0x56225ad89197]
[ubuntu16:18103] [16] python(_PyEval_EvalFrameDefault+0x71b)[0x56225ad9ca9b]
[ubuntu16:18103] [17] python(_PyEval_EvalCodeWithName+0x260)[0x56225ad87ff0]
[ubuntu16:18103] [18] python(+0x1f68ca)[0x56225adb68ca]
[ubuntu16:18103] [19] python(+0x139ffd)[0x56225acf9ffd]
[ubuntu16:18103] [20] python(PyVectorcall_Call+0x6e)[0x56225ad1ddee]
[ubuntu16:18103] [21] python(_PyEval_EvalFrameDefault+0x60fd)[0x56225ada247d]
[ubuntu16:18103] [22] python(_PyEval_EvalCodeWithName+0x260)[0x56225ad87ff0]
[ubuntu16:18103] [23] python(_PyFunction_Vectorcall+0x594)[0x56225ad89574]
[ubuntu16:18103] [24] python(_PyEval_EvalFrameDefault+0x4ea3)[0x56225ada1223]
[ubuntu16:18103] [25] python(_PyFunction_Vectorcall+0x1b7)[0x56225ad89197]
[ubuntu16:18103] [26] python(_PyEval_EvalFrameDefault+0x4bd)[0x56225ad9c83d]
[ubuntu16:18103] [27] python(_PyFunction_Vectorcall+0x1b7)[0x56225ad89197]
[ubuntu16:18103] [28] python(_PyEval_EvalFrameDefault+0x71b)[0x56225ad9ca9b]
[ubuntu16:18103] [29] python(_PyFunction_Vectorcall+0x1b7)[0x56225ad89197]
[ubuntu16:18103] *** End of error message ***
-- N E S T --
Copyright (C) 2004 The NEST Initiative
Version: nest-2.18.0 Built: Jan 27 2020 12:49:17
This program is provided AS IS and comes with NO WARRANTY. See the file LICENSE for details.
Problems or suggestions? Visit https://www.nest-simulator.org
Type 'nest.help()' to find out more about NEST.
Oct 30 21:26:01 ModelManager::clearmodels [Info]: Models will be cleared and parameters reset.
Oct 30 21:26:01 Network::createrngs [Info]: Deleting existing random number generators
Oct 30 21:26:01 Network::createrngs [Info]: Creating default RNGs
Oct 30 21:26:01 Network::creategrng [Info]: Creating new default global RNG
mpirun noticed that process rank 1 with PID 18103 on node work1 exited on signal 6 (Aborted).
Thanks for posting the output.
Just to be sure: you are using conda NEST, and not a version that you compiled yourself? You should not mix those. You do not run `source nest_vars.sh`, also not in your .bashrc, right? This is important.
Furthermore I saw in your output that you are running nest 2.18. There is a newer version, 2.20. Could you update the conda nest and run the commands again? It might be a broken version, as you suggested.
How do I update NEST under conda? In which directory do I need to run nest_vars.sh?
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ sudo find / -name nest_vars.sh
/usr/bin/nest_vars.sh
/home/work/anaconda3/envs/pynest_mpi/bin/nest_vars.sh
/home/work/anaconda3/envs/newnest/bin/nest_vars.sh
/home/work/anaconda3/envs/nest/bin/nest_vars.sh
/home/work/anaconda3/pkgs/nest-simulator-2.18.0-nompi_py37hf650cc7_107/bin/nest_vars.sh
/home/work/anaconda3/pkgs/nest-simulator-2.18.0-mpi_openmpi_py38h72811e1_7/bin/nest_vars.sh
/root/anaconda3/envs/pynest/bin/nest_vars.sh
/root/anaconda3/envs/PYNEST/bin/nest_vars.sh
/root/anaconda3/pkgs/nest-simulator-2.16.0-mpi_openmpi_py37h0bdc58b_1003/bin/nest_vars.sh
/root/anaconda3/pkgs/nest-simulator-2.18.0-mpi_openmpi_py38h72811e1_7/bin/nest_vars.sh
/var/lib/docker/overlay2/828de0c0d1756b1f9357a6838ef12af765ec4b2473455595c84455be4ce94df7/diff/opt/nest/bin/nest_vars.sh
/var/lib/docker/overlay2/6b0dca5c25f9f1e73db9727dcf091c47cf241af1a186d281c8dcfd96637dd261/diff/opt/nest/bin/nest_vars.sh
/var/lib/docker/overlay2/166e90235e8f160fc32faf5b5e98d21770b4c7addbed5ebb3f6186904393c35e/diff/opt/nest/bin/nest_vars.sh
I can run the program with the following command, but it seems that work0 and work1 exchange no information; each runs independently of the other.
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ mpirun.mpich -np 2 -host work0,work1 python3 multi_test.py
[INFO] [2020.11.2 10:55:31 /build/nest-phUkz7/nest-2.20.0/nestkernel/rng_manager.cpp:217 @ Network::createrngs] : Creating default RNGs
[INFO] [2020.11.2 10:55:31 /build/nest-phUkz7/nest-2.20.0/nestkernel/rng_manager.cpp:260 @ Network::creategrng] : Creating new default global RNG
-- N E S T --
Copyright (C) 2004 The NEST Initiative
Version: nest-2.20.0 Built: Sep 19 2020 07:15:24
This program is provided AS IS and comes with NO WARRANTY. See the file LICENSE for details.
Problems or suggestions? Visit https://www.nest-simulator.org
Type 'nest.help()' to find out more about NEST.
Nov 02 10:55:31 ModelManager::clearmodels [Info]: Models will be cleared and parameters reset.
Nov 02 10:55:31 Network::createrngs [Info]: Deleting existing random number generators
Nov 02 10:55:31 Network::createrngs [Info]: Creating default RNGs
Nov 02 10:55:31 Network::creategrng [Info]: Creating new default global RNG
Nov 02 10:55:31 RecordingDevice::set_status [Info]: Data will be recorded to file and to memory.
Nov 02 10:55:31 NodeManager::prepare_nodes [Info]: Preparing 8 nodes for simulation.
Nov 02 10:55:31 SimulationManager::startupdating [Info]:
    Number of local nodes: 8
    Simulation time (ms): 100
    Number of OpenMP threads: 2
    Number of MPI processes: 1
Nov 02 10:55:31 SimulationManager::run [Info]: Simulation finished.
Do you want to share files between the two nodes (work0 and work1)? Or should I create the same executable file under the same path on both nodes?
The nest_vars.sh is only important if you compile NEST yourself. If you use conda NEST you should not use or source nest_vars.sh. But I think you do not source it, so it's not important.
Do not use `mpirun.mpich`. It will always run independent copies of NEST. This is because NEST uses Open MPI and does not know how to interact with MPICH. Try to get NEST running with `mpirun`. You only need one executable; the file sharing between the two nodes is done by Open MPI.
Does the output of:
/home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 2 -host work0,work1 python ./multi_test.py
change?
I have only placed multi_test.py in the work0 node directory for now. Below is the output:
(pynest_mpi) work@lyjteam-server:~/xjd/nest_multi_test$ /home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 2 -host work0,work1 python /home/work/xjd/nest_multi_test/multi_test.py
python: can't open file '/home/work/xjd/nest_multi_test/multi_test.py': [Errno 2] No such file or directory
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[58669,1],1]
Exit code: 2
I want to reinstall conda and NEST. Which versions of conda and NEST are you using? Is the installation tutorial the following?
With OpenMPI:
conda create --name ENVNAME -c conda-forge nest-simulator=*=mpi_openmpi*
The syntax for this install follows the pattern nest-simulator=&lt;version&gt;=&lt;build&gt;, e.g.:
conda create --name ENVNAME -c conda-forge nest-simulator=2.20=mpi_openmpi*
I installed it this way, it should automatically install the most recent nest version.
conda create --name nest_tmp -c conda-forge "nest-simulator=*=mpi_openmpi*"
I think the file multi_test.py cannot be found once the MPI processes on the nodes try to run it. With this command you can test whether MPI can access the multi_test.py file from both nodes:
/home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 1 -host work0 ls /home/work/xjd/nest_multi_test/multi_test.py
/home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 1 -host work1 ls /home/work/xjd/nest_multi_test/multi_test.py
Could you post the output of both commands?
I re-installed conda and NEST (2.18.0) in another cluster environment; the MPI version is the latest. But it still does not seem to work. Using the commands you sent last time, the multi_test.py file can be seen from both nodes:
work01:
(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 1 -host work01 ls /home/work/xiejiadu/nest_multi_test/multi_test.py
/home/work/xiejiadu/nest_multi_test/multi_test.py
(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 1 -host work02 ls /home/work/xiejiadu/nest_multi_test/multi_test.py
/home/work/xiejiadu/nest_multi_test/multi_test.py
work02:
(pynest) work@work02:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 1 -host work01 ls /home/work/xiejiadu/nest_multi_test/multi_test.py
/home/work/xiejiadu/nest_multi_test/multi_test.py
(pynest) work@work02:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 1 -host work02 ls /home/work/xiejiadu/nest_multi_test/multi_test.py
/home/work/xiejiadu/nest_multi_test/multi_test.py
Can you provide better test code for testing NEST multi-node distributed operation? It would be best to write the steps for installing NEST on a cluster in a text file, to help me solve this still-unsolved problem.
The path in the output you just posted differs from the one you posted before, when the file could not be found. Maybe the path was just wrong? Maybe it works now when you use the updated path, i.e.:
/home/work/anaconda3/envs/pynest_mpi/bin/mpirun -np 2 -host work0,work1 python /home/work/xiejiadu/nest_multi_test/multi_test.py
I think the test code is fine. The problems are not NEST related; the problem seems to be the cluster. If you can talk to a system administrator of your cluster, you should do so. A system administrator knows the cluster better, has access to it, and has more experience. It is difficult to debug this remotely.
Because this cluster environment is a small network composed of 10 servers in my team, there is no administrator and no resource scheduling system installed.
(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01,work02 python /home/work/xiejiadu/nest_multi_test/multi_test.py
[INFO] [2020.11.3 3:25:11 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 3:25:11 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 3:25:11 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
[INFO] [2020.11.3 3:25:11 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[work02:46171] *** Process received signal ***
[work02:46171] Signal: Aborted (6)
[work02:46171] Signal code: (-6)
[work02:46171] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7fd99d368730]
[work02:46171] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7fd99d1ca7bb]
[work02:46171] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7fd99d1b5535]
[work02:46171] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2240f)[0x7fd99d1b540f]
[work02:46171] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30102)[0x7fd99d1c3102]
[work02:46171] [ 5] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7fd99009ceb9]
[work02:46171] [ 6] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7fd99008f229]
[work02:46171] [ 7] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7fd9900c6666]
[work02:46171] [ 8] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7fd990085193]
[work02:46171] [ 9] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7fd990089a32]
[work02:46171] [10] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7fd990089e57]
[work02:46171] [11] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7fd990adba40]
[work02:46171] [12] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7fd990ed74dc]
[work02:46171] [13] python(+0x1b4924)[0x55b901f0b924]
[work02:46171] [14] python(_PyEval_EvalFrameDefault+0x4bf)[0x55b901f33bcf]
[work02:46171] [15] python(_PyFunction_Vectorcall+0x1b7)[0x55b901f20637]
[work02:46171] [16] python(_PyEval_EvalFrameDefault+0x71a)[0x55b901f33e2a]
[work02:46171] [17] python(_PyEval_EvalCodeWithName+0x260)[0x55b901f1f490]
[work02:46171] [18] python(+0x1f6bb9)[0x55b901f4dbb9]
[work02:46171] [19] python(+0x13a23d)[0x55b901e9123d]
[work02:46171] [20] python(PyVectorcall_Call+0x6f)[0x55b901eb4f2f]
[work02:46171] [21] python(_PyEval_EvalFrameDefault+0x5fc1)[0x55b901f396d1]
[work02:46171] [22] python(_PyEval_EvalCodeWithName+0x260)[0x55b901f1f490]
[work02:46171] [23] python(_PyFunction_Vectorcall+0x594)[0x55b901f20a14]
[work02:46171] [24] python(_PyEval_EvalFrameDefault+0x4e73)[0x55b901f38583]
[work02:46171] [25] python(_PyFunction_Vectorcall+0x1b7)[0x55b901f20637]
[work02:46171] [26] python(_PyEval_EvalFrameDefault+0x4bf)[0x55b901f33bcf]
[work02:46171] [27] python(_PyFunction_Vectorcall+0x1b7)[0x55b901f20637]
[work02:46171] [28] python(_PyEval_EvalFrameDefault+0x71a)[0x55b901f33e2a]
[work02:46171] [29] python(_PyFunction_Vectorcall+0x1b7)[0x55b901f20637]
[work02:46171] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: work01
PID: 44557
Message: connect() to 192.168.204.122:1024 failed
Error: Operation now in progress (115)
--------------------------------------------------------------------------
[work01:44552] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
-- N E S T --
Copyright (C) 2004 The NEST Initiative
Version: nest-2.18.0
Built: Jan 27 2020 12:49:17
This program is provided AS IS and comes with
NO WARRANTY. See the file LICENSE for details.
Problems or suggestions?
Visit https://www.nest-simulator.org
Type 'nest.help()' to find out more about NEST.
Nov 03 03:25:11 ModelManager::clear_models_ [Info]:
Models will be cleared and parameters reset.
Nov 03 03:25:11 Network::create_rngs_ [Info]:
Deleting existing random number generators
Nov 03 03:25:11 Network::create_rngs_ [Info]:
Creating default RNGs
Nov 03 03:25:11 Network::create_grng_ [Info]:
Creating new default global RNG
Nov 03 03:25:11 RecordingDevice::set_status [Info]:
Data will be recorded to file and to memory.
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 46171 on node work02 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
I just used the mpirun installed on the system to run it. It seems to work, but the two processes run independently and there is no communication between them.
(pynest) work@work01:~/xiejiadu/nest_multi_test$ /usr/bin/mpirun -np 2 -host work01,work02 python /home/work/xiejiadu/nest_multi_test/multi_test.py
[INFO] [2020.11.3 3:41:32 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 3:41:32 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
-- N E S T --
Copyright (C) 2004 The NEST Initiative
Version: nest-2.18.0
Built: Jan 27 2020 12:49:17
This program is provided AS IS and comes with
NO WARRANTY. See the file LICENSE for details.
Problems or suggestions?
Visit https://www.nest-simulator.org
Type 'nest.help()' to find out more about NEST.
Nov 03 03:41:32 ModelManager::clear_models_ [Info]:
Models will be cleared and parameters reset.
Nov 03 03:41:32 Network::create_rngs_ [Info]:
Deleting existing random number generators
Nov 03 03:41:32 Network::create_rngs_ [Info]:
Creating default RNGs
Nov 03 03:41:32 Network::create_grng_ [Info]:
Creating new default global RNG
Nov 03 03:41:32 RecordingDevice::set_status [Info]:
Data will be recorded to file and to memory.
Nov 03 03:41:32 NodeManager::prepare_nodes [Info]:
Preparing 12 nodes for simulation.
Nov 03 03:41:32 SimulationManager::start_updating_ [Info]:
Number of local nodes: 12
Simulation time (ms): 100
Number of OpenMP threads: 4
Number of MPI processes: 1
Nov 03 03:41:32 SimulationManager::run [Info]:
Simulation finished.
[INFO] [2020.11.3 3:41:32 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 3:41:32 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
-- N E S T --
Copyright (C) 2004 The NEST Initiative
Version: nest-2.18.0
Built: Jan 27 2020 12:49:17
This program is provided AS IS and comes with
NO WARRANTY. See the file LICENSE for details.
Problems or suggestions?
Visit https://www.nest-simulator.org
Type 'nest.help()' to find out more about NEST.
Nov 03 03:41:32 ModelManager::clear_models_ [Info]:
Models will be cleared and parameters reset.
Nov 03 03:41:32 Network::create_rngs_ [Info]:
Deleting existing random number generators
Nov 03 03:41:32 Network::create_rngs_ [Info]:
Creating default RNGs
Nov 03 03:41:32 Network::create_grng_ [Info]:
Creating new default global RNG
Nov 03 03:41:32 RecordingDevice::set_status [Info]:
Data will be recorded to file and to memory.
Nov 03 03:41:32 NodeManager::prepare_nodes [Info]:
Preparing 12 nodes for simulation.
Nov 03 03:41:32 SimulationManager::start_updating_ [Info]:
Number of local nodes: 12
Simulation time (ms): 100
Number of OpenMP threads: 4
Number of MPI processes: 1
Nov 03 03:41:32 SimulationManager::run [Info]:
Simulation finished.
I now have the multi_test.py file in the same directory on both nodes.
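A quick way to confirm what the log already hints at (`Number of MPI processes: 1` on both ranks under `mpirun -np 2`) is to read the world size that an Open MPI launcher exports into each process's environment. This is a sketch with a helper name of my own; it only applies to Open MPI, and if it reads 2 while NEST reports 1, the Python process was started by one MPI but NEST was linked against another:

```python
import os

def detected_world_size(env=os.environ):
    """Open MPI exports the rank count to every process it launches.

    If this still reads 1 under `mpirun -np 2`, mpirun did not launch this
    process (or is a different MPI than the one the application was built
    against).
    """
    return int(env.get("OMPI_COMM_WORLD_SIZE", "1"))
```

Comparing this value against what `nest.NumProcesses()` reports inside the script would show whether NEST itself was built with MPI support.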
I think this is an important piece of information:
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: work01
PID: 44557
Message: connect() to 192.168.204.122:1024 failed
Error: Operation now in progress (115)
It seems as if the nodes don't know how to communicate with each other. Maybe we can find a way to tell them. Could you check the output of `ip addr` on both nodes? It should hopefully give hints about which ways of communication exist between them:
ssh work01
ip addr
and
ssh work02
ip addr
Oh, have you ever run the multi-area model in your own environment? My current cluster consists of 9 servers, each with 4 CPUs and 176 cores. I think TCP communication is fine, since they are on the same LAN, and I have also configured password-free SSH login.
(pynest) work@work01:~/xiejiadu/nest_multi_test$ ssh work02
Linux work02 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Nov 3 02:01:52 2020 from 192.168.112.31
work@work02:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp39s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether b4:05:5d:50:9c:d0 brd ff:ff:ff:ff:ff:ff
inet 192.168.204.122/24 brd 192.168.204.255 scope global enp39s0f0
valid_lft forever preferred_lft forever
inet6 fe80::b605:5dff:fe50:9cd0/64 scope link
valid_lft forever preferred_lft forever
3: enp39s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether b4:05:5d:50:9c:d1 brd ff:ff:ff:ff:ff:ff
4: enp39s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether b4:05:5d:50:9c:d2 brd ff:ff:ff:ff:ff:ff
5: enp39s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether b4:05:5d:50:9c:d3 brd ff:ff:ff:ff:ff:ff
(pynest) work@work02:~/xiejiadu/nest_multi_test$ ssh work01
Linux work01 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Nov 3 02:01:50 2020 from 192.168.112.31
work@work01:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp39s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether b4:05:5d:48:4c:42 brd ff:ff:ff:ff:ff:ff
inet 192.168.204.121/24 brd 192.168.204.255 scope global enp39s0f0
valid_lft forever preferred_lft forever
inet6 fe80::b605:5dff:fe48:4c42/64 scope link
valid_lft forever preferred_lft forever
3: enp39s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether b4:05:5d:48:4c:43 brd ff:ff:ff:ff:ff:ff
4: enp39s0f2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether b4:05:5d:48:4c:44 brd ff:ff:ff:ff:ff:ff
5: enp39s0f3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether b4:05:5d:48:4c:45 brd ff:ff:ff:ff:ff:ff
Yes, the multi-area model runs without any problems.
I have never had such problems. The systems I use are already set up, so we do not need to worry about MPI communication and the like.
Could you try:
/home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01,work02 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py
Here I added `-mca btl_tcp_if_include enp39s0f0`. I think this should make TCP use only the enp39s0f0 interface for communication; `ip addr` revealed the name of the interface.
That doesn't seem to help; it's still the same error as before.
(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01,work02 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py
[INFO] [2020.11.3 6:19:35 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 6:19:35 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
[INFO] [2020.11.3 6:19:35 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 6:19:35 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[work02:47789] *** Process received signal ***
[work02:47789] Signal: Aborted (6)
[work02:47789] Signal code: (-6)
[work02:47789] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f3297335730]
[work02:47789] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f32971977bb]
[work02:47789] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f3297182535]
[work02:47789] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2240f)[0x7f329718240f]
[work02:47789] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30102)[0x7f3297190102]
[work02:47789] [ 5] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7f328a069eb9]
[work02:47789] [ 6] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7f328a05c229]
[work02:47789] [ 7] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7f328a093666]
[work02:47789] [ 8] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7f328a052193]
[work02:47789] [ 9] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7f328a056a32]
[work02:47789] [10] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7f328a056e57]
[work02:47789] [11] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7f328aaa8a40]
[work02:47789] [12] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7f328aea44dc]
[work02:47789] [13] python(+0x1b4924)[0x561faf0fc924]
[work02:47789] [14] python(_PyEval_EvalFrameDefault+0x4bf)[0x561faf124bcf]
[work02:47789] [15] python(_PyFunction_Vectorcall+0x1b7)[0x561faf111637]
[work02:47789] [16] python(_PyEval_EvalFrameDefault+0x71a)[0x561faf124e2a]
[work02:47789] [17] python(_PyEval_EvalCodeWithName+0x260)[0x561faf110490]
[work02:47789] [18] python(+0x1f6bb9)[0x561faf13ebb9]
[work02:47789] [19] python(+0x13a23d)[0x561faf08223d]
[work02:47789] [20] python(PyVectorcall_Call+0x6f)[0x561faf0a5f2f]
[work02:47789] [21] python(_PyEval_EvalFrameDefault+0x5fc1)[0x561faf12a6d1]
[work02:47789] [22] python(_PyEval_EvalCodeWithName+0x260)[0x561faf110490]
[work02:47789] [23] python(_PyFunction_Vectorcall+0x594)[0x561faf111a14]
[work02:47789] [24] python(_PyEval_EvalFrameDefault+0x4e73)[0x561faf129583]
[work02:47789] [25] python(_PyFunction_Vectorcall+0x1b7)[0x561faf111637]
[work02:47789] [26] python(_PyEval_EvalFrameDefault+0x4bf)[0x561faf124bcf]
[work02:47789] [27] python(_PyFunction_Vectorcall+0x1b7)[0x561faf111637]
[work02:47789] [28] python(_PyEval_EvalFrameDefault+0x71a)[0x561faf124e2a]
[work02:47789] [29] python(_PyFunction_Vectorcall+0x1b7)[0x561faf111637]
[work02:47789] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: work01
PID: 46244
Message: connect() to 192.168.204.122:1024 failed
Error: Operation now in progress (115)
--------------------------------------------------------------------------
[work01:46239] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
-- N E S T --
Copyright (C) 2004 The NEST Initiative
Version: nest-2.18.0
Built: Jan 27 2020 12:49:17
This program is provided AS IS and comes with
NO WARRANTY. See the file LICENSE for details.
Problems or suggestions?
Visit https://www.nest-simulator.org
Type 'nest.help()' to find out more about NEST.
Nov 03 06:19:35 ModelManager::clear_models_ [Info]:
Models will be cleared and parameters reset.
Nov 03 06:19:35 Network::create_rngs_ [Info]:
Deleting existing random number generators
Nov 03 06:19:35 Network::create_rngs_ [Info]:
Creating default RNGs
Nov 03 06:19:35 Network::create_grng_ [Info]:
Creating new default global RNG
Nov 03 06:19:35 RecordingDevice::set_status [Info]:
Data will be recorded to file and to memory.
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 47789 on node work02 exited on signal 6 (Aborted).
The above error is about Open MPI. In my understanding, Open MPI can only be multi-threaded on one node and cannot be used across multiple nodes. Is this distributed multi-node run done with the mpirun command?
Open MPI distributes MPI processes. Here we distribute 2 MPI processes (`-np 2`) across two nodes. If you specify 4 virtual processes in NEST, NEST will understand that 2 of those are MPI processes and thus spawn 2 threads on each node.
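The arithmetic behind that split can be sketched as follows (a hypothetical helper; in reality the NEST kernel does this bookkeeping when `total_num_virtual_procs` is set):

```python
def threads_per_process(total_virtual_procs, num_mpi_processes):
    """NEST's virtual processes are MPI processes times local threads,
    so the total must be divisible by the number of MPI ranks."""
    if total_virtual_procs % num_mpi_processes != 0:
        raise ValueError(
            "total_num_virtual_procs must be a multiple of the MPI process count"
        )
    return total_virtual_procs // num_mpi_processes

# mpirun -np 2 with total_num_virtual_procs=4  ->  2 threads on each node
```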
Do you have a recommended configuration tutorial for a multi-node NEST simulation environment? Brief setup steps and a list of the required packages would also be fine.