INM-6 / multi-area-model

A large-scale spiking model of the vision-related areas of macaque cortex.

Multi-node simulation environment #18

Open jiaduxie opened 3 years ago

jiaduxie commented 3 years ago

Do you have a recommended tutorial for configuring a multi-node NEST simulation environment? Even a brief outline of the configuration steps and the required packages would help.

jiaduxie commented 3 years ago

I think there is no explicit information transmission in the test program. Is the transmission of spikes from activated neurons handled by the underlying NEST layer rather than by our own code?

jarsi commented 3 years ago

I am not sure I understand the question. NEST takes care of transmitting information such as spikes; that is nothing we need to worry about in our scripts.

jarsi commented 3 years ago

Do these commands work?

/home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01:2 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py

and

/home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work02:2 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py

Just to make sure that MPI processes can be spawned on the individual nodes.

jiaduxie commented 3 years ago

On which node should the above commands run? MPI works when a node launches processes on itself (for example, I can execute a) on work01 but not b)). Did I not fully install MPI?

a) /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01:2 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py

b) /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work02:2 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py

(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01:2 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py
[INFO] [2020.11.3 7:31:43 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 7:31:43 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 7:31:43 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
[INFO] [2020.11.3 7:31:43 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG

              -- N E S T --
  Copyright (C) 2004 The NEST Initiative

 Version: nest-2.18.0
 Built: Jan 27 2020 12:49:17

 This program is provided AS IS and comes with
 NO WARRANTY. See the file LICENSE for details.

 Problems or suggestions?
   Visit https://www.nest-simulator.org

 Type 'nest.help()' to find out more about NEST.

              -- N E S T --
  Copyright (C) 2004 The NEST Initiative

 Version: nest-2.18.0
 Built: Jan 27 2020 12:49:17

 This program is provided AS IS and comes with
 NO WARRANTY. See the file LICENSE for details.

 Problems or suggestions?
   Visit https://www.nest-simulator.org

 Type 'nest.help()' to find out more about NEST.

Nov 03 07:31:43 ModelManager::clear_models_ [Info]: 
    Models will be cleared and parameters reset.

Nov 03 07:31:43 Network::create_rngs_ [Info]: 
    Deleting existing random number generators

Nov 03 07:31:43 Network::create_rngs_ [Info]: 
    Creating default RNGs

Nov 03 07:31:43 Network::create_grng_ [Info]: 
    Creating new default global RNG

Nov 03 07:31:43 ModelManager::clear_models_ [Info]: 
    Models will be cleared and parameters reset.

Nov 03 07:31:43 Network::create_rngs_ [Info]: 
    Deleting existing random number generators

Nov 03 07:31:43 Network::create_rngs_ [Info]: 
    Creating default RNGs

Nov 03 07:31:43 Network::create_grng_ [Info]: 
    Creating new default global RNG

Nov 03 07:31:43 RecordingDevice::set_status [Info]: 
    Data will be recorded to file and to memory.

Nov 03 07:31:43 RecordingDevice::set_status [Info]: 
    Data will be recorded to file and to memory.

Nov 03 07:31:43 NodeManager::prepare_nodes [Info]: 
    Preparing 6 nodes for simulation.

Nov 03 07:31:43 NodeManager::prepare_nodes [Info]: 
    Preparing 6 nodes for simulation.

Nov 03 07:31:43 SimulationManager::start_updating_ [Info]: 
    Number of local nodes: 6
    Simulation time (ms): 100

Nov 03 07:31:43 SimulationManager::start_updating_ [Info]: 
    Number of local nodes: 6
    Simulation time (ms): 100
    Number of OpenMP threads: 2
    Number of MPI processes: 2
    Number of OpenMP threads: 2
    Number of MPI processes: 2

Nov 03 07:31:43 SimulationManager::run [Info]: 
    Simulation finished.

Nov 03 07:31:43 SimulationManager::run [Info]: 
    Simulation finished.
(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work02:2 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py
[INFO] [2020.11.3 7:32:47 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 7:32:47 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
[INFO] [2020.11.3 7:32:47 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.3 7:32:47 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[work02:48594] *** Process received signal ***
[work02:48594] Signal: Aborted (6)
[work02:48594] Signal code:  (-6)
[work02:48594] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7fe2c539f730]
[work02:48594] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7fe2c52017bb]
[work02:48594] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7fe2c51ec535]
[work02:48594] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2240f)[0x7fe2c51ec40f]
[work02:48594] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30102)[0x7fe2c51fa102]
[work02:48594] [ 5] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7fe2b80d3eb9]
[work02:48594] [ 6] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7fe2b80c6229]
[work02:48594] [ 7] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7fe2b80fd666]
[work02:48594] [ 8] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7fe2b80bc193]
[work02:48594] [ 9] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7fe2b80c0a32]
[work02:48594] [10] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7fe2b80c0e57]
[work02:48594] [11] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7fe2b8b12a40]
[work02:48594] [12] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7fe2b8f0e4dc]
[work02:48594] [13] python(+0x1b4924)[0x5597c0cdf924]
[work02:48594] [14] python(_PyEval_EvalFrameDefault+0x4bf)[0x5597c0d07bcf]
[work02:48594] [15] python(_PyFunction_Vectorcall+0x1b7)[0x5597c0cf4637]
[work02:48594] [16] python(_PyEval_EvalFrameDefault+0x71a)[0x5597c0d07e2a]
[work02:48594] [17] python(_PyEval_EvalCodeWithName+0x260)[0x5597c0cf3490]
[work02:48594] [18] python(+0x1f6bb9)[0x5597c0d21bb9]
[work02:48594] [19] python(+0x13a23d)[0x5597c0c6523d]
[work02:48594] [20] python(PyVectorcall_Call+0x6f)[0x5597c0c88f2f]
[work02:48594] [21] python(_PyEval_EvalFrameDefault+0x5fc1)[0x5597c0d0d6d1]
[work02:48594] [22] python(_PyEval_EvalCodeWithName+0x260)[0x5597c0cf3490]
[work02:48594] [23] python(_PyFunction_Vectorcall+0x594)[0x5597c0cf4a14]
[work02:48594] [24] python(_PyEval_EvalFrameDefault+0x4e73)[0x5597c0d0c583]
[work02:48594] [25] python(_PyFunction_Vectorcall+0x1b7)[0x5597c0cf4637]
[work02:48594] [26] python(_PyEval_EvalFrameDefault+0x4bf)[0x5597c0d07bcf]
[work02:48594] [27] python(_PyFunction_Vectorcall+0x1b7)[0x5597c0cf4637]
[work02:48594] [28] python(_PyEval_EvalFrameDefault+0x71a)[0x5597c0d07e2a]
[work02:48594] [29] python(_PyFunction_Vectorcall+0x1b7)[0x5597c0cf4637]
[work02:48594] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[work02:48595] *** Process received signal ***
[work02:48595] Signal: Aborted (6)
[work02:48595] Signal code:  (-6)
[work02:48595] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f10233c3730]
[work02:48595] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f10232257bb]
[work02:48595] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f1023210535]
[work02:48595] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2240f)[0x7f102321040f]
[work02:48595] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30102)[0x7f102321e102]
[work02:48595] [ 5] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7f10160f7eb9]
[work02:48595] [ 6] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7f10160ea229]
[work02:48595] [ 7] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7f1016121666]
[work02:48595] [ 8] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7f10160e0193]
[work02:48595] [ 9] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7f10160e4a32]
[work02:48595] [10] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7f10160e4e57]
[work02:48595] [11] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7f1016b36a40]
[work02:48595] [12] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7f1016f324dc]
[work02:48595] [13] python(+0x1b4924)[0x5614c1eee924]
[work02:48595] [14] python(_PyEval_EvalFrameDefault+0x4bf)[0x5614c1f16bcf]
[work02:48595] [15] python(_PyFunction_Vectorcall+0x1b7)[0x5614c1f03637]
[work02:48595] [16] python(_PyEval_EvalFrameDefault+0x71a)[0x5614c1f16e2a]
[work02:48595] [17] python(_PyEval_EvalCodeWithName+0x260)[0x5614c1f02490]
[work02:48595] [18] python(+0x1f6bb9)[0x5614c1f30bb9]
[work02:48595] [19] python(+0x13a23d)[0x5614c1e7423d]
[work02:48595] [20] python(PyVectorcall_Call+0x6f)[0x5614c1e97f2f]
[work02:48595] [21] python(_PyEval_EvalFrameDefault+0x5fc1)[0x5614c1f1c6d1]
[work02:48595] [22] python(_PyEval_EvalCodeWithName+0x260)[0x5614c1f02490]
[work02:48595] [23] python(_PyFunction_Vectorcall+0x594)[0x5614c1f03a14]
[work02:48595] [24] python(_PyEval_EvalFrameDefault+0x4e73)[0x5614c1f1b583]
[work02:48595] [25] python(_PyFunction_Vectorcall+0x1b7)[0x5614c1f03637]
[work02:48595] [26] python(_PyEval_EvalFrameDefault+0x4bf)[0x5614c1f16bcf]
[work02:48595] [27] python(_PyFunction_Vectorcall+0x1b7)[0x5614c1f03637]
[work02:48595] [28] python(_PyEval_EvalFrameDefault+0x71a)[0x5614c1f16e2a]
[work02:48595] [29] python(_PyFunction_Vectorcall+0x1b7)[0x5614c1f03637]
[work02:48595] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 48594 on node work02 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
jiaduxie commented 3 years ago

Should I treat one node as a control node that drives the computations or programs on the other nodes? The test program is currently running on the work01 node.

jarsi commented 3 years ago

Both commands run 2 MPI processes: in a) on node work01, in b) on node work02. On work01 it succeeds, on work02 it does not. The question is: what is the difference between the two nodes?

One idea I just had: maybe it helps if you source conda and activate your environment in your ~/.bashrc, i.e. add this to your ~/.bashrc:

source /path/to/conda.sh
conda activate pynest

Then try the two commands again.

jiaduxie commented 3 years ago

Running these two scripts this way does not work. I find it strange: there is no problem executing the test script locally on each machine (this works: run a) on work01, run b) on work02), but cross-running produces errors (run b) on work01, run a) on work02). The two machines cannot launch the job on each other. Can you confirm that the command to run is mpirun -np 2 -host work01,work02 python ./multi_nest.py? What execution script do you use on your machine; can you give me a reference? Can you run this test script on your machine?

multi_test.py:
from nest import *
SetKernelStatus({"total_num_virtual_procs": 4})
pg = Create("poisson_generator", params={"rate": 50000.0})
n = Create("iaf_psc_alpha", 4)
sd = Create("spike_detector", params={"to_file": True})
Connect(pg, [n[0]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[0]], [n[1]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[1]], [n[2]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[2]], [n[3]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect(n, sd)
Simulate(100.0)
Running b) on work01:
(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work02:2 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py
[INFO] [2020.11.4 2:15:42 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.4 2:15:42 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
[INFO] [2020.11.4 2:15:42 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.4 2:15:42 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[work02:51926] *** Process received signal ***
[work02:51926] Signal: Aborted (6)
[work02:51926] Signal code:  (-6)
[work02:51926] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f9b06579730]
[work02:51926] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f9b063db7bb]
[work02:51926] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f9b063c6535]
[work02:51926] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2240f)[0x7f9b063c640f]
[work02:51926] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30102)[0x7f9b063d4102]
[work02:51926] [ 5] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7f9af92adeb9]
[work02:51926] [ 6] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7f9af92a0229]
[work02:51926] [ 7] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7f9af92d7666]
[work02:51926] [ 8] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7f9af9296193]
[work02:51926] [ 9] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7f9af929aa32]
[work02:51926] [10] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7f9af929ae57]
[work02:51926] [11] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7f9af9ceca40]
[work02:51926] [12] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7f9afa0e84dc]
[work02:51926] [13] python(+0x1b4924)[0x56228694b924]
[work02:51926] [14] python(_PyEval_EvalFrameDefault+0x4bf)[0x562286973bcf]
[work02:51926] [15] python(_PyFunction_Vectorcall+0x1b7)[0x562286960637]
[work02:51926] [16] python(_PyEval_EvalFrameDefault+0x71a)[0x562286973e2a]
[work02:51926] [17] python(_PyEval_EvalCodeWithName+0x260)[0x56228695f490]
[work02:51926] [18] python(+0x1f6bb9)[0x56228698dbb9]
[work02:51926] [19] python(+0x13a23d)[0x5622868d123d]
[work02:51926] [20] python(PyVectorcall_Call+0x6f)[0x5622868f4f2f]
[work02:51926] [21] python(_PyEval_EvalFrameDefault+0x5fc1)[0x5622869796d1]
[work02:51926] [22] python(_PyEval_EvalCodeWithName+0x260)[0x56228695f490]
[work02:51926] [23] python(_PyFunction_Vectorcall+0x594)[0x562286960a14]
[work02:51926] [24] python(_PyEval_EvalFrameDefault+0x4e73)[0x562286978583]
[work02:51926] [25] python(_PyFunction_Vectorcall+0x1b7)[0x562286960637]
[work02:51926] [26] python(_PyEval_EvalFrameDefault+0x4bf)[0x562286973bcf]
[work02:51926] [27] python(_PyFunction_Vectorcall+0x1b7)[0x562286960637]
[work02:51926] [28] python(_PyEval_EvalFrameDefault+0x71a)[0x562286973e2a]
[work02:51926] [29] python(_PyFunction_Vectorcall+0x1b7)[0x562286960637]
[work02:51926] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[work02:51927] *** Process received signal ***
[work02:51927] Signal: Aborted (6)
[work02:51927] Signal code:  (-6)
[work02:51927] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7ff6a9b41730]
[work02:51927] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7ff6a99a37bb]
[work02:51927] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7ff6a998e535]
[work02:51927] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2240f)[0x7ff6a998e40f]
[work02:51927] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30102)[0x7ff6a999c102]
[work02:51927] [ 5] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7ff69c875eb9]
[work02:51927] [ 6] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7ff69c868229]
[work02:51927] [ 7] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7ff69c89f666]
[work02:51927] [ 8] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7ff69c85e193]
[work02:51927] [ 9] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7ff69c862a32]
[work02:51927] [10] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7ff69c862e57]
[work02:51927] [11] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7ff69d2b4a40]
[work02:51927] [12] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7ff69d6b04dc]
[work02:51927] [13] python(+0x1b4924)[0x55b3385d7924]
[work02:51927] [14] python(_PyEval_EvalFrameDefault+0x4bf)[0x55b3385ffbcf]
[work02:51927] [15] python(_PyFunction_Vectorcall+0x1b7)[0x55b3385ec637]
[work02:51927] [16] python(_PyEval_EvalFrameDefault+0x71a)[0x55b3385ffe2a]
[work02:51927] [17] python(_PyEval_EvalCodeWithName+0x260)[0x55b3385eb490]
[work02:51927] [18] python(+0x1f6bb9)[0x55b338619bb9]
[work02:51927] [19] python(+0x13a23d)[0x55b33855d23d]
[work02:51927] [20] python(PyVectorcall_Call+0x6f)[0x55b338580f2f]
[work02:51927] [21] python(_PyEval_EvalFrameDefault+0x5fc1)[0x55b3386056d1]
[work02:51927] [22] python(_PyEval_EvalCodeWithName+0x260)[0x55b3385eb490]
[work02:51927] [23] python(_PyFunction_Vectorcall+0x594)[0x55b3385eca14]
[work02:51927] [24] python(_PyEval_EvalFrameDefault+0x4e73)[0x55b338604583]
[work02:51927] [25] python(_PyFunction_Vectorcall+0x1b7)[0x55b3385ec637]
[work02:51927] [26] python(_PyEval_EvalFrameDefault+0x4bf)[0x55b3385ffbcf]
[work02:51927] [27] python(_PyFunction_Vectorcall+0x1b7)[0x55b3385ec637]
[work02:51927] [28] python(_PyEval_EvalFrameDefault+0x71a)[0x55b3385ffe2a]
[work02:51927] [29] python(_PyFunction_Vectorcall+0x1b7)[0x55b3385ec637]
[work02:51927] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 51926 on node work02 exited on signal 6 (Aborted).
Running a) on work02:
(pynest) work@work02:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01:2 -mca btl_tcp_if_include enp39s0f0 python /home/work/xiejiadu/nest_multi_test/multi_test.py
[INFO] [2020.11.4 2:18:5 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.4 2:18:5 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.4 2:18:5 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
[INFO] [2020.11.4 2:18:5 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[work01:49930] *** Process received signal ***
[work01:49930] Signal: Aborted (6)
[work01:49930] Signal code:  (-6)
[work01:49930] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f1c2b124730]
[work01:49930] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f1c2af867bb]
[work01:49930] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f1c2af71535]
[work01:49930] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2240f)[0x7f1c2af7140f]
[work01:49930] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30102)[0x7f1c2af7f102]
[work01:49930] [ 5] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7f1c1de69eb9]
[work01:49930] [ 6] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7f1c1de5c229]
[work01:49930] [ 7] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7f1c1de93666]
[work01:49930] [ 8] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7f1c1de52193]
[work01:49930] [ 9] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7f1c1de56a32]
[work01:49930] [10] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7f1c1de56e57]
[work01:49930] [11] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7f1c1e8a8a40]
[work01:49930] [12] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7f1c1eca44dc]
[work01:49930] [13] python(+0x1b4924)[0x558014b12924]
[work01:49930] [14] python(_PyEval_EvalFrameDefault+0x4bf)[0x558014b3abcf]
[work01:49930] [15] python(_PyFunction_Vectorcall+0x1b7)[0x558014b27637]
[work01:49930] [16] python(_PyEval_EvalFrameDefault+0x71a)[0x558014b3ae2a]
[work01:49930] [17] python(_PyEval_EvalCodeWithName+0x260)[0x558014b26490]
[work01:49930] [18] python(+0x1f6bb9)[0x558014b54bb9]
[work01:49930] [19] python(+0x13a23d)[0x558014a9823d]
[work01:49930] [20] python(PyVectorcall_Call+0x6f)[0x558014abbf2f]
[work01:49930] [21] python(_PyEval_EvalFrameDefault+0x5fc1)[0x558014b406d1]
[work01:49930] [22] python(_PyEval_EvalCodeWithName+0x260)[0x558014b26490]
[work01:49930] [23] python(_PyFunction_Vectorcall+0x594)[0x558014b27a14]
[work01:49930] [24] python(_PyEval_EvalFrameDefault+0x4e73)[0x558014b3f583]
[work01:49930] [25] python(_PyFunction_Vectorcall+0x1b7)[0x558014b27637]
[work01:49930] [26] python(_PyEval_EvalFrameDefault+0x4bf)[0x558014b3abcf]
[work01:49930] [27] python(_PyFunction_Vectorcall+0x1b7)[0x558014b27637]
[work01:49930] [28] python(_PyEval_EvalFrameDefault+0x71a)[0x558014b3ae2a]
[work01:49930] [29] python(_PyFunction_Vectorcall+0x1b7)[0x558014b27637]
[work01:49930] *** End of error message ***
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[work01:49931] *** Process received signal ***
[work01:49931] Signal: Aborted (6)
[work01:49931] Signal code:  (-6)
[work01:49931] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f05591e6730]
[work01:49931] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f05590487bb]
[work01:49931] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f0559033535]
[work01:49931] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2240f)[0x7f055903340f]
[work01:49931] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30102)[0x7f0559041102]
[work01:49931] [ 5] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7f054bf2beb9]
[work01:49931] [ 6] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7f054bf1e229]
[work01:49931] [ 7] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7f054bf55666]
[work01:49931] [ 8] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7f054bf14193]
[work01:49931] [ 9] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7f054bf18a32]
[work01:49931] [10] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7f054bf18e57]
[work01:49931] [11] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7f054c96aa40]
[work01:49931] [12] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7f054cd664dc]
[work01:49931] [13] python(+0x1b4924)[0x555babb15924]
[work01:49931] [14] python(_PyEval_EvalFrameDefault+0x4bf)[0x555babb3dbcf]
[work01:49931] [15] python(_PyFunction_Vectorcall+0x1b7)[0x555babb2a637]
[work01:49931] [16] python(_PyEval_EvalFrameDefault+0x71a)[0x555babb3de2a]
[work01:49931] [17] python(_PyEval_EvalCodeWithName+0x260)[0x555babb29490]
[work01:49931] [18] python(+0x1f6bb9)[0x555babb57bb9]
[work01:49931] [19] python(+0x13a23d)[0x555baba9b23d]
[work01:49931] [20] python(PyVectorcall_Call+0x6f)[0x555bababef2f]
[work01:49931] [21] python(_PyEval_EvalFrameDefault+0x5fc1)[0x555babb436d1]
[work01:49931] [22] python(_PyEval_EvalCodeWithName+0x260)[0x555babb29490]
[work01:49931] [23] python(_PyFunction_Vectorcall+0x594)[0x555babb2aa14]
[work01:49931] [24] python(_PyEval_EvalFrameDefault+0x4e73)[0x555babb42583]
[work01:49931] [25] python(_PyFunction_Vectorcall+0x1b7)[0x555babb2a637]
[work01:49931] [26] python(_PyEval_EvalFrameDefault+0x4bf)[0x555babb3dbcf]
[work01:49931] [27] python(_PyFunction_Vectorcall+0x1b7)[0x555babb2a637]
[work01:49931] [28] python(_PyEval_EvalFrameDefault+0x71a)[0x555babb3de2a]
[work01:49931] [29] python(_PyFunction_Vectorcall+0x1b7)[0x555babb2a637]
[work01:49931] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 49930 on node work01 exited on signal 6 (Aborted).
jiaduxie commented 3 years ago

Do you have any better documentation on distributed and parallel simulation with NEST?

jarsi commented 3 years ago

You can check this guide

I am quite certain that the problem is not NEST-related. I think the nodes are not set up correctly, but I do not have much experience with this, so you might want to follow a guide on how to run MPI on a compute cluster and set up your servers accordingly.

All parallel scripts run fine on the clusters I have access to. These are dedicated computing clusters where everything is set up and installed by a system administrator, who makes sure that the nodes and MPI work.

I don't think you need to ssh onto the nodes to run the jobs, as you did in your previous example. You should be able to launch the jobs from the head node.

jiaduxie commented 3 years ago

I don't understand why the submission here has to be in the form of a script. Can you explain it to me? Could you show me the content of your config.py for the multi-area model?

NEST GUIDE: Distributed simulations cannot be run interactively, which means that the simulation has to be provided as a script. However, the script can be the same as a script for any simulation. No changes are necessary for distributed simulation scripts: inter-process communication and node distribution is managed transparently inside of NEST.

To distribute a simulation onto 128 processes of a computer cluster, the command should look like this:

mpirun -np 128 python simulation.py

Please refer to the MPI library documentation for details on the usage of mpirun.

jiaduxie commented 3 years ago

Have you tried running the multi-area model with the conda-installed NEST before?

jarsi commented 3 years ago

Most clusters use a resource manager (like SLURM). There you must submit a script, which gets executed once enough resources are available. You do not have a resource manager, so you do not need to submit a script; it is sufficient to run the model from the command line. The content of config.py is not general: everyone needs their own, and it mainly depends on the scheduler.

Yes, the conda NEST also works with the multi-area model. I usually use a self-compiled NEST because it can be optimized for the hardware you use, but in general the conda NEST works.

jiaduxie commented 3 years ago

It might be enlightening to read your config.py file. I installed NEST under conda, and I don't know what else needs to be installed. Each of my machines is even set up with password-free login (ssh work_num), and there is no problem running commands and test code.

jiaduxie commented 3 years ago

You have also run NEST under conda. Did your supercomputer already have conda, or did you install it yourself?

jarsi commented 3 years ago

I still think that the cluster causes the problem. As I stated before, I have only very limited experience with this, but one way to check would be to compile a simple MPI C program and test whether it succeeds. If it does not, that would prove that the problem is caused neither by NEST nor by conda.

I am following a guide I found on the internet. The example program prints the MPI rank and the hostname; you can run it across nodes. If this does not work, I suggest you try to find out why.

What happens if you compile and run the following code:

Code:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char **argv)
{
  int rank;
  char hostname[256];

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  gethostname(hostname,255);
  printf("Hello from process %3d on host %s\n", rank, hostname);
  MPI_Finalize();
  return 0;
}

Compilation and running:

mpicc -o mpitest mpitest.c
/home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01,work02 -mca btl_tcp_if_include enp39s0f0 ./mpitest

My config.py, for completeness. Keep in mind that it is set up for SLURM (hence all the #SBATCH lines); your individual config.py will be different.

base_path = '/users/home/gitordner/multi-area-model'

# Place to store simulations
data_path = '/users/home/gitordner/multi-area-model/simulations'

# Template for jobscripts
jobscript_template = """#!/bin/bash -x
#SBATCH --job-name MAM_4g
#SBATCH -o {sim_dir}/{label}.%j.o
#SBATCH -e {sim_dir}/{label}.%j.e
#SBATCH --mem=96G
#SBATCH --time=24:00:00
#SBATCH --exclusive
#SBATCH --partition blaustein
#SBATCH --cpus-per-task={local_num_threads}
#SBATCH --ntasks={num_processes}
#SBATCH --nodes={num_nodes}

. "/users/home/miniconda3/etc/profile.d/conda.sh"
conda activate mam

mpirun python -u {base_path}/run_simulation.py {label} {network_label}"""

# Command to submit jobs on the local cluster
submit_cmd = 'sbatch'

I installed conda myself following the official installation guide.

jiaduxie commented 3 years ago

When I use mpicc from conda to compile mpitest.c it reports an error, but the system mpicc compiles the file without problems, so it seems MPI can be used. However, I find it a bit strange that the ranks on work01 and work02 are both 0.

(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpicc -o mpitest mpitest.c 
--------------------------------------------------------------------------
The Open MPI wrapper compiler was unable to find the specified compiler
x86_64-conda-linux-gnu-cc in your PATH.

Note that this compiler was either specified at configure time or in
one of several possible environment variables.
--------------------------------------------------------------------------
(pynest) work@work01:~/xiejiadu/nest_multi_test$ /usr/bin/mpicc -o mpitest mpitest.c 
(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work01,work02 -mca btl_tcp_if_include enp39s0f0 ./mpitest
Hello from process   0 on host work02
Hello from process   0 on host work01
jiaduxie commented 3 years ago

I also wrote a Python program to test MPI.

in work01:mpipython.py

from mpi4py import MPI
print( "work01, my rank is %d" % MPI.COMM_WORLD.Get_rank() )
in work02:mpipython.py

from mpi4py import MPI
print( "work02, my rank is %d" % MPI.COMM_WORLD.Get_rank() )
in work03:mpipython.py

from mpi4py import MPI
print( "work03, my rank is %d" % MPI.COMM_WORLD.Get_rank() )
Run the program:

(pynest) work@work01:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 6 -host work01:2,work02:2,work03:2 -mca btl_tcp_if_include enp39s0f0 python ./mpipython.py 
work01, my rank is 0
work01, my rank is 1
work02, my rank is 3
work02, my rank is 2
work03, my rank is 4
work03, my rank is 5
jiaduxie commented 3 years ago

I think there is no problem with MPI. Do you think it might be a problem with the test program multi_test.py? Maybe this program can only run on one machine.

from nest import *
SetKernelStatus({"total_num_virtual_procs": 4})
pg = Create("poisson_generator", params={"rate": 50000.0})
n = Create("iaf_psc_alpha", 4)
sd = Create("spike_detector", params={"to_file": True})
print("My Rank is :{}".format(Rank()))
Connect(pg, [n[0]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[0]], [n[1]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[1]], [n[2]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect([n[2]], [n[3]], syn_spec={'weight': 1000.0, 'delay': 1.0})
Connect(n, sd)
Simulate(100.0)
jarsi commented 3 years ago

In your conda environment you also need to install the compiler: conda install openmpi-mpicc

The fact that the system mpicc compiles it but running it with the conda mpirun leads to two times rank 0 means that two independent jobs were started. The reason is that the versions differ; the versions always need to match, and you cannot mix conda and system MPI libraries.
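One quick way to check whether a single, consistent MPI job is actually being formed (rather than several independent rank-0 jobs) is a small mpi4py diagnostic. This is only a sketch and assumes mpi4py from the same conda environment as NEST:

from mpi4py import MPI

comm = MPI.COMM_WORLD
# Under one consistent MPI installation every rank reports the same world size
# (greater than 1) and the same library version string; several independent
# "rank 0 of 1" lines point to mismatched MPI installations.
print("rank %d of %d on %s, MPI: %s"
      % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name(),
         MPI.Get_library_version().splitlines()[0]))

Launched with the conda mpirun, all ranks should print the same world size and the same version line.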

What do you mean by "in workX"? Are these different files?

in work01:mpipython.py
in work02:mpipython.py
in work03:mpipython.py

Most probably there is no problem with this test program; there is no reason why it should not run on several nodes. In fact, NEST has successfully run on up to ~86,000 nodes, see for example Jordan et al. 2018.

jiaduxie commented 3 years ago

I found mpicc in the conda environment, but I cannot use it. The mpipython.py files are actually the same; the only difference is that I added the name of the node to the program. When I run mpirun, do I need the same mpipython.py file in the same directory on every node? Or should I mount the work01 directory on the work02 and work03 nodes, so that work02 and work03 share work01's files?

jiaduxie commented 3 years ago

My current approach is to put the same mpipython.py file on each node; I don't know whether they then run independently.

jiaduxie commented 3 years ago

Does the current error seem to be a problem with the NEST library?

(pynest) work@work02:~/xiejiadu/nest_multi_test$ /home/work/anaconda3/envs/pynest/bin/mpirun -np 2 -host work02,work03 python -u /home/work/xiejiadu/nest_multi_test/multi_test.py 
[INFO] [2020.11.6 2:41:3 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.6 2:41:3 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
[INFO] [2020.11.6 2:41:3 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.6 2:41:3 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG
python: /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/sli/scanner.cc:581: bool Scanner::operator()(Token&): Assertion `in->good()' failed.
[work03:57073] *** Process received signal ***
[work03:57073] Signal: Aborted (6)
[work03:57073] Signal code:  (-6)
[work03:57073] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7fa418869730]
[work03:57073] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7fa4186cb7bb]
[work03:57073] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7fa4186b6535]
[work03:57073] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2240f)[0x7fa4186b640f]
[work03:57073] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x30102)[0x7fa4186c4102]
[work03:57073] [ 5] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN7ScannerclER5Token+0x1489)[0x7fa40b597eb9]
[work03:57073] [ 6] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN6ParserclER5Token+0x49)[0x7fa40b58a229]
[work03:57073] [ 7] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZNK14IparseFunction7executeEP14SLIInterpreter+0x96)[0x7fa40b5c1666]
[work03:57073] [ 8] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(+0x74193)[0x7fa40b580193]
[work03:57073] [ 9] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7fa40b584a32]
[work03:57073] [10] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7startupEv+0x27)[0x7fa40b584e57]
[work03:57073] [11] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/../../../libnest.so(_Z11neststartupPiPPPcR14SLIInterpreterNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1ea0)[0x7fa40bfd6a40]
[work03:57073] [12] /home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/pynestkernel.so(+0x444dc)[0x7fa40c3d24dc]
[work03:57073] [13] python(+0x1b4924)[0x55f55b130924]
[work03:57073] [14] python(_PyEval_EvalFrameDefault+0x4bf)[0x55f55b158bcf]
[work03:57073] [15] python(_PyFunction_Vectorcall+0x1b7)[0x55f55b145637]
[work03:57073] [16] python(_PyEval_EvalFrameDefault+0x71a)[0x55f55b158e2a]
[work03:57073] [17] python(_PyEval_EvalCodeWithName+0x260)[0x55f55b144490]
[work03:57073] [18] python(+0x1f6bb9)[0x55f55b172bb9]
[work03:57073] [19] python(+0x13a23d)[0x55f55b0b623d]
[work03:57073] [20] python(PyVectorcall_Call+0x6f)[0x55f55b0d9f2f]
[work03:57073] [21] python(_PyEval_EvalFrameDefault+0x5fc1)[0x55f55b15e6d1]
[work03:57073] [22] python(_PyEval_EvalCodeWithName+0x260)[0x55f55b144490]
[work03:57073] [23] python(_PyFunction_Vectorcall+0x594)[0x55f55b145a14]
[work03:57073] [24] python(_PyEval_EvalFrameDefault+0x4e73)[0x55f55b15d583]
[work03:57073] [25] python(_PyFunction_Vectorcall+0x1b7)[0x55f55b145637]
[work03:57073] [26] python(_PyEval_EvalFrameDefault+0x4bf)[0x55f55b158bcf]
[work03:57073] [27] python(_PyFunction_Vectorcall+0x1b7)[0x55f55b145637]
[work03:57073] [28] python(_PyEval_EvalFrameDefault+0x71a)[0x55f55b158e2a]
[work03:57073] [29] python(_PyFunction_Vectorcall+0x1b7)[0x55f55b145637]
[work03:57073] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: work02
  PID:        68440
  Message:    connect() to 192.168.204.123:1024 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
[work02:68435] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193

              -- N E S T --
  Copyright (C) 2004 The NEST Initiative

 Version: nest-2.18.0
 Built: Jan 27 2020 12:49:17

 This program is provided AS IS and comes with
 NO WARRANTY. See the file LICENSE for details.

 Problems or suggestions?
   Visit https://www.nest-simulator.org

 Type 'nest.help()' to find out more about NEST.

Nov 06 02:41:03 ModelManager::clear_models_ [Info]: 
    Models will be cleared and parameters reset.

Nov 06 02:41:03 Network::create_rngs_ [Info]: 
    Deleting existing random number generators

Nov 06 02:41:03 Network::create_rngs_ [Info]: 
    Creating default RNGs

Nov 06 02:41:03 Network::create_grng_ [Info]: 
    Creating new default global RNG

Nov 06 02:41:03 RecordingDevice::set_status [Info]: 
    Data will be recorded to file and to memory.
My Rank is :0
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 57073 on node work03 exited on signal 6 (Aborted).
jarsi commented 3 years ago

As stated before, you need to install mpicc inside the conda environment (conda install openmpi-mpicc).

Also, the error output has not changed. In my understanding NEST cannot work because of this:

WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

This is not a NEST problem, it is MPI. Can you please verify that some other MPI test program successfully communicates across nodes? See for example the MPI test I posted before.
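To test actual communication between the nodes (the hello-world program only prints and never exchanges messages), a minimal mpi4py point-to-point check could look like the sketch below; if the TCP connection problem from the warning is real, this should hang or fail in the same way. Again, mpi4py from the pynest environment is assumed:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Rank 0 sends a message to rank 1 and waits for the reply; this forces a real
# point-to-point connection between the two MPI processes.
if rank == 0:
    comm.send("ping", dest=1, tag=1)
    print("rank 0 received:", comm.recv(source=1, tag=2))
elif rank == 1:
    print("rank 1 received:", comm.recv(source=0, tag=1))
    comm.send("pong", dest=0, tag=2)

Run it with two processes placed on two different nodes, in the same way as the earlier tests.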

Furthermore, it seems odd that you need to copy the file onto each node. The nodes should be able to access the same file, not separate copies.

Does anyone else in your group use these servers to run jobs across nodes? Does it work for them?

jiaduxie commented 3 years ago

I checked that my current environment has mpicc.

(pynest) work@work01:~/xiejiadu/nest_multi_test$ which mpicc
/home/work/anaconda3/envs/pynest/bin/mpicc

When I tried to install mpicc with conda install openmpi-mpicc, I got an error.

(pynest) work@work01:~/xiejiadu/nest_multi_test$ conda install openmpi-mpicc
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: | 
Found conflicts! Looking for incompatible packages.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package pip conflicts for:
wcwidth -> python -> pip
numpy -> python[version='>=3.8,<3.9.0a0'] -> pip
pip
python_abi -> python=3.8 -> pip
setuptools -> python[version='>=3.7,<3.8.0a0'] -> pip
decorator -> python -> pip
parso -> python -> pip
backcall -> python -> pip
wheel -> python -> pip
pandas -> python[version='>=3.6,<3.7.0a0'] -> pip
ptyprocess -> python[version='>=3.8,<3.9.0a0'] -> pip
nest-simulator==2.18 -> python[version='>=3.8,<3.9.0a0'] -> pip
certifi -> python -> pip
prompt-toolkit -> python[version='>=3.6'] -> pip
patsy -> python[version='>=3.8,<3.9.0a0'] -> pip
mpi4py -> python[version='>=3.7,<3.8.0a0'] -> pip
python-dateutil -> python -> pip
jedi -> python[version='>=3.8,<3.9.0a0'] -> pip
six -> python -> pip
pygments -> python[version='>=3.5'] -> pip
ipython -> python[version='>=3.7,<3.8.0a0'] -> pip
statsmodels -> python[version='>=3.7,<3.8.0a0'] -> pip
python=3.8 -> pip
pickleshare -> python[version='>=3.7,<3.8.0a0'] -> pip
pytz -> python -> pip
cython -> python[version='>=3.6,<3.7.0a0'] -> pip
traitlets -> python[version='>=3.7'] -> pip
ipython_genutils -> python[version='>=3.8,<3.9.0a0'] -> pip
scipy -> python[version='>=3.6,<3.7.0a0'] -> pip

Package parso conflicts for:
parso
jedi -> parso[version='0.1.0|>=0.1.0,<0.2|>=0.2.0|>=0.3.0|>=0.5.0|>=0.5.2|>=0.7.0|>=0.7.0,<0.8.0']
ipython -> jedi[version='>=0.10'] -> parso[version='0.1.0|>=0.1.0,<0.2|>=0.2.0|>=0.3.0|>=0.5.0|>=0.5.2|>=0.7.0|>=0.7.0,<0.8.0']

Package libgcc-ng conflicts for:
jiaduxie commented 3 years ago

I am not sure what the structure of a supercomputer and its job scheduling process look like. My cluster environment here is built from nine ordinary servers (Linux, Debian) connected to the same LAN via network cables. Someone in my group has used MPI, but he uses MPICH, and their programs handle communication with their own communication code; the nine nodes run together and each node has its own executable.

jiaduxie commented 3 years ago

The OpenMPI used by NEST is bundled with NEST when it is installed and compiled, and cannot be installed separately; I am not sure whether that understanding is correct.

jiaduxie commented 3 years ago

I may be able to install SLURM soon. Is it enough to put the program only on the control node to run the model, or do the other compute nodes also need to keep a copy of the program?

jarsi commented 3 years ago

When I wanted to compile something inside the environment it did not work either. I googled the error message and somewhere it was suggested to conda install openmpi-mpicc, which fixed the problem. Conda should take care of the versioning, so you do not end up with mismatched versions.

I have never installed SLURM myself and don't know how to do it. I suggest you follow a guide; there are probably some online.

jiaduxie commented 3 years ago

Hi jarsi, I am now going to run an experiment (1 million neurons, 1 billion synapses) to compare speed, simulating the macaque model with 80 threads on a single node. But I see in the background that only one core of my server is busy. The server has 4 CPUs with 176 cores in total. Can you check the following configuration and run command for errors? The command I run is: python run_example_downscaled.py

run_example_downscaled.py:

import numpy as np
import os
from multiarea_model import MultiAreaModel
from config import base_path

"""
Down-scaled model.
Neurons and indegrees are both scaled down to 10 %.
Can usually be simulated on a local machine.

Warning: This will not yield reasonable dynamical results from the
network and is only meant to demonstrate the simulation workflow.
"""
d = {}
conn_params = {'replace_non_simulated_areas': 'het_poisson_stat',
               'g': -11.,
               'K_stable': 'K_stable.npy',
               'fac_nu_ext_TH': 1.2,
               'fac_nu_ext_5E': 1.125,
               'fac_nu_ext_6E': 1.41666667,
               'av_indegree_V1': 3950.}
input_params = {'rate_ext': 10.}
neuron_params = {'V0_mean': -150.,
                 'V0_sd': 50.}
network_params = {'N_scaling': 0.243,
                  'K_scaling': 0.172,
                  'fullscale_rates': os.path.join(base_path, 'tests/fullscale_rates.json'),
                  'input_params': input_params,
                  'connection_params': conn_params,
                  'neuron_params': neuron_params}

sim_params = {'t_sim':1000.,
              'num_processes': 1,
              'local_num_threads': 80,
              'recording_dict': {'record_vm': False}}

theory_params = {'dt': 0.1}

M = MultiAreaModel(network_params, simulation=True,
                   sim_spec=sim_params,
                   theory=True,
                   theory_spec=theory_params)
#p, r = M.theory.integrate_siegert()
#print("Mean-field theory predicts an average "
#      "rate of {0:.3f} spikes/s across all populations.".format(np.mean(r[:, -1])))
M.simulation.simulate()

I found that building the network takes very long: after 10 hours it has only reached the FEF area.

(pynest) work@work01:~/xiejiadu/multi-area-model-master$ python run_example_downscaled.py 
[INFO] [2020.11.11 8:28:54 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs_] : Creating default RNGs
[INFO] [2020.11.11 8:28:54 /home/conda/feedstock_root/build_artifacts/nest-simulator_1580129123254/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG

              -- N E S T --
  Copyright (C) 2004 The NEST Initiative

 Version: nest-2.18.0
 Built: Jan 27 2020 12:49:17

 This program is provided AS IS and comes with
 NO WARRANTY. See the file LICENSE for details.

 Problems or suggestions?
   Visit https://www.nest-simulator.org

 Type 'nest.help()' to find out more about NEST.

Initializing network from dictionary.
RAND_DATA_LABEL 3163
/home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3372: RuntimeWarning:Mean of empty slice.
/home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/numpy/core/_methods.py:170: RuntimeWarning:invalid value encountered in double_scalars
No R installation, taking hard-coded SLN fit parameters.

========================================
Customized parameters
--------------------
{'K_scaling': 0.172,
 'N_scaling': 0.242,
 'connection_params': {'K_stable': 'K_stable.npy',
                       'av_indegree_V1': 3950.0,
                       'fac_nu_ext_5E': 1.125,
                       'fac_nu_ext_6E': 1.41666667,
                       'fac_nu_ext_TH': 1.2,
                       'g': -11.0,
                       'replace_non_simulated_areas': 'het_poisson_stat'},
 'fullscale_rates': '/home/work/xiejiadu/multi-area-model-master/tests/fullscale_rates.json',
 'input_params': {'rate_ext': 10.0},
 'neuron_params': {'V0_mean': -150.0, 'V0_sd': 50.0}}
========================================
/home/work/xiejiadu/multi-area-model-master/multiarea_model/data_multiarea/Model.py:878: FutureWarning:`rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
/home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/dicthash/dicthash.py:47: UserWarning:Float too small for safe conversion tointeger. Rounding down to zero.
Simulation label: 56723132566eedf72eac06b792c47214
Copied files.
Initialized simulation class.

Nov 11 08:29:07 ModelManager::clear_models_ [Info]: 
    Models will be cleared and parameters reset.

Nov 11 08:29:07 Network::create_rngs_ [Info]: 
    Deleting existing random number generators

Nov 11 08:29:07 Network::create_rngs_ [Info]: 
    Creating default RNGs

Nov 11 08:29:07 Network::create_grng_ [Info]: 
    Creating new default global RNG

Nov 11 08:29:07 ModelManager::clear_models_ [Info]: 
    Models will be cleared and parameters reset.

Nov 11 08:29:07 Network::create_rngs_ [Info]: 
    Deleting existing random number generators

Nov 11 08:29:07 Network::create_rngs_ [Info]: 
    Creating default RNGs

Nov 11 08:29:07 Network::create_grng_ [Info]: 
    Creating new default global RNG

Nov 11 08:29:07 SimulationManager::set_status [Info]: 
    Temporal resolution changed.
Prepared simulation in 0.24 seconds.
/home/work/anaconda3/envs/pynest/lib/python3.8/site-packages/nest/lib/hl_api_helper.py:127: UserWarning:
GetNodes is deprecated and will be removed in NEST 3.0. Use GIDCollection instead.

Nov 11 08:29:07 SLIInterpreter [Deprecated]: 
    SLI function GetNodes_i_D_b_b is deprecated in NEST 3.0.
Rank 0: created area V1 with 47897 local nodes
Memory after V1 : 23771.17 MB
Rank 0: created area V2 with 38011 local nodes
Memory after V2 : 23892.12 MB
Rank 0: created area VP with 41622 local nodes
Memory after VP : 29039.57 MB
Rank 0: created area V3 with 40281 local nodes
Memory after V3 : 29179.27 MB
Rank 0: created area V3A with 28055 local nodes
Memory after V3A : 29244.96 MB
Rank 0: created area MT with 36132 local nodes
Memory after MT : 34384.97 MB
Rank 0: created area V4t with 35038 local nodes
Memory after V4t : 34517.35 MB
Rank 0: created area V4 with 38492 local nodes
Memory after V4 : 39662.64 MB
Rank 0: created area VOT with 35647 local nodes
Memory after VOT : 39795.02 MB
Rank 0: created area MSTd with 30080 local nodes
Memory after MSTd : 44924.17 MB
Rank 0: created area PIP with 30080 local nodes
Memory after PIP : 45046.91 MB
Rank 0: created area PO with 30080 local nodes
Memory after PO : 45114.64 MB
Rank 0: created area DP with 28492 local nodes
Memory after DP : 50239.99 MB
Rank 0: created area MIP with 29959 local nodes
Memory after MIP : 50370.19 MB
Rank 0: created area MDP with 30080 local nodes
Memory after MDP : 50444.24 MB
Rank 0: created area VIP with 30805 local nodes
Memory after VIP : 55580.21 MB
Rank 0: created area LIP with 33835 local nodes
Memory after LIP : 55701.65 MB
Rank 0: created area PITv with 35647 local nodes
Memory after PITv : 60843.73 MB
Rank 0: created area PITd with 35647 local nodes
Memory after PITd : 60976.15 MB
Rank 0: created area MSTl with 30080 local nodes
Memory after MSTl : 66115.12 MB
Rank 0: created area CITv with 26845 local nodes
Memory after CITv : 66212.96 MB
Rank 0: created area CITd with 26845 local nodes
Memory after CITd : 66280.29 MB
Rank 0: created area FEF with 30368 local nodes
Memory after FEF : 71407.92 MB

jiaduxie commented 3 years ago

Hi jarsi, can you answer me?

jarsi commented 3 years ago

10 hours is very long. It shouldn't take this long.

I think this is not a NEST or multi-area model problem. In my experience this can happen if all (or some) threads run on the same core (or cores). That causes serious overhead because the threads constantly have to switch. You can check whether this is the case by ssh-ing onto the node and running htop: if as many cores are busy as you run threads, good; if not, this is the problem.
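
If it helps, here is a small diagnostic sketch (my own illustration, Linux only, not part of the model code) that prints which cores each rank is allowed to run on; if several lines report the same small set of cores, the ranks and threads are piling up on those cores:

# check_affinity.py -- launch it the same way you launch NEST, e.g.
#   mpirun -np 4 python check_affinity.py
# Each rank prints its host, PID and the cores it may run on.
import os
import socket

print('host %s  pid %d  allowed cores: %s'
      % (socket.gethostname(), os.getpid(), sorted(os.sched_getaffinity(0))))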

Have you made progress on the mpi task distribution across nodes?

jarsi commented 3 years ago

On second thought, maybe the long build time is indeed related to NEST. It makes sense to test different combinations of MPI processes and threads; I have seen good results with, for example, 4 or 6 threads per MPI process. In my experience hyperthreading should be avoided: I haven't seen any gains, only performance losses. Assuming all 176 cores are independent physical cores, you could try 44 MPI processes with 4 threads each.

jiaduxie commented 3 years ago

[htop screenshot: upload did not complete]

jiaduxie commented 3 years ago

What is the difference between a process and a thread? During the simulation, is the network created by one thread or by multiple threads together? I don't understand how this works.

jiaduxie commented 3 years ago

What command is running? htop.docx

jarsi commented 3 years ago

Process usually means an MPI process or MPI task. Processes and threads are both ways to parallelize computations. A process has its own private memory and communicates with other processes via MPI messages. Threads that belong to a process share its memory and thus do not need to communicate via messages. Both ways of parallelization have their pros and cons. MPI is needed for communication across nodes; threads cannot do this. On the nodes themselves you can use processes, threads, or a mixture. The perfect balance is often not clear and is best found experimentally.
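
A minimal sketch of how the two levels combine in NEST 2.x (my own illustration, not code from the repository; the file name is made up):

# hybrid_example.py -- launch with e.g.  mpirun -np 4 python hybrid_example.py
import nest

# Each MPI rank keeps its own memory; within a rank, NEST runs this many
# threads, which share that rank's memory.
nest.SetKernelStatus({'local_num_threads': 4})

print('rank', nest.Rank(), 'of', nest.NumProcesses(), 'MPI processes,',
      nest.GetKernelStatus('total_num_virtual_procs'), 'virtual processes in total')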

jarsi commented 3 years ago

According to the image, only one process does all the work. You can try, for example, num_processes=20 and local_num_threads=4; in total this is 20*4 = 80.
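
For illustration only (a sketch with these example values, not a recommendation for your machine), the two settings end up in the simulation parameters like this:

sim_params = {'num_processes': 20,      # number of MPI processes
              'local_num_threads': 4}   # threads per process; 20 * 4 = 80 in total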

jiaduxie commented 3 years ago

But do I then have to start it with mpirun? E.g. mpirun -np 20 python run_example_downscaled.py

jiaduxie commented 3 years ago

I have looked at the code of simulation.py; it does not matter what num_processes and local_num_threads are individually, only their product (total_num_virtual_procs) is used.
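
That matches the NEST 2.x kernel interface as far as I know; here is a minimal sketch (my own illustration, not code taken from simulation.py) of how only the product reaches the kernel, while the number of MPI processes is fixed by the mpirun call:

import nest

num_processes = 20        # must equal the ranks started by "mpirun -np 20 ..."
local_num_threads = 4

# The kernel only receives the product; NEST then derives the threads per rank
# by dividing by the number of MPI processes it was actually launched with.
nest.SetKernelStatus({'total_num_virtual_procs': num_processes * local_num_threads})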

jarsi commented 3 years ago

I think if you use MPI you need to proceed as described in run_example_fullscale.py.

Adjust config.py; try something like this:

# Absolute path of repository
base_path = '/home/users/multi-area-model'

# Place to store simulations
data_path = '/home/users/multi-area-model/simulation'

# Template for job scripts
jobscript_template = '''
# Instruction for the queuing system

. /home/users/miniconda3/etc/profile.d/conda.sh
conda activate multi_area_model

mpirun -np {num_processes} python {base_path}/run_simulation.py {label} {network_label}'''

# Command to submit jobs on the local cluster
submit_cmd = 'bash'

Then you need to adjust your jobscript. It should look something like this:

import numpy as np
import os
from start_jobs import start_job
from config import submit_cmd, jobscript_template
from multiarea_model import MultiAreaModel
from config import base_path

"""
Down-scaled model.
Neurons and indegrees are both scaled down to 10 %.
Can usually be simulated on a local machine.

Warning: This will not yield reasonable dynamical results from the
network and is only meant to demonstrate the simulation workflow.
"""
d = {}
conn_params = {'replace_non_simulated_areas': 'het_poisson_stat',
               'g': -11.,
               'K_stable': 'K_stable.npy',
               'fac_nu_ext_TH': 1.2,
               'fac_nu_ext_5E': 1.125,
               'fac_nu_ext_6E': 1.41666667,
               'av_indegree_V1': 3950.}
input_params = {'rate_ext': 10.}
neuron_params = {'V0_mean': -150.,
                 'V0_sd': 50.}
network_params = {'N_scaling': 0.243,
                  'K_scaling': 0.172,
                  'fullscale_rates': os.path.join(base_path, 'tests/fullscale_rates.json'),
                  'input_params': input_params,
                  'connection_params': conn_params,
                  'neuron_params': neuron_params}

sim_params = {'t_sim':1000.,
              'num_processes': 20,
              'local_num_threads': 4,
              'recording_dict': {'record_vm': False}}

theory_params = {'dt': 0.1}

M = MultiAreaModel(network_params, simulation=True,
                   sim_spec=sim_params,
                   theory=True,
                   theory_spec=theory_params)
start_job(M.simulation.label, submit_cmd, jobscript_template)

Finally submit like this:

python run_example_downscaled.py

I haven't tested the code above; it is a rough sketch. But I think this is the way to go in your case if you want to use MPI.

jiaduxie commented 3 years ago

The downscaled version of the code does not use the job-script submission method, so the settings below seem to have no effect.

# Template for job scripts
jobscript_template = '''
# Instruction for the queuing system

. /home/users/miniconda3/etc/profile.d/conda.sh
conda activate multi_area_model

mpirun -np {num_processes} python {base_path}/run_simulation.py {label} {network_label}'''

# Command to submit jobs on the local cluster
submit_cmd = 'bash'

jarsi commented 3 years ago

It does not use it because it is made for being run locally on your desktop machine, probably only using threads. You are already scaling it up and using MPI.

jarsi commented 3 years ago

Which part does not make sense?

jiaduxie commented 3 years ago

With the following settings, how many neurons will be simulated? I worked it out to be about 400 million; is that right?

'N_scaling': 0.243,
'K_scaling': 0.172,

jiaduxie commented 3 years ago

run_example_downscaled.py did not call the job script during the run.

jarsi commented 3 years ago

The total model has approximately 4 million neurons. The formula for downscaling is N_scaling * 4 million = 0.243 * 4 million = 0.972 million.
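
As a quick back-of-the-envelope check covering synapses as well (the full-scale counts of roughly 4.13 million neurons and 24.2 billion synapses are approximate values I am assuming here; synapse numbers scale with both factors):

N_full, syn_full = 4.13e6, 2.42e10       # approximate full-scale counts
N_scaling, K_scaling = 0.243, 0.172

print('neurons :', N_full * N_scaling)                  # ~1.0e6
print('synapses:', syn_full * N_scaling * K_scaling)    # ~1.0e9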

I also posted a modified version of this script above. It addresses this.

jiaduxie commented 3 years ago

According to this ratio, the number of synapses is 1 billion, right?

jiaduxie commented 3 years ago

I checked the json file and calculated that the number of neurons is about 1 million and the number of synapses about 1 billion.

jiaduxie commented 3 years ago

What problem did your revised version solve?

jarsi commented 3 years ago

You asked how I would start the simulation. This is the way I think you should do it, but it is just a suggestion.

It prepares the simulation with one process and one thread, and sets things up so that multiple MPI processes can easily work on the data (e.g. every process gets its own configuration file, which avoids concurrent data access problems). When everything is prepared, the job is submitted and all processes can start their work.