Closed Kenkentake closed 3 years ago
Execution result (the .out file was empty, and MPI_Comm_spawn failed with an error reporting that there were not enough nodes)
[u00690@Fugaku ~/optimizePara/src_test]$cat job_test.sh.7469388.err.1.0
[a25-5109c:00079] [mpi::dpm::spawn-resource-error] [[17402,5882],0] There are not enough compute nodes to create processes dynamically according to the requirement.
[mpi::mpi-errors::mpi_errors_are_fatal]
[a25-5109c:00079] *** An error occurred in MPI_Comm_spawn
[a25-5109c:00079] *** reported by process [1140463354,281470681743360]
[a25-5109c:00079] *** on communicator MPI_COMM_SELF
[a25-5109c:00079] *** Unknown error (this should not happen!)
[a25-5109c:00079] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[a25-5109c:00079] *** and potentially your MPI job)
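After setting the number of processes per node to 8 with
#PJM --mpi "max-proc-per-node=8"
errors were still reported, but the test code ran and produced the expected output. (The value 8 was taken from Fukuda-san's master's thesis.) job_test.sh is shown below.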
#!/bin/bash
#PJM --rsc-list "node=8"
#PJM --rsc-list "elapse=10:00"
#PJM --mpi "shape=1"
#PJM --mpi "max-proc-per-node=8"
#PJM -S
export PARALLEL=8
export OMP_NUM_THREADS=$PARALLEL
export PLE_MPI_STD_EMPTYFILE="off"
# export OMP_WAIT_POLICY=ACTIVE
# argv for estimate_main to run test
NUM_OF_POP=16
MU=8
NUM_OF_CHILD_PROCS=2
MAXEVAL=4
NUM_OF_GRANDCHILD_PROCS=2
EXEC_PROG="./test_est_target"
DIM_CON_MAT=4
CON_MAT_NAME="../data/conMat_test.txt";
PARAMETER_FILENAME="../data/params_test.txt";
MPIEXEC="mpiexec -mca mpi_print_stats 1"
NPROC="-n 1"
PROF=""
# minimal estimate_main
EXEC_FILE="./estimate_main ${NUM_OF_POP} ${MU} ${NUM_OF_CHILD_PROCS} ${MAXEVAL} ${NUM_OF_GRANDCHILD_PROCS} ${EXEC_PROG} ${DIM_CON_MAT} ${CON_MAT_NAME} ${PARAMETER_FILENAME}"
# python3
# module load Python3-CN
# export FLIB_CNTL_BARRIER_ERR=FALSE
# execute job
mpiexec -np 1 ${EXEC_FILE}
Output .out files (named .out.{index of the mpiexec invocation}.{rank number}@{spawn number})
connection_data = ../data/conMat_test.txt
dimension = 15, num_of_cell_combination = 16
info@make_neuro_spawn:
num_of_my_pop=8, dimension=15, num_of_procs_nrn=4, exec_prog=./test_est_target, dim_conMat=4, connection_data=../data/conMat_test.txt
end of loop
end of loop
connection_data = ../data/conMat_test.txt
dimension = 15, num_of_cell_combination = 16
info@make_neuro_spawn:
num_of_my_pop=8, dimension=15, num_of_procs_nrn=4, exec_prog=./test_est_target, dim_conMat=4, connection_data=../data/conMat_test.txt
end of loop
end of loop
end of loop
end of loop
end of loop
end of loop
Output .err file
*** Error in munmap_chunk(): invalid pointer: 0x0000ffffffffdb28 ***
======= Backtrace: =========
/opt/FJSVxos/mmm/lib64/libmpg.so.1(+0x8044)[0x400000d68044]
/opt/FJSVxos/mmm/lib64/libmpg.so.1(+0x849c)[0x400000d6849c]
./make_neuro_spawn[0x4022fc]
/lib64/libc.so.6(__libc_start_main+0xe4)[0x400000ff0be4]
./make_neuro_spawn[0x40138c]
======= Memory map: ========
00400000-00410000 r-xp 00000000 66:b8aee 180218084538218168 /vol0004/hp200177/u00690/optimizePara/src_test/optimizePara/src_forSB/make_neuro_spawn
02000000-02200000 rw-p 00000000 00:0f 10324794 /anon_hugepage (deleted)
:
:
400005400000-400006410000 rw-s 00000000 00:00 0
fff80bc00000-1000000000000 rw-p 00000000 00:0f 10324797 /memfd: [stack] by libmpg (deleted)
[c26-2204b:00010] *** Process received signal ***
[c26-2204b:00010] Signal: Aborted (6)
[c26-2204b:00010] Signal code: (-6)
[c26-2204b:00010] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x40000006066c]
[c26-2204b:00010] [ 1] /lib64/libc.so.6(gsignal+0xac)[0x400001002c1c]
[c26-2204b:00010] [ 2] /lib64/libc.so.6(abort+0x110)[0x400000ff07a8]
[c26-2204b:00010] [ 3] /opt/FJSVxos/mmm/lib64/libmpg.so.1(+0x8048)[0x400000d68048]
[c26-2204b:00010] [ 4] /opt/FJSVxos/mmm/lib64/libmpg.so.1(+0x849c)[0x400000d6849c]
[c26-2204b:00010] [ 5] ./make_neuro_spawn[0x4022fc]
[c26-2204b:00010] [ 6] /lib64/libc.so.6(__libc_start_main+0xe4)[0x400000ff0be4]
[c26-2204b:00010] [ 7] ./make_neuro_spawn[0x40138c]
[c26-2204b:00010] *** End of error message ***
*** Error in munmap_chunk(): invalid pointer: 0x0000ffffffffdb28 ***
======= Backtrace: =========
/opt/FJSVxos/mmm/lib64/libmpg.so.1(+0x8044)[0x400000d68044]
/opt/FJSVxos/mmm/lib64/libmpg.so.1(+0x849c)[0x400000d6849c]
./make_neuro_spawn[0x4022fc]
/lib64/libc.so.6(__libc_start_main+0xe4)[0x400000ff0be4]
./make_neuro_spawn[0x40138c]
======= Memory map: ========
00400000-00410000 r-xp 00000000 66:b8aee 180218084538218168 /vol0004/hp200177/u00690/optimizePara/src_test/optimizePara/src_forSB/make_neuro_spawn
02000000-02200000 rw-p 00000000 00:0f 10848767 /anon_hugepage (deleted)
:
:
fff80bc00000-1000000000000 rw-p 00000000 00:0f 10848770 /memfd: [stack] by libmpg (deleted)
[c26-2204b:00011] *** Process received signal ***
[c26-2204b:00011] Signal: Aborted (6)
[c26-2204b:00011] Signal code: (-6)
[c26-2204b:00011] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x40000006066c]
[c26-2204b:00011] [ 1] /lib64/libc.so.6(gsignal+0xac)[0x400001002c1c]
[c26-2204b:00011] [ 2] /lib64/libc.so.6(abort+0x110)[0x400000ff07a8]
[c26-2204b:00011] [ 3] /opt/FJSVxos/mmm/lib64/libmpg.so.1(+0x8048)[0x400000d68048]
[c26-2204b:00011] [ 4] /opt/FJSVxos/mmm/lib64/libmpg.so.1(+0x849c)[0x400000d6849c]
[c26-2204b:00011] [ 5] ./make_neuro_spawn[0x4022fc]
[c26-2204b:00011] [ 6] /lib64/libc.so.6(__libc_start_main+0xe4)[0x400000ff0be4]
[c26-2204b:00011] [ 7] ./make_neuro_spawn[0x40138c]
[c26-2204b:00011] *** End of error message ***
The error above was caused by memory that make_neuro_spawn.c allocated dynamically with malloc and calloc but never released with free.
Adding the following to make_neuro_spawn.c resolved the error:
free(pop_rcvbuf_whole);
free(pop_sendbuf_nrn_weight_adjust_dim);
Note: the pointers are set to NULL after the dynamically allocated memory is freed, in order to prevent a double free (see reference).
export PARALLEL=8
export OMP_NUM_THREADS=$PARALLEL
The job can now be run without errors.
Summary
Run test estimation (not using Neuron) on Fugaku
Goal
Run test estimation (not using Neuron) on Fugaku and merge
Todo
Deadline
08/16
Parent issue
None
References
None
Notes
None