CompPhysVienna / n2p2

n2p2 - A Neural Network Potential Package
https://compphysvienna.github.io/n2p2/
GNU General Public License v3.0

Training Error #132

Closed r-hou closed 2 years ago

r-hou commented 2 years ago

The question

I have compiled the newest version of n2p2 on an HPC cluster with eigen-3.4.0, gsl-2.7, openmpi-4.1.1, and gcc-5.5. nnp-scaling successfully generates scaling.data, but nnp-train does not work. If I run "mpirun --np 28 nnp-train", three processes abort with the same error:

terminate called after throwing an instance of 'std::out_of_range'
  what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
[compute-0-21:50213] Process received signal
[compute-0-21:50213] Signal: Aborted (6)
[compute-0-21:50213] Signal code: (-6)
[compute-0-21:50213] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7efd0a4d05e0]
[compute-0-21:50213] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7efd0a1331f7]
[compute-0-21:50213] [ 2] /lib64/libc.so.6(abort+0x148)[0x7efd0a1348e8]
[compute-0-21:50213] [ 3] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7efd0ac8498d]
[compute-0-21:50213] [ 4] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(+0x8c9e6)[0x7efd0ac829e6]
[compute-0-21:50213] [ 5] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(+0x8ca31)[0x7efd0ac82a31]
[compute-0-21:50213] [ 6] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(+0x8cc49)[0x7efd0ac82c49]
[compute-0-21:50213] [ 7] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(_ZSt24__throw_out_of_range_fmtPKcz+0xf5)[0x7efd0acaa695]
[compute-0-21:50213] [ 8] nnp-train[0x425649]
[compute-0-21:50213] [ 9] nnp-train[0x43a1d3]
[compute-0-21:50213] [10] nnp-train[0x405a52]
[compute-0-21:50213] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7efd0a11fc05]
[compute-0-21:50213] [12] nnp-train[0x40684d]
[compute-0-21:50213] End of error message

terminate called after throwing an instance of 'std::out_of_range'
  what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
[compute-0-21:50197] Process received signal
[compute-0-21:50197] Signal: Aborted (6)
[compute-0-21:50197] Signal code: (-6)
[compute-0-21:50197] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7f60815855e0]
[compute-0-21:50197] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f60811e81f7]
[compute-0-21:50197] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f60811e98e8]
[compute-0-21:50197] [ 3] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7f6081d3998d]
[compute-0-21:50197] [ 4] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(+0x8c9e6)[0x7f6081d379e6]
[compute-0-21:50197] [ 5] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(+0x8ca31)[0x7f6081d37a31]
[compute-0-21:50197] [ 6] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(+0x8cc49)[0x7f6081d37c49]
[compute-0-21:50197] [ 7] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(_ZSt24__throw_out_of_range_fmtPKcz+0xf5)[0x7f6081d5f695]
[compute-0-21:50197] [ 8] nnp-train[0x425649]
[compute-0-21:50197] [ 9] nnp-train[0x43a1d3]
[compute-0-21:50197] [10] nnp-train[0x405a52]
[compute-0-21:50197] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f60811d4c05]
[compute-0-21:50197] [12] nnp-train[0x40684d]
[compute-0-21:50197] End of error message

terminate called after throwing an instance of 'std::out_of_range'
  what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
[compute-0-21:50200] Process received signal
[compute-0-21:50200] Signal: Aborted (6)
[compute-0-21:50200] Signal code: (-6)
[compute-0-21:50200] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7f9f0ae8b5e0]
[compute-0-21:50200] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f9f0aaee1f7]
[compute-0-21:50200] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f9f0aaef8e8]
[compute-0-21:50200] [ 3] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7f9f0b63f98d]
[compute-0-21:50200] [ 4] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(+0x8c9e6)[0x7f9f0b63d9e6]
[compute-0-21:50200] [ 5] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(+0x8ca31)[0x7f9f0b63da31]
[compute-0-21:50200] [ 6] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(+0x8cc49)[0x7f9f0b63dc49]
[compute-0-21:50200] [ 7] /home/rhou/opt/gcc/gcc-5.5.0/build-dir/lib64/libstdc++.so.6(_ZSt24__throw_out_of_range_fmtPKcz+0xf5)[0x7f9f0b665695]
[compute-0-21:50200] [ 8] nnp-train[0x425649]
[compute-0-21:50200] [ 9] nnp-train[0x43a1d3]
[compute-0-21:50200] [10] nnp-train[0x405a52]
[compute-0-21:50200] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9f0aadac05]
[compute-0-21:50200] [12] nnp-train[0x40684d]
[compute-0-21:50200] End of error message

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun noticed that process rank 6 with PID 50200 on node compute-0-21 exited on signal 6 (Aborted).

If I run it serially with "nnp-train", it works, but it is very slow. Thanks for any answer!

singraber commented 2 years ago

Hello!

How many configurations do you have in your input.data file? If a serial run works, does it also work with fewer than 28 cores? Can you provide a minimal example that reproduces the error?
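
A quick way to answer the first question is to count the configurations in the data file. The sketch below is only an illustration, not part of n2p2: the helper name `count_structures` is hypothetical, and it assumes each configuration in input.data starts with a line whose first word is `begin`. Whether the number of configurations is actually related to the crash is not established in this thread.

```python
def count_structures(path):
    """Count configurations in a data file.

    Assumption: every configuration starts with a line whose first
    whitespace-separated word is 'begin'.
    """
    count = 0
    with open(path) as handle:
        for line in handle:
            # line.split()[:1] is [] for blank lines, so they are skipped safely.
            if line.split()[:1] == ["begin"]:
                count += 1
    return count
```

Calling `count_structures("input.data")` in the training directory gives the number to compare against the 28 requested ranks. For the second question, rerunning with a smaller count, for example `mpirun --np 4 nnp-train`, shows whether the crash depends on the number of processes.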