astro-informatics / purify

Next-generation radio interferometric imaging.
https://astro-informatics.github.io/purify
GNU General Public License v2.0

'bad alloc' after using distributed image MPI measurement operator #267

Closed lao19881213 closed 5 years ago

lao19881213 commented 5 years ago

Hi, I tried to use purify (v3.0.1) to process a measurement set, but it failed. Here is the error message:

[2019-07-16 13:45:11.848] [purify] [debug] Adding channel 62 to plane...
[2019-07-16 13:46:21.503] [purify] [debug] Adding channel 63 to plane...
[2019-07-16 13:58:28.759] [purify] [critical] Using cell size 8" x 8", recommended from the uv coverage and field of view is 8.9588"x7.89797".
[2019-07-16 13:59:03.584] [purify] [debug] Using distributed image MPI measurement operator.
terminate called after throwing an instance of 'std::bad_alloc'
  what(): std::bad_alloc
[lbq-TP300:28858] Process received signal
[lbq-TP300:28858] Signal: Aborted (6)
[lbq-TP300:28858] Signal code: (-6)
[lbq-TP300:28858] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f4e9f933330]
[lbq-TP300:28858] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f4e9f591c37]
[lbq-TP300:28858] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f4e9f595028]
[lbq-TP300:28858] [ 3] /home/lbq/work/purify_new/gcc/lib64/libstdc++.so.6(_ZN9__gnu_cxx27verbose_terminate_handlerEv+0x125)[0x7f4ea031b065]
[lbq-TP300:28858] [ 4] /home/lbq/work/purify_new/gcc/lib64/libstdc++.so.6(+0x8ee56)[0x7f4ea0318e56]
[lbq-TP300:28858] [ 5] /home/lbq/work/purify_new/gcc/lib64/libstdc++.so.6(+0x8eea1)[0x7f4ea0318ea1]
[lbq-TP300:28858] [ 6] /home/lbq/work/purify_new/gcc/lib64/libstdc++.so.6(+0x8f0e3)[0x7f4ea03190e3]
[lbq-TP300:28858] [ 7] purify[0x41ed37]
[lbq-TP300:28858] [ 8] purify(_ZN6purify9utilities10visparamsC1ERKS1+0xc31)[0x42cea1]
[lbq-TP300:28858] [ 9] /home/lbq/work/purify_new/purify/lib/libpurify.so.3.0(_ZN6purify9utilities13set_cell_sizeERKNS0_10vis_paramsERKdS5_S5S5+0x28)[0x7f4ea4917858]
[lbq-TP300:28858] [10] /home/lbq/work/purify_new/purify/lib/libpurify.so.3.0(_ZN6purify9utilities13set_cell_sizeERKNS0_10visparamsERKdS5+0xa7)[0x7f4ea4918437]
[lbq-TP300:28858] [11] /home/lbq/work/purify_new/purify/lib/libpurify.so.3.0(_ZN6purify9utilities13set_cell_sizeERKN4sopt3mpi12CommunicatorERKNS0_10visparamsERKdSA+0x109)[0x7f4ea49972c9]
[lbq-TP300:28858] [12] purify[0x45bc18]
[lbq-TP300:28858] [13] purify[0x416be0]
[lbq-TP300:28858] [14] /lib/x86_64-linux-gnu/libc.so.6(libc_start_main+0xf5)[0x7f4e9f57cf45]
[lbq-TP300:28858] [15] purify[0x41928e]
[lbq-TP300:28858] End of error message
Aborted (core dumped)

How can I fix this problem? Thanks.

Luke-Pratley commented 5 years ago

@lao19881213 If you are running the data on a cluster, how many visibilities and how many nodes are you using?

lao19881213 commented 5 years ago

The measurement set has about 90 million rows of visibilities. I used only one node to run purify, and that node has at most 8 GB of memory.

On 2019-07-17 22:00:19, "Luke Pratley" notifications@github.com wrote:

@lao19881213 If you are running the data on a cluster, how many visibilities and how many nodes are you using?

Luke-Pratley commented 5 years ago

At the moment we are storing all of the gridding kernels (which is not ideal, but has been easier for development), and this takes up much more working memory :(. We (at least I) plan to fix this in the coming months, as it is an implementation issue. Assuming no wide-field effects (w = 0), the total memory used by the gridding kernels is 90 million × 16 × 4 × 4 bytes ≈ 23 GB. 8 GB per node is very small :(. The easy short-term solution is to use more nodes.
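
For anyone who wants to redo this estimate for their own data, here is a rough sketch of the arithmetic in Python. It is not part of purify. The function names are made up for illustration, and the reading of the factors (16 bytes per stored weight, for example a complex double, and a 4 x 4 kernel per visibility) is an assumption inferred from the numbers above. It also assumes the kernels are split evenly across nodes and ignores memory needed for the visibilities and images, so treat the node count as a lower bound.

```python
import math


def kernel_memory_bytes(n_vis, support=4, bytes_per_weight=16):
    """Bytes needed to store one support x support kernel per visibility.

    Assumption: 16 bytes per weight and a 4 x 4 kernel, matching the
    factors in the estimate above (w = 0, no wide-field effects).
    """
    return n_vis * support * support * bytes_per_weight


def min_nodes(total_bytes, bytes_per_node):
    """Smallest node count whose combined memory covers total_bytes,
    assuming the kernels are divided evenly between nodes."""
    return math.ceil(total_bytes / bytes_per_node)


total = kernel_memory_bytes(90_000_000)  # about 90 million visibilities
nodes = min_nodes(total, 8e9)            # 8 GB of memory per node

print(f"gridding kernels: {total / 1e9:.2f} GB")
print(f"nodes needed (lower bound): {nodes}")
```

With 90 million visibilities and 8 GB per node this gives roughly 23 GB and at least 3 nodes, before counting anything else that has to fit in memory.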