astro-informatics / sopt

Sparse OPTimisation using state-of-the-art convex optimisation algorithms.
http://astro-informatics.github.io/sopt/
GNU General Public License v2.0

SARA communicator causes a crash when sara.size() < comm.size() #180

Closed Luke-Pratley closed 6 years ago

Luke-Pratley commented 6 years ago

When there are more MPI procs than SARA wavelets, SARA crashes.

Maybe this can be fixed with a split communicator.
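
One way to picture the split-communicator idea is to give every rank a colour before splitting. The sketch below is only an illustration under assumptions: the helper names `sara_split_colour` and `sara_split_colours` are hypothetical and not part of sopt, and the rule "the first n_wavelets ranks take part" is an assumed policy. In an MPI program each rank would pass its colour to `MPI_Comm_split`, and SARA would then be built on the colour-0 communicator only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of sopt): colour used to split a communicator
// when only `n_wavelets` ranks can be given a wavelet basis.
// Colour 0: the rank takes part in SARA. Colour 1: the rank holds no wavelet.
int sara_split_colour(int rank, int n_wavelets) {
  assert(rank >= 0 && n_wavelets >= 0);
  return rank < n_wavelets ? 0 : 1;
}

// Colours for every rank of a communicator with `comm_size` ranks, useful for
// inspecting the grouping before any MPI call is made.
std::vector<int> sara_split_colours(int comm_size, int n_wavelets) {
  assert(comm_size >= 0);
  std::vector<int> colours(comm_size);
  for (int rank = 0; rank < comm_size; ++rank)
    colours[rank] = sara_split_colour(rank, n_wavelets);
  return colours;
}
```

With 5 ranks and 3 wavelets, ranks 0 to 2 get colour 0 and ranks 3 and 4 get colour 1, so the SARA communicator would never be larger than the number of wavelets.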

Luke-Pratley commented 6 years ago

@ilectra I get the error output below when using 5 nodes and 3 wavelets. The line of code where the problem happens is

  t_real const gamma
      = (Psi.adjoint() * (measurements->adjoint() * uv_data.vis)).cwiseAbs().maxCoeff() * 1e-3;

in cpp/example/padmm_mpi_random_coverage.cc.

Assertion failed: (mat.rows()>0 && mat.cols()>0 && "you are using an empty matrix"), function run, file /Users/luke/dev/purify/build/external/include/eigen3/Eigen/src/Core/Redux.h, line 175.
[lukes-MacBook-Air:91370] *** Process received signal ***
[lukes-MacBook-Air:91370] Signal: Abort trap: 6 (6)
[lukes-MacBook-Air:91370] Signal code:  (0)
[lukes-MacBook-Air:91370] [ 0] 0   libsystem_platform.dylib            0x00007fff7d19af5a _sigtramp + 26
[lukes-MacBook-Air:91370] [ 1] 0   ???                                 0xbf67e96c66830cf2 0x0 + 13792249035631037682
[lukes-MacBook-Air:91370] [ 2] 0   libsystem_c.dylib                   0x00007fff7cfc630a abort + 127
[lukes-MacBook-Air:91370] [ 3] 0   libsystem_c.dylib                   0x00007fff7cf8e360 basename_r + 0
[lukes-MacBook-Air:91370] [ 4] 0   global_epsilon_replicated_grids     0x0000000100af82f0 _ZN5Eigen8internal10redux_implINS0_13scalar_max_opIdEENS_12CwiseUnaryOpINS0_13scalar_abs_opISt7complexIdEEEKNS_13ReturnByValueIN4sopt7details15AppliedFunctionIRKSt8functionIFvRNS_6MatrixIS7_Lin1ELi1ELi0ELin1ELi1EEERKSF_EENS_10MatrixBaseINS9_INSC_ISM_NSN_ISF_EEEEEEEEEEEEEELi0ELi0EE3runERKSV_RKS3_ + 115
[lukes-MacBook-Air:91370] [ 5] 0   global_epsilon_replicated_grids     0x0000000100aec9a2 _ZNK5Eigen9DenseBaseINS_12CwiseUnaryOpINS_8internal13scalar_abs_opISt7complexIdEEEKNS_13ReturnByValueIN4sopt7details15AppliedFunctionIRKSt8functionIFvRNS_6MatrixIS5_Lin1ELi1ELi0ELin1ELi1EEERKSD_EENS_10MatrixBaseINS7_INSA_ISK_NSL_ISD_EEEEEEEEEEEEEEE5reduxINS2_13scalar_max_opIdEEEENS2_9result_ofIFT_dEE4typeERKSZ_ + 46
[lukes-MacBook-Air:91370] [ 6] 0   global_epsilon_replicated_grids     0x0000000100adf4c5 _ZNK5Eigen9DenseBaseINS_12CwiseUnaryOpINS_8internal13scalar_abs_opISt7complexIdEEEKNS_13ReturnByValueIN4sopt7details15AppliedFunctionIRKSt8functionIFvRNS_6MatrixIS5_Lin1ELi1ELi0ELin1ELi1EEERKSD_EENS_10MatrixBaseINS7_INSA_ISK_NSL_ISD_EEEEEEEEEEEEEEE8maxCoeffEv + 43
[lukes-MacBook-Air:91370] [ 7] 0   global_epsilon_replicated_grids     0x0000000100ad65fa _Z13padmm_factoryRKSt10shared_ptrIKN4sopt15LinearTransformIN5Eigen6MatrixISt7complexIdELin1ELi1ELi0ELin1ELi1EEEEEERKNS0_8wavelets4SARAERKNS2_5ArrayIS5_Lin1ELin1ELi0ELin1ELin1EEERKN6purify9utilities10vis_paramsEdRKNS0_3mpi12CommunicatorE + 522
[lukes-MacBook-Air:91370] [ 8] 0   global_epsilon_replicated_grids     0x0000000100ad7b71 main + 1918
[lukes-MacBook-Air:91370] [ 9] 0   libdyld.dylib                       0x00007fff7cf1a145 start + 1
[lukes-MacBook-Air:91370] *** End of error message ***
[lukes-MacBook-Air:91369] *** Process received signal ***
[lukes-MacBook-Air:91369] Signal: Abort trap: 6 (6)
[lukes-MacBook-Air:91369] Signal code:  (0)
[lukes-MacBook-Air:91369] [ 0] 0   libsystem_platform.dylib            0x00007fff7d19af5a _sigtramp + 26
[lukes-MacBook-Air:91369] [ 1] 0   ???                                 0xbf67e96c66830cf2 0x0 + 13792249035631037682
[lukes-MacBook-Air:91369] [ 2] 0   libsystem_c.dylib                   0x00007fff7cfc630a abort + 127
[lukes-MacBook-Air:91369] [ 3] 0   libsystem_c.dylib                   0x00007fff7cf8e360 basename_r + 0
[lukes-MacBook-Air:91369] [ 4] 0   global_epsilon_replicated_grids     0x000000010f3eb2f0 _ZN5Eigen8internal10redux_implINS0_13scalar_max_opIdEENS_12CwiseUnaryOpINS0_13scalar_abs_opISt7complexIdEEEKNS_13ReturnByValueIN4sopt7details15AppliedFunctionIRKSt8functionIFvRNS_6MatrixIS7_Lin1ELi1ELi0ELin1ELi1EEERKSF_EENS_10MatrixBaseINS9_INSC_ISM_NSN_ISF_EEEEEEEEEEEEEELi0ELi0EE3runERKSV_RKS3_ + 115
[lukes-MacBook-Air:91369] [ 5] 0   global_epsiloAssertion failed: (mat.rows()>0 && mat.cols()>0 && "you are using an empty matrix"), function run, file /Users/luke/dev/purify/build/external/include/eigen3/Eigen/src/Core/Redux.h, line 175.
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 0 on node lukes-MacBook-Air exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

This error says that a reduction was run on an empty matrix/vector. From the backtrace, the assertion is raised inside maxCoeff(), applied to the result of cwiseAbs().

Luke-Pratley commented 6 years ago

@ilectra It is definitely a problem with that line of code, which suggests it is not really a problem with SARA. If I replace that line with t_real gamma = 1;, there are no problems. Maybe the adjoint of SARA is returning an empty vector, and .maxCoeff() is trying to find the maximum of it on the extra nodes?
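
If the empty-vector guess is right, a guard around the reduction would avoid the assertion on ranks that hold no coefficients. The following is a minimal sketch using only the standard library rather than Eigen; the function name `max_abs_or` and the choice of fallback value are assumptions for illustration, not what sopt or purify do.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <vector>

// Hypothetical guard (not sopt code): largest |x_i| in `values`, or `fallback`
// when the vector is empty, instead of reducing over nothing.
double max_abs_or(const std::vector<std::complex<double>>& values, double fallback) {
  // The fallback stands in for a magnitude, so it must not be negative.
  assert(fallback >= 0.0);
  if (values.empty()) return fallback;
  double result = 0.0;
  for (const auto& v : values) result = std::max(result, std::abs(v));
  return result;
}
```

A rank without wavelets would then report the fallback (for example 0), which is harmless if the per-rank values are later combined with a maximum across ranks. Whether such a cross-rank reduction is wanted here is itself an assumption.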

Luke-Pratley commented 6 years ago

factor = 0 in https://github.com/astro-informatics/sopt/blob/development/cpp/sopt/wavelets.h#L227 , which would give an empty vector on nodes without wavelets. That is probably what causes the error!
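
To see why some nodes end up with nothing, consider an even block distribution of bases over ranks. This is a hypothetical scheme written for illustration, not the code in wavelets.h, and the names `local_basis_count` and `ranks_without_basis` are made up for this sketch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical block distribution (not the sopt implementation): number of
// wavelet bases given to `rank` when `n_bases` are shared by `comm_size` ranks.
int local_basis_count(int n_bases, int comm_size, int rank) {
  assert(comm_size > 0 && rank >= 0 && rank < comm_size && n_bases >= 0);
  int const base = n_bases / comm_size;
  int const remainder = n_bases % comm_size;
  // The first `remainder` ranks take one extra basis each.
  return base + (rank < remainder ? 1 : 0);
}

// Ranks that receive no basis and would therefore hold an empty coefficient vector.
std::vector<int> ranks_without_basis(int n_bases, int comm_size) {
  std::vector<int> empty_ranks;
  for (int rank = 0; rank < comm_size; ++rank)
    if (local_basis_count(n_bases, comm_size, rank) == 0) empty_ranks.push_back(rank);
  return empty_ranks;
}
```

For 3 wavelets on 5 ranks this scheme leaves ranks 3 and 4 with a count of 0, which is in line with the mpirun message above reporting that rank 4 aborted.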