NexGenAnalytics / MIT-MUQ

BSD 3-Clause "New" or "Revised" License

examples failing for MPI #88

Open fnrizzi opened 4 weeks ago

fnrizzi commented 4 weeks ago

In `examples`, the following tests exercise the parallel (MPI) features of MUQ; some of them currently fail.

# passing

```
*** single chain reference
Starting single chain MCMC sampler...
10% Complete
  Block 0: MHKernel acceptance Rate = 33%
20% Complete
  Block 0: MHKernel acceptance Rate = 36%
30% Complete
  Block 0: MHKernel acceptance Rate = 38%
40% Complete
  Block 0: MHKernel acceptance Rate = 38%
50% Complete
  Block 0: MHKernel acceptance Rate = 37%
60% Complete
  Block 0: MHKernel acceptance Rate = 37%
70% Complete
  Block 0: MHKernel acceptance Rate = 37%
80% Complete
  Block 0: MHKernel acceptance Rate = 38%
90% Complete
  Block 0: MHKernel acceptance Rate = 37%
100% Complete
  Block 0: MHKernel acceptance Rate = 37%
Completed in 0.00816509 seconds.
mean QOI: 0.838309 1.7598
```


# failing 

- `ParallelMultilevelMonteCarlo.cpp` fails under MPI:

```
[2024-10-24 14:51:53.112] [info] Balancing load across 0 ranks
terminate called after throwing an instance of 'boost::wrapexcept'
  what(): No such node (MLMCMC.Scheduling)
[poisson:429676] Process received signal
[poisson:429676] Signal: Aborted (6)
[poisson:429676] Signal code: (-6)
[poisson:429676] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7bb955242520]
[poisson:429676] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7bb9552969fc]
[poisson:429676] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7bb955242476]
[poisson:429676] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7bb9552287f3]
[poisson:429676] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7bb9556a2b9e]
[poisson:429676] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7bb9556ae20c]
[poisson:429676] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x7bb9556ae277]
[poisson:429676] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8)[0x7bb9556ae4d8]
[poisson:429676] [ 8] /home/frizzi/Desktop/muq/install/lib/libmuqSamplingAlgorithms.so(_ZN5boost13property_tree11basic_ptreeINSt7cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_St4lessIS7_EE9get_childERKNS0_11string_pathIS7_NS0_13id_translatorIS7_EEEE+0x598)[0x7bb955ed4ee8]
[poisson:429676] [ 9] /home/frizzi/Desktop/muq/install/lib/libmuqSamplingAlgorithms.so(_ZN3muq18SamplingAlgorithms25StaticLoadBalancingMIMCMCC2EN5boost13property_tree11basic_ptreeINSt7cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_St4lessISA_EEESt10shared_ptrINS0_32ParallelizableMIComponentFactoryEESE_INS0_18StaticLoadBalancerEESE_IN6parcer12CommunicatorEESE_INS_9Utilities14OTF2TracerBaseEE+0x587)[0x7bb955ff4767]
[poisson:429676] [10] ./ParallelMultilevelMonteCarlo(+0x143c2)[0x5d5f4472f3c2]
[poisson:429676] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7bb955229d90]
[poisson:429676] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7bb955229e40]
[poisson:429676] [13] ./ParallelMultilevelMonteCarlo(+0x13205)[0x5d5f4472e205]
[poisson:429676] End of error message
[poisson:429675] An error occurred in MPI_Send
[poisson:429675] reported by process [457965569,0]
[poisson:429675] on communicator MPI_COMM_WORLD
[poisson:429675] MPI_ERR_RANK: invalid rank
[poisson:429675] MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[poisson:429675] and potentially your MPI job)
```

(The rank-0 `MPI_Send` error above was interleaved mid-backtrace in the raw stderr; it is grouped here for readability.)


- `FullParallelMultilevelGaussianSampling` in `Example3_MultilevelGaussian/cpp`:

```
[2024-10-24 14:56:24.708] [debug] Rank: 0
[2024-10-24 14:56:24.708] [info] Balancing load across 0 ranks
[2024-10-24 14:56:24.708] [debug] Rank: 1
[poisson:430094] An error occurred in MPI_Send
[poisson:430094] reported by process [86048769,0]
[poisson:430094] on communicator MPI_COMM_WORLD
[poisson:430094] MPI_ERR_RANK: invalid rank
[poisson:430094] MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[poisson:430094] and potentially your MPI job)
```


- `SubsamplingTestMultilevelGaussianSampling` in `Example3_MultilevelGaussian/cpp`:

```
Running with subsampling 0

*** greedy multillevel chain

Setting up level 0
Setting up level 1
terminate called after throwing an instance of 'boost::wrapexcept'
  what(): No such node (MLMCMC.Subsampling_0)
Running with subsampling 0

*** greedy multillevel chain

Setting up level 0
Setting up level 1
terminate called after throwing an instance of 'boost::wrapexcept'
  what(): No such node (MLMCMC.Subsampling_0)

Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.

mpirun noticed that process rank 1 with PID 0 on node poisson exited on signal 6 (Aborted).
```


- `FullParallelMultiindexGaussianSampling` in `Example4_MultiindexGaussian/cpp`:

```
[2024-10-24 15:01:39.446] [debug] Rank: 0
[2024-10-24 15:01:39.446] [debug] Rank: 1
[2024-10-24 15:01:39.446] [info] Balancing load across 0 ranks
[poisson:430492] An error occurred in MPI_Send
[poisson:430492] reported by process [78839809,0]
[poisson:430492] on communicator MPI_COMM_WORLD
[poisson:430492] MPI_ERR_RANK: invalid rank
[poisson:430492] MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[poisson:430492] and potentially your MPI job)
```