parrotsky opened this issue 2 years ago
Hello @parrotsky,
This is unusual, Caliper shouldn't affect MPI progress when going from intra- to inter-node communication. Does this only happen when Caliper is enabled? It's possible the issue is in the underlying program. In particular, pay close attention to the order of communications between the processes and make sure you're not stuck in a blocking MPI_Send. It's possible that an MPI_Send finishes immediately for a target process on the same node but waits for a matching MPI_Recv to be called first when it goes over the network.
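For illustration, a sketch (hypothetical, not taken from the issue) of the ordering hazard described above: every rank posts a large MPI_Send before its MPI_Recv. Intra-node the sends may complete eagerly through shared memory, but over the network most MPI implementations switch to a rendezvous protocol above the eager threshold, so each send blocks until the peer posts its receive and the program deadlocks.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                    /* large enough to exceed typical eager limits */
    int* sendbuf = calloc(n, sizeof(int));
    int* recvbuf = calloc(n, sizeof(int));
    int peer = rank ^ 1;                      /* pair ranks 0<->1, 2<->3, ...; run with an even rank count */

    /* Both partners send first: fine with eager delivery, deadlocks under rendezvous. */
    MPI_Send(sendbuf, n, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, n, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d of %d done\n", rank, size);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Reversing the order on one rank of each pair, or using MPI_Sendrecv or nonblocking MPI_Isend/MPI_Irecv with MPI_Waitall, breaks the dependency cycle.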
Hi @daboehme, thanks for your reply. The MPI_Send in the profiling log makes me wonder whether Caliper duplicates the MPI_Comm and the program gets stuck in MPI_Send or MPI_Recv. In the multi-process, single-device profiling report we can see that both MPI_Send and MPI_Recv are called. However, in the hello_world example I simply call MPI_Get_rank, without any MPI_Send or MPI_Recv. So I suspect the problem may lie in Caliper, with some processes finishing their MPI_Send/Recv too early.
Hi, first I would like to thank the contributors for providing such an elegant and easy-to-use library for profiling MPI programs. My problem: I built an MPI cluster on a LAN with up to 8 devices (Linux Ubuntu 20.04) following the MPI tutorial, and I want to use Caliper to profile my applications across multiple devices. Before that, I wrote a simple hello world to test whether it works. The code is as below:
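(A minimal sketch of such a Caliper-annotated MPI hello world; the region name and the run configuration shown afterwards are illustrative assumptions rather than the exact original code:)

```c
#include <mpi.h>
#include <stdio.h>
#include <caliper/cali.h>   /* Caliper annotation macros */

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    CALI_MARK_BEGIN("hello");            /* illustrative region name */

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);

    CALI_MARK_END("hello");

    MPI_Finalize();
    return 0;
}
```

Built against a Caliper installation with MPI support, it is launched with something like CALI_CONFIG=runtime-report mpirun -np 8 --hostfile hosts ./hello (the hostfile name and profiling config are assumptions).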
The program works perfectly with multiple processes on a single device.
When I run it across two devices (nodes), the program does not return normally and gets stuck somewhere.
Has anybody encountered the same issue or figured out where the bug is? Thanks a lot for answering.