LLNL / Caliper

Caliper is an instrumentation and performance profiling library
http://software.llnl.gov/Caliper/
BSD 3-Clause "New" or "Revised" License

Can not return in multi-node MPI applications #429

Open parrotsky opened 2 years ago

parrotsky commented 2 years ago

Hi, first I would like to thank the contributors for providing such an elegant and easy-to-use library for profiling MPI programs. My problem: following the MPI tutorial, I built an MPI cluster on a LAN with up to 8 devices (Linux, Ubuntu 20.04). I want to use Caliper to profile my applications across multiple devices. Before that, I wrote a simple hello world program to test whether it works. The code is below:

#include <mpi.h>
#include <stdio.h>
#include <caliper/cali.h>
#include <caliper/cali-manager.h>
// ...
// ...
int main(int argc, char** argv) {

    // Configure Caliper, then initialize the MPI environment
    cali::ConfigManager mgr;
    mgr.add("runtime-report,event-trace(output=trace.cali)");
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "xxx MPI does not provide needed thread support!\n");
        return -1;
        // Error - MPI does not provide needed threading level
    }

    //     MPI_Init(&argc, &argv);

    mgr.start(); 
    // ...
    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    //   CALI_MARK_BEGIN("iemann_slice_precompute");
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    //CALI_MARK_END("iemann_slice_precompute");
    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
            processor_name, world_rank, world_size);

    // Flush and stop Caliper, then finalize the MPI environment.
    mgr.flush();
    mgr.stop();
    MPI_Finalize();
}

The program works perfectly with multiple processes on a single device:

sky@nx01:~/cloud$ mpirun -np 2 ./hello
Hello world from processor nx01, rank 0 out of 2 processors
Hello world from processor nx01, rank 1 out of 2 processors
Path                   Min time/rank Max time/rank Avg time/rank Time %    
MPI_Comm_dup                0.000952      0.001182      0.001067 13.165525 
MPI_Get_processor_name      0.000133      0.000193      0.000163  2.011228 
Function               Count (min) Count (max) Time (min) Time (max) Time (avg) Time %    
                                 9          13   0.040653   0.040994   0.040823 92.516799 
MPI_Comm_dup                     2           2   0.001527   0.002249   0.001888  4.278705 
MPI_Recv                         4           4   0.000935   0.000935   0.000935  1.059478 
MPI_Comm_free                    1           1   0.000170   0.000287   0.000228  0.517841 
MPI_Get_processor_name           1           1   0.000170   0.000285   0.000228  0.515575 
MPI_Send                         4           4   0.000421   0.000421   0.000421  0.477048 
MPI_Finalize                     1           1   0.000069   0.000134   0.000102  0.230026 
MPI_Probe                        2           2   0.000186   0.000186   0.000186  0.210762 
MPI_Get_count                    2           2   0.000171   0.000171   0.000171  0.193766 

When I test it across two devices (nodes), the program does not return normally and gets stuck somewhere:

sky@nx01:~/cloud$ mpirun -np 2 --host nx01,nx02 ./hello
Hello world from processor nx02, rank 1 out of 2 processors
Hello world from processor nx01, rank 0 out of 2 processors
Path                   Min time/rank Max time/rank Avg time/rank Time %    
MPI_Comm_dup                0.003007      0.003007      0.003007 29.905520 
MPI_Get_processor_name      0.000132      0.000132      0.000132  1.312780 

Has anybody encountered the same issue or figured out where the bug is? Thanks a lot for any answers.

daboehme commented 2 years ago

Hello @parrotsky,

This is unusual: Caliper shouldn't affect MPI progress when going from intra- to inter-node communication. Does this only happen when Caliper is enabled? It's possible the issue is in the underlying program. In particular, pay close attention to the order of communications between the processes and make sure you're not stuck in a blocking MPI_Send. It's possible that an MPI_Send finishes immediately for a target process on the same node, but waits for a matching MPI_Recv to be called first when it goes over the network.
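To illustrate why the same send/receive order can finish on one node but hang across nodes, here is a toy model. It makes no MPI calls, and `Op`, `completes`, and the `eager` flag are hypothetical names made up for this sketch. It assumes two ranks that each run a fixed script of blocking operations addressed to the other rank. An "eager" send is treated as buffered and returning at once; a non-eager send waits for the matching receive.

```cpp
#include <cstddef>
#include <vector>

// Toy model, not MPI: one blocking operation addressed to the other rank.
enum class Op { Send, Recv };

// Returns true if both scripts run to completion, false if they deadlock.
// eager == true  models a send that is buffered and returns immediately.
// eager == false models a send that waits for the matching receive.
bool completes(const std::vector<Op>& rank0, const std::vector<Op>& rank1,
               bool eager) {
    const std::vector<Op>* script[2] = {&rank0, &rank1};
    std::size_t pc[2] = {0, 0};  // index of each rank's next operation
    int buffered[2] = {0, 0};    // messages waiting to be received by each rank

    bool progress = true;
    while (progress) {
        progress = false;
        for (int r = 0; r < 2; ++r) {
            const int peer = 1 - r;
            if (pc[r] == script[r]->size()) continue;  // this rank is done
            const Op op = (*script[r])[pc[r]];
            if (op == Op::Send && eager) {
                // Buffered send: completes without the peer's help.
                ++buffered[peer];
                ++pc[r];
                progress = true;
            } else if (op == Op::Recv && buffered[r] > 0) {
                // A buffered message is available: the receive completes.
                --buffered[r];
                ++pc[r];
                progress = true;
            } else if (op == Op::Send && !eager &&
                       pc[peer] < script[peer]->size() &&
                       (*script[peer])[pc[peer]] == Op::Recv) {
                // Rendezvous: send and matching receive finish together.
                ++pc[r];
                ++pc[peer];
                progress = true;
            }
        }
    }
    // No rank can move any more: success only if both scripts are finished.
    return pc[0] == rank0.size() && pc[1] == rank1.size();
}
```

Under these assumptions, `completes({Op::Send, Op::Recv}, {Op::Send, Op::Recv}, true)` returns true, while the same scripts with `eager` set to false return false. Changing one rank's script to `{Op::Recv, Op::Send}` completes in both modes. In a real MPI program, the usual remedies are to order sends and receives so that they pair up, or to use MPI_Sendrecv.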

parrotsky commented 2 years ago

Hi @daboehme, thanks for your reply. The MPI_Send in the profiling log reminds me that Caliper may duplicate the MPI communicator, and the program may be stuck in MPI_Send or MPI_Recv. In the multi-process, single-device profiling report, both MPI_Send and MPI_Recv are called. However, the hello world example simply calls MPI_Comm_rank, without any MPI_Send or MPI_Recv. So I agree that the problem may be in Caliper, where some process's MPI_Send/MPI_Recv finishes too early.