Err stats - Githubissues

Taking advantage of idle rank due to load imbalance for computing extra physic model outputs such that those can be compared against surrogate outputs. Currently, average and variation of the error of surrogate outputs of chosen validation points are reported per evaluation iteration.

@JaeseungYeom Thank you for the PR!

I am afraid this version has too much data movement across ranks. The predicate array although boolean can be fairly large in an actual large simulation. Gathering and Scattering them across the distributed system will be a bottleneck. How about this solution here:

unsigned physics_evaluations_per_rank[num_ranks];
// Root gathers only the number of physics evaluations per rank.
MPI_Gather(Root, num_of_my_physics_evaluations, physics_evaluations_per_rank); 

if (rank == 0) {
       total_evaluations = sum(physics_evaluations_per_rank);
       load_balance_slack = math::ceil(total_evaluations/word_size) * world_size - total_evaluations;
       int flipped_indexes[load_balance_slack];
       // randomly select which indexes we will flip.
       for (int i = 0; i < load_balance_slack; i++){
           flipped_indexes[i] = sample(0, total_evaluations -1);
       }
       flipped_indexes = sort(flipped_indexes);
       int additional_evaluations_per_rank[num_ranks];
       int j =0;
       int running_sum = 0;
       for ( i = 0; i < load_balance_slack; i++){
           int flipped_index = flipped_indexes[i];
           // Search among ranks in which index does this element fall in
           do {
             if (flipped_index > running_sum && flipped_index < running_sum +  physics_evaluations_per_rank[j]) {
                  additional_evaluations_per_rank[j] ++;
                  running_sum += physics_evaluations_per_rank[j];
                  break;
             }
             running_sum += physics_evaluations_per_rank[j];
           }
           while (j < num_ranks);
       }

       // Now here additional_evaluations_per_rank contains how many extra evaluations need to happen on every rank.
}
     MPI_Scatter(Root, additional_evaluations_per_rank, &my_evaluations);
     // Now here every rank has in my_evaluations the number of additional toggled predicated it needs to perform.

     // Every rank toggles and operates independently.
     ...

The algorithm probably is not correct but I tried to use to the extend possible meaningful variable names for you to understand. The concept though is to limit the number of data you send over the network. In this code I have 1 Gather and 1 Scatter. The size of the message increases linearly to the number of ranks. Instead you code has 1 Gather, 1 Scatter, 1 GatherV, 1 ScatterV. Gatherv and ScatterV may end up sending a lot of data! Which we cannot sustain in the inner loop.

Let me know if I can further help.

Thank you!

@JaeseungYeom Thank you for the PR!

I am afraid this version has too much data movement across ranks. The predicate array although boolean can be fairly large in an actual large simulation. Gathering and Scattering them across the distributed system will be a bottleneck. How about this solution here:
unsigned physics_evaluations_per_rank[num_ranks];
// Root gathers only the number of physics evaluations per rank.
MPI_Gather(Root, num_of_my_physics_evaluations, physics_evaluations_per_rank); 

if (rank == 0) {
       total_evaluations = sum(physics_evaluations_per_rank);
       load_balance_slack = math::ceil(total_evaluations/word_size) * world_size - total_evaluations;
       int flipped_indexes[load_balance_slack];
       // randomly select which indexes we will flip.
       for (int i = 0; i < load_balance_slack; i++){
           flipped_indexes[i] = sample(0, total_evaluations -1);
       }
       flipped_indexes = sort(flipped_indexes);
       int additional_evaluations_per_rank[num_ranks];
       int j =0;
       int running_sum = 0;
       for ( i = 0; i < load_balance_slack; i++){
           int flipped_index = flipped_indexes[i];
           // Search among ranks in which index does this element fall in
           do {
             if (flipped_index > running_sum && flipped_index < running_sum +  physics_evaluations_per_rank[j]) {
                  additional_evaluations_per_rank[j] ++;
                  running_sum += physics_evaluations_per_rank[j];
                  break;
             }
             running_sum += physics_evaluations_per_rank[j];
           }
           while (j < num_ranks);
       }

       // Now here additional_evaluations_per_rank contains how many extra evaluations need to happen on every rank.
}
     MPI_Scatter(Root, additional_evaluations_per_rank, &my_evaluations);
     // Now here every rank has in my_evaluations the number of additional toggled predicated it needs to perform.

     // Every rank toggles and operates independently.
     ...
The algorithm probably is not correct but I tried to use to the extend possible meaningful variable names for you to understand. The concept though is to limit the number of data you send over the network. In this code I have 1 Gather and 1 Scatter. The size of the message increases linearly to the number of ranks. Instead you code has 1 Gather, 1 Scatter, 1 GatherV, 1 ScatterV. Gatherv and ScatterV may end up sending a lot of data! Which we cannot sustain in the inner loop.

Let me know if I can further help.

Thank you!

I got the idea. It is an excellent suggestion. I will implement something along the line of this.

LLNL / AMS

Err stats #76