chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.76k stars 415 forks source link

DistributedBag module: inter-locale work stealing mechanism #19860

Open Guillaume-Helbecque opened 2 years ago

Guillaume-Helbecque commented 2 years ago

Summary of Problem

Theoretically, when we declare a distributed bag, each locale has its own bag instance, and a dynamic inter-locale work stealing mechanism is supposed to occur when a locale's bag instance is empty.

Let's consider 2 locales for the description of the issue (but it is also true for any number of locales): When locale 0 has a non-empty bag instance and locale 1 has an empty bag instance, locale 1 is not able to steal work on locale 0. This happens either when locale 1 starts the execution with an empty bag instance, either when its bag instance becomes empty during execution.

We took a look at https://github.com/chapel-lang/chapel/blob/main/modules/packages/DistributedBag.chpl and found that this behavior seems to come from an if condition that is never verified:

if !targetBag!.loadBalanceInProgress.read() { ... }

This condition occurs in the REMOVE_WORST_CASE, where we attempt to steal work on another locale. It seems that the two other cases REMOVE_BEST_CASE and REMOVE_AVERAGE_CASE work fine.

Steps to Reproduce

Source Code:

use DistributedBag;

config const n = 2000;

var bag = new DistBag(int, targetLocales=Locales); // creation of the bag

bag.addBulk(1..n); // insertion of elements

writeln("Initial bag size: ", bag.getSize(), "\n");

coforall loc in Locales do on loc do {
  var counter: atomic int = 0; // local counter

  coforall tid in 0..#here.maxTaskPar do {

    while true {
      var (empty, x): (bool, int) = bag.remove(); // we try to remove an element
      if (empty == true){ // if we successfully removed an element
        counter.add(1); // update the counter
      }

      if (bag.getSize() == 0){ // if the global bag is empty
        break;
      }
    }

  } // end coforall threads

  writeln(counter, " element(s) removed from ", here);
} // end coforall locales

Compile command:

chpl foo.chpl -o foo.o --fast

Execution command:

./foo.o -nl 2

Output:

Initial bag size: 2000

0 element(s) removed from LOCALE1
2000 element(s) removed from LOCALE0

Due to the dynamic inter-locale work stealing mechanism, we expected the locales to remove approximately the same number of elements.

Configuration Information

lydia-duncan commented 2 years ago

@LouisJenkinsCS - thoughts on this issue? I don't feel I understand the inner workings well enough myself

LouisJenkinsCS commented 2 years ago

While I am unable to locate the specific cause of the issue at a glance, I will ask: What happens if you call balance on it to force it to rebalance the workload?

This method is very heavy-weight in that it should not be called too often. Dynamic work stealing handles cases where there is a relatively fair distribution across majority of nodes, but this should be called when you have a severe imbalance, or when you have a smaller number of elements to balance. Furthermore, while this operation is parallel-safe, it should be called from at most one task.

In this case, given the severe imbalance, this may be the time to call said function. Also, wondering as well, is ran on an actual cluster/supercomputer or locally? If locally, is it possible that locale 0 is finishing all of its tasks before task 1 has the chance to try to steal work?

I personally would just add:

bag.addBulk(1..n); // insertion of elements
bag.balance();

As shown here: https://github.com/chapel-lang/chapel/blob/bed485d233e45c38638b20194ad6ab8698f894d1/modules/packages/DistributedBag.chpl#L102-L103

LouisJenkinsCS commented 2 years ago

Also note that the segment size starts at 1024 and goes up to 1M, and IIRC, work stealing is done at the granularity of segments; try increasing the workload to much larger amounts. The number of segments is the maximum parallelism here.maxTaskPar and so if you have a large powerful machine, the work will be spread out all over locale 0.

To limit work stealing from already starving locale segments, you can also increase the minimum number of elements required to steal work: https://github.com/chapel-lang/chapel/blob/bed485d233e45c38638b20194ad6ab8698f894d1/modules/packages/DistributedBag.chpl#L1170

Guillaume-Helbecque commented 2 years ago

I'm using this data structure on a supercomputer cluster. Moreover, for the issue reporting, I proposed a simple example, but in practice, I'm inserting Node records inside the bag (containing some int and a tuple of int), and I'm dealing with tens of millions of Node.

As you suggested, I tried to tune the parameters of the bag, without seeing any improvements. I think that this can be explain by the fact that these parameters are invoked inside an if condition that is never verified (in my experiments): https://github.com/chapel-lang/chapel/blob/bed485d233e45c38638b20194ad6ab8698f894d1/modules/packages/DistributedBag.chpl#L1165

Guillaume-Helbecque commented 2 years ago

Actually, a possible alternative is to use the static balance function, that seems to work fine. However, as you said, this function is "very heavy-weight and should not be called too often".

LouisJenkinsCS commented 2 years ago

I see. If you wish to do some quick profiling, you could modify the code to introduce a counter for how often that portion of code (REMOVE_WORSE_CASE) is called, how often it tries to steal work or fails to, etc. I don't have time for it right now myself, unfortunately. I've run into some subtle compiler bugs that introduced undefined behavior before (i.e. one where it treated a ref to an int as a pointer when doing a comparison rather than the value it points to, (int *) < int rather than *(int *) < int in code generation) but those are super unpredictable and hard to detect. There could also be an oversight somewhere in my implementation, but I recall dynamic load balancing working relatively well so long as the static balance is called once at the beginning.