chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Controlling `dataPar*` settings on a per-loop instead of per-distribution basis #23741

Open alvaradoo opened 8 months ago

alvaradoo commented 8 months ago

Summary of Request

The amount of data parallelism can be controlled by setting the environment variable `CHPL_RT_NUM_THREADS_PER_LOCALE` before executing a program, or through the `blockDist` initializer, e.g. `new blockDist({1..n}, dataParTasksPerLocale=2)`. However, certain programs combine coarse-grained and fine-grained parallel code on the same domain/array, where the fine-grained code performs many communications in multilocale executions and runs faster when `dataParTasksPerLocale` is smaller on GASNet and InfiniBand systems. So far, the only solution I have thought of is to copy arrays on a per-use basis into a distribution with a specific `dataParTasksPerLocale` value. However, this incurs overhead for each copy, and could likely be handled more elegantly at the language level.
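For concreteness, the copy-based workaround might look something like the following sketch. `dataParTasksPerLocale` is a real `blockDist` initializer argument; the exact `dmapped` spelling and the variable names here are mine and may vary by Chapel version:

```chapel
use BlockDist;

config const size = 1_000_000;

// Fine-grained array using the default dataPar* settings:
var D = blockDist.createDomain({0..size-1});
var A: [D] int;

// Per-use copy into a distribution capped at 2 tasks per locale,
// trading a one-time copy cost for coarser parallelism:
const coarseD = {0..size-1} dmapped new blockDist(boundingBox={0..size-1},
                                                  dataParTasksPerLocale=2);
var B: [coarseD] int = A;

forall b in B {
  // communication-heavy work now runs with at most 2 tasks per locale
}
```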

The idea would be to allow users to call a separate iterator that would set the maximum number of tasks that the iterator should use for that domain or array. A small example follows below of what the intended functionality could look like:

```chapel
use BlockDist;

config var numTasks: int;
config var size: int;

var D = blockDist.createDomain({0..size-1});
forall d in D.withGranularity(numTasks) {
  // whatever code here doing stuff with d
}
```
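In the meantime, an effect along these lines can be approximated by hand with explicit tasking, at the cost of losing the `forall` abstraction. A hedged sketch using the standard `RangeChunk` module (the variable names are mine):

```chapel
use BlockDist, RangeChunk;

config const numTasks = 2,
             size = 2048;

var D = blockDist.createDomain({0..size-1});
var A: [D] int;

coforall loc in Locales do on loc {
  const localInds = A.localSubdomain().dim(0);
  if localInds.size > 0 then
    // spawn at most numTasks tasks per locale, each owning one chunk
    coforall chunk in chunks(localInds, min(numTasks, localInds.size)) do
      for i in chunk do
        A[i] += 1; // per-element work here
}
```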

This would give users more freedom to tune the degree of parallelization based on the data itself and the operations on that data. For example, in certain graph analyses, edges and vertices are represented as one-dimensional block-distributed arrays. These vertices and edges tend to have data defined on them that is stored wherever that edge or vertex lives. Search operations happen locally, so having more tasks (finer granularity) increases performance. However, a kernel like triangle counting, which utilizes the same block-distributed edge list, is riddled with fine-grained communications in multilocale executions, and processing those arrays with a coarser number of tasks performs better.

To visualize this performance difference, below are sample results showing execution times for a function with many fine-grained communications (triangles) and for a function whose fine-grained tasks involve no communication (query) and whose performance improves with more tasks. The numbers shown are in seconds, and the data sizes varied. For triangles, a modestly-sized graph of ~100k edges and ~10k vertices was used; triangles primarily iterates over the edge list, so the timings reflect processing the ~100k-element edge list. For query, the graph was composed of ~1 billion edges and ~1 million vertices; query primarily iterates over the vertex set and performs domain set operations in each iteration. So, the arrays processed by triangles are of size ~100k and those by query of size ~1 million. Note that for these experiments the execution times for query don't vary much, but when sizes get into the billions of elements and higher, the number of parallel processing units begins to matter more.

| number of cores | triangles | query |
|----------------:|----------:|------:|
| 16              | 4.45      | 1.17  |
| 32              | 4.95      | 1.12  |
| 64              | 5.99      | 1.11  |
| 128             | 7.86      | 1.02  |

Experiments were conducted on four compute nodes, each assigned as a locale and containing 128 cores (64 per CPU) and 1TB of RAM.

Sample Functionality

Source Code: Attached is a file named set_granularity.txt (the extension has to be changed to .chpl) that attempts to create new iterators for block-distributed domains and arrays. The code for arrays has issues that I was not able to figure out, but the one for domains seems to work. I verified it by ensuring that the number of unique task IDs generated equals the number passed to the program. I simply copied every instance of `BlockDom.these` and `BlockArr.these` from modules/dists/BlockDist.chpl and renamed them to `BlockDom.withGranularity` and `BlockArr.withGranularity`.

Compile command: `chpl set_granularity.chpl`
Execution command: `./set_granularity -nl 4 --numTasks=16 --size=2048`
Associated Future Test(s): Unsure

Configuration Information

bradcray commented 8 months ago

As an update on set_granularity.txt not working: I was able to determine that the problem was due to failing to get each locale's privatized version of the BlockArr/BlockDom class, and with @benharsh's help determined that this was because the `these()` iterators go through the `_array.these()` wrapper, which takes care of such issues. set_granularity-blc.txt is a rewrite that avoids the locality issues in the OP. I am thinking about what more we could do here to reduce confusion and the chances of non-local accesses to privatized classes.

A slightly more detailed version of this appears in Gitter and the comments that precede it show some of the investigation that led to the conclusion.

benharsh commented 8 months ago

My initial thoughts regarding privatized classes lean towards a compiler-assisted "localize" call when crossing on-statements. This could potentially cause issues in the (rare?) case where a user does not want the local instance.

Something I've occasionally brought up is the notion of on-statements supporting a `with` clause to help define how variables are brought across locales. With such a feature we could define the default on-intent for privatized classes to be something like `local in`, and a user could override that behavior with a simple `in` or `const in`.
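For illustration, the proposed syntax (hypothetical, not legal Chapel today) might read something like:

```chapel
// Default for a privatized class: refer to the local private instance.
on loc with (local in myBlockArr) {
  // myBlockArr is loc's privatized copy
}

// Explicit override: bring a plain copy across instead.
on loc with (in myBlockArr) {
  // myBlockArr is a copy made at the on-statement boundary
}
```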

Something a bit less impactful on the language could simply be allowing for a local intent in the "this-intent" space on methods:

```chapel
proc local MyBlockArr.these(...) { ... }
```

These approaches rely on the compiler having some understanding of privatized classes, which does not yet exist today.

bradcray commented 8 months ago

> My initial thoughts regarding privatized classes lean towards a compiler-assisted "localize" call when crossing on-statements. This could potentially cause issues in the (rare?) case where a user does not want the local instance.

I'm glad to hear you say that, because that's where my thoughts have been going this week, to the extent that I often believe we already have this. But I think I'm getting mixed up w.r.t. the work we did on serializing slices (still off by default).

That said, I can imagine two interpretations of your statement:

1. when a privatized class crosses an on-clause, we serialize it by passing its pid across the wire and then deserialize it by converting that pid back into the local class instance (where my head is)
2. we don't proactively privatize classes at their declaration point (like today), but just serialize the classes as they cross on-clauses (which seems attractive, but seems like it could be expensive for things like descriptors that have numLocales-sized arrays)
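Interpretation (1) might be pictured with the following pseudocode sketch. `chpl_getPrivatizedCopy` is an internal-module routine; the rest of the names are mine, and this is not the actual serialization machinery:

```chapel
// Sketch of interpretation (1): only the integer pid crosses the wire;
// the receiving locale rehydrates its own pre-existing private instance.
on loc {
  // conceptually, the array descriptor arrived as just a pid...
  const localInstance = chpl_getPrivatizedCopy(unmanaged BlockArr, pid);
  // ...and all accesses inside the on-statement go through localInstance
}
```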

> These approaches rely on the compiler having some understanding of privatized classes, which does not yet exist today.

I think that makes sense, though, and seems likely in our future; certainly the current approach isn't great, and I think users will want/need similar capabilities without wanting to build them themselves. This also relates a bit to the notion of having a more straightforward way of declaring "per-locale" variables (which I think you were playing with a bit recently, and which I think Daniel's work on local static variables is going to make us want as well).

benharsh commented 8 months ago

> That said, I can imagine two interpretations of your statement:

I was thinking more about option 1 because we have a better understanding of that case, but option 2 is indeed attractive in its own way. I think it would be great if we could support first-class privatization in the language that was optionally proactive or lazy.