chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.79k stars 420 forks source link

`ChapelLocale` module: The `runningTasks()` and `maxTaskPar` methods. #20219

Closed jeremiah-corrado closed 2 years ago

jeremiah-corrado commented 2 years ago

In a recent module review meeting, we looked at locale.runningTasks() and locale.maxTaskPar which query the number of tasks currently running on the locale and the locales maximum number of tasks respectively.

These are commonly used together in patterns like this:

if (here.runningTasks() < here.maxTaskPar) {
// do something in parallel
} else {
// do something sequentially
}

Or their difference (here.maxTaskPar - here.runningTasks()) is used to guide the number of tasks spawned by a parallel construct.

A new method to make the above examples simpler?

We discussed the possibility of including a new method that queries the optimal number of tasks to invoke. The name and exact implementation details were not decided.

It was also noted that such a method would be portable to GPU sub-locales, as it would simply return 1 (there should only be one "task" running on a gpu). This avoids the challenge of developing sensible gpu-implementations for runningTasks and maxTaskPar.

Should runningTasks() and maxTaskPar stick around in conjunction with this new method? If so, we had the following notes about them:

runningTasks()

maxTaskPar

mppf commented 2 years ago

A new method to make the above examples simpler?

e.g. it could be

// How many tasks should be spawned to service numIterations iterations.
// while respecting dataParIgnoreRunningTasks and dataParMinGranularity variables
proc Locale.computeNumDataParTasks(numIterations: int): int

And here is a sketch of how it could be implemented (it could call into DSIUtil functions but IMO those should call this)

``` chapel proc Locale.computeNumDataParTasks(numIterations: int): int { if numIterations <= 0 { return 1; } if __primitive("task_get_serial") { return 1; } if this.isGPU() { return 1; } var maxTasks = if dataParTasksPerLocale==0 then here.maxTaskPar else dataParTasksPerLocale; var ignoreRunning = dataParIgnoreRunningTasks; var minIndicesPerTask = dataParMinGranularity; var numChunks = maxTasks:int; if !ignoreRunning { const otherTasks = here.runningTasks() - 1; // don't include self numChunks = if otherTasks < maxTasks then (maxTasks-otherTasks):int else 1; } if minIndicesPerTask > 0 then // This is approximate while (numIterations < (minIndicesPerTask*numChunks):EC) && (numChunks > 1) { numChunks -= 1; } } if numChunks > numIterations then numChunks = numIterations; return numChunks; } ```
e-kayrakli commented 2 years ago

@mppf Your point about having such a helper isn't about deprecating maxTaskPar and runningTasks, though right?


EDIT: My use of the term "concurrent" is wrong below:

I think those two methods can use better and more symmetrical names. I am afraid I don't really have good candidates on my mind at the moment though. maxNumTasks sounds like there's a cap for number of tasks you can run within a locale, which is not the case. The cap is on number of tasks you can run concurrently. I keep thinking about maxConcurrency and curConcurrency. Typing "concurrent" definitely would slow the users down, but I'm hoping that these aren't used that much in the code to shoot for short names. "curConcurrency" is also a bit of a tongue-twister...

mppf commented 2 years ago

@mppf Your point about having such a helper isn't about deprecating maxTaskPar and runningTasks, though right?

Yes. We could consider deprecating them if we had such a helper but I don't feel strongly about it.

bradcray commented 2 years ago

[.runningTasks()] should be a paren-less method?

I don't think it should be, as this is the kind of thing that can vary dynamically from one call to the next (even with no intervening code) and, to me, the parenthesis imply this sort of thing, where you're asking a computation to be performed to determine the answer. In contrast, dropping the parenthesis makes it feel more like a static, unchanging property of the locale, which it's not.

If I were trying to make them more symmetric, I'd probably add parenthesis to maxTaskPar, though to me that is more like a fixed property of the hardware/runtime, so appropriate to make look like a field (or even to implement as one).

If we were to add helpers to compute an appropriate number of tasks, I think that should be done via a standard library routine, not something as intrinsic as a method on a locale. For example, I can think of lots of heuristics for how many tasks should be created for a given context, so would be reluctant to bake any into the locale type. Even the question "How many tasks should be spawned to service numIterations iterations?" from the comment feels a little sketchy to me in the sense of "Well, if the iterations are super-fast, you may not want to spawn any..." (I know that Michael knows this, I'm just saying that I don't think we should imply that a routine can figure out the ideal number for you). That said, I'd support creating such a utility and then using it in our parallel these() iterators (sort of like how we created the RangeChunk library, but then we failed to use that in many of our these() iterators).

If we add new helpers, I don't think we should deprecate these very basic queries, which seem to me to be far more fundamental ("How many things can run in parallel?" "How many things are running right now?"). I think maxTaskPar is used far more frequently than anyone suggesting it be deprecated realizes, by users and in our example codes (e.g., any multicore hello world program).

I also don't find the current asymmetry in the names off-putting, largely for reasons Engin alludes to: They are talking about related, but different, things: The number of parallel contexts for running tasks vs. the number of tasks that are currently running.

If I were renaming either one, I'd probably focus on runningTasks(), as it's never obvious to me whether these tasks have to be actively running (such that the value will always be <= maxTaskPar), or are simply ready to run whenever a thread becomes available, or have started running but may be blocked on some synchronization event. I'd need to read the documentation to feel confident which one it is, which suggests that maybe the name isn't clear (unless it's the first of these three, in which case maybe it's OK).

jeremiah-corrado commented 2 years ago

To pick this discussion back up, I think that the new helper method can be pushed off to post 2.0, but we should definitely land on better names for both methods if necessary.

runningTasks()

I agree with Brad's point that it's not clear from the name whether you are querying the number of actively running tasks, or the number of spawned (but potentially suspended) tasks, etc. The current documentation says: "the number of tasks that have begun executing, but have not yet finished".

To me, runningTasks() or even numRunningTasks() sounds more like the number of actively executing tasks (not including those that might be suspended/ sharing threads).

So I would propose the following alternatives:

maxTaskPar

This name might be fine; however, per Engin's suggestions above, I like the word concurrency better than parallelism here. (Since we are talking about tasks, they don't necessarily have to be running in parallel, but they are running concurrently). So here are some possible alternatives:


Overall, I'm leaning towards the following pair of names, as I think the english interpretation matches what is actually being queried pretty closely:

if here.numSpawnedTasks() < here.maxConcurTasks {
   // do something in parallel
}

I'll open a poll below to gauge people's interest in changing either of the names.

jeremiah-corrado commented 2 years ago

runningTasks()

jeremiah-corrado commented 2 years ago

maxTaskPar

bradcray commented 2 years ago

This name might be fine; however, per Engin's suggestions above, I like the word concurrency better than parallelism here. (Since we are talking about tasks, they don't necessarily have to be running in parallel, but they are running concurrently).

This doesn't seem right to meβ€”like you're arguing against your own point almost? (though it arguably depends on your definitions of parallelism vs. concurrency, which could differ from mine). If we consider time-based interleaving as a valid mode of concurrency, the number of concurrent tasks is limited only by the number we have the memory to store, isn't it? Whereas the point of this query is literally intended to be "How many tasks can the hardware run in parallel / simultaneously" e.g., without relying on time-based slicing? Now arguably, when we enable hyperthreading, we're starting down a path from parallelism to fine-grain concurrency, but to me this still feels more like actual parallelism ("How much can I run at once, simultaneously?") than concurrency ("How much can I run in an interleaved fashion?")

For that reason, maxConcurTasks doesn't seem right to me.

bradcray commented 2 years ago

The current documentation says: "the number of tasks that have begun executing, but have not yet finished".

If we keep this definition, then I don't think numSpawnedTasks() works because (to me at least) that suggests it would include tasks that have been created, but not yet started executing. That said, that query actually seems like a more valuable one to me, particularly in contexts where I'm asking "Should I spawn more tasks or have I already created enough / too many?" But if I were wanting to name that query, I'd probably just call it numTasks() for simplicity.

Googling the standard OS pictures for transitions between thread states, numRunningTasks() seems not great since our count includes those that are blocked or ready-but-not-just-created. Brainstorming, numScheduledTasks() or numActiveTasks() sounds slightly more accurate to me, but again, I mostly wonder whether we are counting the wrong things here (or not counting the thing that is people might most want to know).

@ronawho is out currently, but I'm definitely keen for his thoughts on this one given that I think he understands the task lifetimes under qthreads and fifo better than most of us (e.g., he might have a better term than "Running" or "Scheduled" or be able to say why those names are appropriate. And could also probably give a more authoritative answer as to whether what we're counting is reasonable or what I'm suggesting we might want to count is impractical.

jeremiah-corrado commented 2 years ago

Whereas the point of this query is literally intended to be "How many tasks can the hardware run in parallel / simultaneously"

I guess I was thinking of this call as answering the question "How many tasks can the hardware reasonably run concurrently (including hyperthreading)?" Where I was considering hyperthreading to be a form of concurrency rather than parallelism.

Mainly, I just associate "task" with "concurrency" in my mind, so the use of the abbreviation "par" in the name feels off to me.

But I definitely take your point that conceptually parallelism is still the thing being asked about:

but to me this still feels more like actual parallelism ("How much can I run at once, simultaneously?")

e-kayrakli commented 2 years ago

@bradcray

I don't think it should be, as this is the kind of thing that can vary dynamically from one call to the next (even with no intervening code) and, to me, the parenthesis imply this sort of thing, where you're asking a computation to be performed to determine the answer. In contrast, dropping the parenthesis makes it feel more like a static, unchanging property of the locale, which it's not.

A little bit of a general guideline question, but I think the dynamic nature of a return value of a method shouldn't impact whether it is parenful or not. list.size is parenless, but it is dynamic. The criterion we put in the style guidelines (which is totally subject to change) is if something is reasonably implementable as a field vs something that may require computation. To me, runningTasks sounds much closer to something that can be implemented as a field (an atomic field that's bumped up and down as we spawn tasks could be an implementation, for example).


Re concurrency vs parallelism: I goofed in my initial comment that brought up "concurrency" in the picture. Edited my comment. Sorry about the confusion.

This issue is thornier than I had hoped initially. Mainly because I was putting the two concepts closer than they actually are in my head, causing me to ask for better symmetry. I'm more OK with keeping maxTaskPar right now. Based on Brad's musings about tasks that are at various states, I wonder if we should go with numTasks as he suggested but add an argument to it. Something like:

enum taskQueryState { created, ready, running, blocked, .... }
proc numTasks(state = taskQueryState.created) { } // can't tell if this default 
                                                  // corresponds to what we have today

If we were to take that path, I would prefer if maxTaskPar was named maxParallelTasks. But it is not a big difference and I don't feel strongly enough to argue for making the change.

bradcray commented 2 years ago

think the dynamic nature of a return value of a method shouldn't impact whether it is parenful or not. list.size is parenless, but it is dynamic.

I agree with that principle in general, but still tend to disagree in this specific case because of how indirect the query is. E.g., for a list if I have:

writeln(list.size);
... do a bunch of stuff that never relates to `list` ...
writeln(list.size);

I'd be surprised if I got a different answer out. Yet for this query, I could write:

writeln(here.numRunningTasks());
... do a bunch of stuff that never relates to `here`, such as...
begin foo();
begin bar();
writeln(here.numRunningTasks());

and get a completely different answer out. In fact, I could even get different answers for:

writeln(here.numRunningTasks());
writeln(here.numRunningTasks());

if there were pre-existing tasks that were terminating. So, for me, it's less the dynamic nature of the variable's value that suggests using the parenthesis and more the fact that it's a value whose changes are so tied to language concepts and independent of the object to which it's being applied that make it feel more like a "go query this thing from the system [run some code]" paren-ful style to me than a "go look up this field." I admit that this is a taste issue and that there's plenty of grey area.

(I also suppose a similar argument could be made for lists (i..e., maybe there are pre-existing tasks modifying the list that would cause two list.size queries to differ), but that still seems more related to the nature of how the program is using the lists than the nature of lists and the .size query itself to me).

I wonder if we should go with numTasks as he suggested but add an argument to it.

I feel like we recently got rid of a bunch of num*Tasks() queries recently, though I can't remember whether it was because they were difficult to implement well or not used in practice or something else. But I think we should refresh our memories on that before adding the queries back in.

jeremiah-corrado commented 2 years ago

The following is a brief summary of this thread -- for my own benefit and @ronawho's (who we have asked to weight in about the runningTasks() method):

maxTaskPar:

runningTasks():

Adding a New Method:

ronawho commented 2 years ago

(note that I was typing up this comment before the above one, and didn't see it before posting -- I don't think it changes my response though)

@ronawho is out currently, but I'm definitely keen for his thoughts on this one given that I think he understands the task lifetimes under qthreads and fifo better than most of us (e.g., he might have a better term than "Running" or "Scheduled" or be able to say why those names are appropriate. And could also probably give a more authoritative answer as to whether what we're counting is reasonable or what I'm suggesting we might want to count is impractical.

I know the current state of things, but I also think we should keep our wording/naming vague in order to allow changes in the future. The current implementation is done entirely in the compiler and module code and there currently are no hooks into the runtime so this is always an approximation and was done in order to answer the question of "how many tasks should I spawn in an upcoming parallel region". I might also describe the counter as how many tasks are about to be "running" where running would include actually running, blocked, or enqueued tasks.

Most of the current implementation was done in https://github.com/chapel-lang/chapel/pull/7193 and then there's been probably half a dozen or more bug fixes because there's always one more bug for the running task count. https://chapel-lang.org/releaseNotes/1.16/08-benchmarks-opts.pdf starting on slide 47 may have some more info.

This may be incomplete, but I think we:

Some code examples (I can sketch the pseudo code for how we generate this if that'd be valuable)

writeln(here.runningTasks()); // 1

coforall 1..4 {
  writeln(here.runningTasks()); // always 4 even if some tasks have finished before the current task prints
}

// would be same for an unbounded coforall (e.g. coforall 
sync {
  for 1..4 do begin {
    writeln(here.runningTasks()); // can print 1, 2, 3, or 4 depending on task order and completion times
  }
}

writeln(here.runningTasks()); // 1
on Locales[1] {
  writeln(here.runningTasks()); // 1 and the count on Locale 0 is 0
}

The bounded coforall incrementing before starting any tasks was so that we wouldn't oversubscribe for nested task creation. i.e. we wanted

coforall 1..here.maxTaskPar {
  if here.runningTasks < here.maxTaskPar {
    begin myComputatio();
  } else {
    myComputation();
  }
}

to be consistent and not spawn additional tasks just because one task started before all the others were created. The example above is silly, a real use-case is nested foralls that limit task creation based on the amount of tasks already running.

I feel like we recently got rid of a bunch of num*Tasks() queries recently, though I can't remember whether it was because they were difficult to implement well or not used in practice or something else. But I think we should refresh our memories on that before adding the queries back in.

I think this is https://github.com/chapel-lang/chapel/pull/19465, which removed locale.blockedTasks(), .queuedTasks(), .idleThreads(), and .totalThreads()

jeremiah-corrado commented 2 years ago

Thanks Elliot! That certainly clears up a lot of my confusion.

I know the current state of things, but I also think we should keep our wording/naming vague in order to allow changes in the future. The current implementation is done entirely in the compiler and module code and there currently are no hooks into the runtime so this is always an approximation and was done in order to answer the question of "how many tasks should I spawn in an upcoming parallel region"

So are you thinking that the current name is ok? Would something even more general like numTasks() be better in your opinion?

ronawho commented 2 years ago

I'd personally be fine sticking with runningTasks() and making the documentation vague. As an example maybe in the future we'd like to remove blocked tasks from the task count and I think "running" is ambiguous enough that you could justify the name in either world.

I agree with Brad's notion that here.maxTaskPar is used a lot and I like that name, so I'd be inclined to stick with that. I think runningTasks is much less often used by end-users and so we could change if we wanted to, but none of the other suggestions feel obviously better to me. If we had something like computeNumDataParTasks I think we'd have even less use for runningTasks, which I think is almost always used to compute "how many tasks should I spawn" today.

jeremiah-corrado commented 2 years ago

I'd personally be fine sticking with runningTasks() and making the documentation vague.

Ok sounds good. Given what you've said about potential changes to the implementation in the future, I would agree that keeping it vague, is the best thing from a stabilization perspective.

I'll wait a while (at least until after Brad gets back from vacation) in case anyone else wants to chime in, but I would be inclined to make some changes to the documentation, and then close this issue. I'd also create a separate non-2.0 issue for the computeNumDataParTasks method.

Here is a quick mockup of what I'm thinking the docs for runningTasks() might look like:


proc locale.runningTasks(): int

Get the number of tasks running on this locale.

This method is intended to guide task creation during a parallel section. If the number of running tasks is greater than or equal to the locale's maximum task parallelism (queried via maxTaskPar), then creating more tasks is unlikely to decrease walltime.

Returns: the number of tasks that have begun executing, but have not yet finished Return Type: int

Note that this number can exceed the number of non-idle threads because there are cases in which a thread is working on more than one task.


The current docs for comparison: https://chapel-lang.org/docs/language/spec/locales.html#ChapelLocale.locale.runningTasks

bradcray commented 2 years ago

at least until after Brad gets back from vacation

I don't have any big concerns with leaving things as-is, but I also wasn't advocating for renaming/formatting these, so am probably not the right person to ask.

We could also mark runningTasks() unstable if we feel sheepish about keeping its meaning a bit vague and/or to reflect the approximate nature of its results.

jeremiah-corrado commented 2 years ago

I think I was initially in favor of renaming these; however, I've since changed my mind. I don't want to speak for Engin too much, but looking back through the thread, it looks like he also changed his mind from rename to don't rename--roughly speaking.

We could also mark runningTasks() unstable if we feel sheepish about keeping its meaning a bit vague and/or to reflect the approximate nature of its results.

Personally, I don't have any stability concerns wrt vague documentation. Specifically, I don't think we'd be breaking anything if this method is modified to give a more specific task count. We'd just be making it better at what it currently does. (kind of like improving the randomness of an array.shuffle() method or something)