chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

[Discussion] Should iteration over distributed domains in shared memory follow the same pattern used in data distribution? #13322

Open LouisJenkinsCS opened 5 years ago

LouisJenkinsCS commented 5 years ago

To demonstrate what I mean by example, think of a Cyclic distribution. If a domain is domain-mapped over a Cyclic distribution, should iteration over such a domain also distribute work across tasks in a cyclic fashion? That is, should it both A) distribute data cyclically, so that each locale owns its own stride over the domain, and B) distribute computation cyclically, so that tasks also own strides over their own local subdomain?

This is related to #13305 in that I believe this type of behavior should be supported by default. Right now, iterating over a Cyclic domain-mapped domain results in a pseudo-block distribution of work for the tasks iterating over the local subdomain. I think those tasks should follow the distribution pattern they are named after: cyclic for Cyclic, block for Block (the behavior we currently have), block-cyclic for BlockCyclic, and so on.
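
To make the A/B distinction concrete, here is a minimal sketch (assuming a single locale and two tasks; D and numTasks are illustrative names, not an existing API) contrasting a block-style split of the local indices with the cyclic-style split being asked about:

const D = {1..10};
const numTasks = 2;

// block-style split: each task takes one contiguous chunk of the indices
coforall tid in 0..#numTasks {
  const chunkSize = D.size / numTasks;
  const lo = D.low + tid * chunkSize;
  writeln("block:  task ", tid, " gets ", lo..#chunkSize);
}

// cyclic-style split: each task takes a stride over all of the indices
coforall tid in 0..#numTasks {
  writeln("cyclic: task ", tid, " gets ", (D.low + tid)..D.high by numTasks);
}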

mppf commented 5 years ago

I'm having trouble understanding what this issue is asking for. Could you show a short example?

LouisJenkinsCS commented 5 years ago
use CyclicDist;
var cyclicDom = {1..10} dmapped Cyclic(startIdx=1);
var tidCounter : atomic int;
// each task grabs a unique id once via the task-private 'tid' in the with-clause
forall i in cyclicDom with (var tid = tidCounter.fetchAdd(1)) do writeln(tid, " has ", i);

TIO

Output

3 has 1
4 has 6
3 has 2
4 has 7
3 has 3
4 has 8
3 has 4
4 has 9
3 has 5
4 has 10

Expected Output

3 has 1
4 has 2
3 has 3
4 has 4
3 has 5
4 has 6
3 has 7
4 has 8
3 has 9
4 has 10

Note that right now, task 3 has indices 1..5 and task 4 has indices 6..10. This is what I meant by 'pseudo-block distribution'; I would want task 3 to have 1..10 by 2 and task 4 to have 1..10 by 2 align 2. That is, a cyclic distribution of indices across tasks to match the Cyclic distribution of the data.
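
As a quick check of which indices those two strided ranges cover (a small sketch using plain ranges, independent of any distribution):

for i in 1..10 by 2 do write(i, " ");         // 1 3 5 7 9
writeln();
for i in 1..10 by 2 align 2 do write(i, " "); // 2 4 6 8 10
writeln();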

bradcray commented 5 years ago

It's not obvious to me that this is necessarily the right thing to do (which is not to say that a distribution author couldn't choose to do it). While it might help in some cases, like the specific one that motivated you to post this issue, it could hurt in others, for example through false sharing between tasks that access elements in an interleaved manner rather than grabbing large chunks of contiguous items. It does suggest that a distribution's documentation should specify how local iteration is handled so that performance-minded programmers can reason about it.
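
To sketch the false-sharing concern (assuming a single locale, two tasks, and a small int array; the names here are illustrative): with an interleaved split, adjacent elements, which typically share a cache line, are written by different tasks, whereas a block split keeps each task's writes contiguous.

var A: [1..16] int;
const numTasks = 2;

// interleaved split: task 0 writes A[1], A[3], ...; task 1 writes A[2], A[4], ...
// neighboring elements share cache lines, so the tasks' writes keep
// invalidating each other's cached copies
coforall tid in 0..#numTasks {
  for i in (1 + tid)..16 by numTasks do
    A[i] = tid;
}

// block split: each task writes one contiguous half, so most cache lines
// are touched by a single task
coforall tid in 0..#numTasks {
  const chunk = 16 / numTasks;
  const lo = 1 + tid * chunk;
  for i in lo..#chunk do
    A[i] = tid;
}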

LouisJenkinsCS commented 5 years ago

Gotcha. Then I suppose I should ask this: if it is not necessarily ideal to make the distribution-specific iteration schemes the default (i.e., Cyclic performing cyclic iteration, Block performing block iteration, BlockCyclic performing block-cyclic iteration), would it be acceptable for me to propose that a specific iterator be added to the Chapel standard distributions as a method (not necessarily as part of the 'dynamic iterators')? For example, iter CyclicDom.cyclic(), iter CyclicArr.cyclic(), iter BlockDom.block(), iter BlockArr.block(), etc. Preferably it would have a name common to all distributions that simply reflects the type of distribution used, maybe just chunks?

use CyclicDist;
var cyclicDom = {1..10} dmapped Cyclic(startIdx=1);
var tidCounter : atomic int;
// proposed 'chunks()' iterator: hand each task a cyclic slice of the indices
forall i in cyclicDom.chunks() with (var tid = tidCounter.fetchAdd(1)) do writeln(tid, " has ", i);
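
For comparison, the cyclic assignment of indices to tasks that such an iterator would provide can be roughly emulated today with an explicit coforall over tasks and strided ranges (a sketch assuming a single locale and two tasks; numTasks is an illustrative name):

use CyclicDist;
var cyclicDom = {1..10} dmapped Cyclic(startIdx=1);
const numTasks = 2;
coforall tid in 0..#numTasks {
  // each task walks its own stride over the whole index set
  for i in (cyclicDom.low + tid)..cyclicDom.high by numTasks do
    writeln(tid, " has ", i);
}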