lydia-duncan opened 2 years ago
Good point about optimizing for remote calls. Aggregators and remote caching come to mind. Since we want to maintain programmer productivity, it is good to encapsulate such details away.
An elegant way is with a function that accepts a closure indicating how to combine two values, e.g.:
var combo = DistributedMap.createFrom(m1, m2, lambda(k, v1, v2) { return v1 + v2; });
Doing this using an iterator is currently more chapel-rific. So we will want to provide a "combining" iterator that yields tuples of key+values:
var combo = new DistributedMap(...);
forall (k, v1, v2) in DistributedMap.tuplesOfValues(m1, m2) {
combo.add(k, v1+v2);
}
or bulk-add all keys up front:
var combo = new DistributedMap(DistributedMap.commonKeys(m1, m2), ...);
forall (k, v1, v2, co) in DistributedMap.tuplesOfValues(m1, m2, combo) {
co = v1+v2;
}
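For reference, the intended semantics of such a combining iterator can be sketched serially in Python (the `DistributedMap` API above is hypothetical; plain dicts stand in for the maps, and a real implementation would iterate in parallel and aggregate remote accesses):

```python
def tuples_of_values(m1, m2):
    """Yield (key, v1, v2) for every key present in both maps.

    A serial stand-in for the proposed combining iterator; the
    distributed version would parallelize and batch remote lookups.
    """
    for k, v1 in m1.items():
        if k in m2:
            yield k, v1, m2[k]

m1 = {"a": 1, "b": 2, "c": 3}
m2 = {"b": 10, "c": 20, "d": 30}
combo = {k: v1 + v2 for k, v1, v2 in tuples_of_values(m1, m2)}
# combo == {"b": 12, "c": 23}
```

This sketch takes the intersection of the key sets; as discussed below, what to do with keys present in only one map is a separate design question.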
Further thoughts:
If such an iterator, or a variant thereof, yields values by reference, it will be a convenient way to update two maps in parallel, an extension of #18962.
Part of the benefit of the above createFrom or tuplesOfValues is that they can provide aggregation+caching without users worrying about it.
Also likely useful: for combo to include keys that are in either m1 or m2. The user is to specify how to calculate the new value when the key is only in m1 or only in m2.
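A hedged Python sketch of that "keys in either map" variant (the function and parameter names here are made up for illustration; they are not part of any proposed Chapel API):

```python
def combine_union(m1, m2, both, only1, only2):
    """Combine two maps over the union of their key sets.

    `both(v1, v2)` handles keys present in both maps; `only1(v1)` and
    `only2(v2)` say how to produce the value when the key is present
    in just one of the two maps.
    """
    out = {}
    for k, v1 in m1.items():
        out[k] = both(v1, m2[k]) if k in m2 else only1(v1)
    for k, v2 in m2.items():
        if k not in m1:
            out[k] = only2(v2)
    return out

m1 = {"a": 1, "b": 2}
m2 = {"b": 10, "c": 20}
combo = combine_union(m1, m2,
                      both=lambda v1, v2: v1 + v2,
                      only1=lambda v1: v1,
                      only2=lambda v2: v2)
# combo == {"a": 1, "b": 12, "c": 20}
```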
Yeah, I think that's an interesting proposal! I suspect we'll probably want something along those lines, but in addition to the map methods that are present on serial-only maps. But if we provide less-optimized versions that can be called on any map, regardless of whether it is set up for parallelism or distribution, while our recommendation is to use the more optimized versions, are we fulfilling the spirit of that ideal? I don't know if I have an answer for that.
Good points. Ideally, the interface can be the same for any map type and there is no need for less optimized versions.
> Doing this using an iterator is currently more chapel-rific. So we will want to provide a "combining" iterator that yields tuples of key+values:
Is that what zippered iteration is supposed to do?
> Is that what zippered iteration is supposed to do?
Semantically, yes (*). Then, what do we need for zippered iteration to perform on par with a specialized iterator?
(*) What is the behavior of zippered iteration when the set of keys in m1 differs from that in m2? Is that an "out of bounds" error, or do keys in m1 prevail, or ...? When we define a specialized iterator, we can allow the user to choose among these behaviors, etc.
> What is the behavior of zippered iteration when the set of keys in m1 differs from that in m2?
I have trouble engaging with this question without delving into it.
But I suspect it would be an out of bounds error.
We can view this as a generalization of building a distributed histogram, e.g.:
combo.bulkUpdate(inputs = m1,
                 input2key = lambda(item) { return item(0); },
                 updater = lambda(item, ref element) {
                   const ref k = item(0);
                   if m2.contains(k) then element = item(1) + m2.getValue(k);
                 });
This approach lets us leverage whatever solutions we develop for distributed histograms. It doesn't (yet?) do anything smart to batch the queries to m2.
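To make the proposed bulkUpdate semantics concrete, here is a serial Python sketch; the names mirror the hypothetical Chapel API above, and a mutable `Slot` cell stands in for the `ref element` passed to the updater:

```python
def bulk_update(target, inputs, input2key, updater):
    """For each input item, ensure `target` has a slot for its key,
    then let `updater` mutate that slot.

    Serial stand-in for the proposed bulkUpdate; a distributed version
    would batch/aggregate per-key updates.  Note: every key gets added
    to `target`, even when the updater leaves its slot untouched.
    """
    class Slot:                      # mutable cell playing `ref element`
        def __init__(self, value=0):
            self.value = value
    slots = {}
    for item in inputs:
        slot = slots.setdefault(input2key(item), Slot())
        updater(item, slot)
    for k, slot in slots.items():
        target[k] = slot.value

m1 = {"a": 1, "b": 2}
m2 = {"b": 10, "c": 20}
combo = {}
bulk_update(combo,
            inputs=m1.items(),
            input2key=lambda item: item[0],
            updater=lambda item, slot: (
                setattr(slot, "value", item[1] + m2[item[0]])
                if item[0] in m2 else None))
# combo == {"a": 0, "b": 12}: "a" was force-added with a default value,
# illustrating the shortcoming discussed next.
```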
This task also shows a shortcoming of the current state of the bulkUpdate() approach: it forces the addition of each key to the receiving map, so that a reference to the corresponding value slot can be passed to the updater FCF. The solution could be to add a separate bulkUpdate method that accepts a predicate indicating whether an element needs to be added for a given key (if it does not exist yet).
> a shortcoming of the current state of the bulkUpdate() approach: it forces the addition of each key to the receiving map
Here is another solution: have the updater FCF accept a "callback" object instead of the element reference. The FCF invokes API methods on the object to query the presence of the element, add it, remove it, initialize it, update it, etc.
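A hedged Python sketch of that callback-object idea (all class, method, and parameter names here are hypothetical, chosen only to illustrate the shape of the API):

```python
class SlotCallback:
    """Callback object handed to the updater instead of a raw reference.

    The updater queries/creates/removes the element through this API,
    so the map no longer has to pre-add a slot for every key.
    """
    def __init__(self, store, key):
        self._store, self._key = store, key

    def exists(self):
        return self._key in self._store

    def get(self):
        return self._store[self._key]

    def set(self, value):            # adds the element if absent
        self._store[self._key] = value

    def remove(self):
        self._store.pop(self._key, None)

def bulk_update(target, inputs, input2key, updater):
    for item in inputs:
        updater(item, SlotCallback(target, input2key(item)))

m1 = {"a": 1, "b": 2}
m2 = {"b": 10, "c": 20}
combo = {}

def updater(item, cb):
    k, v1 = item
    if k in m2:                      # only add keys we actually combine
        cb.set(v1 + m2[k])

bulk_update(combo, m1.items(), lambda item: item[0], updater)
# combo == {"b": 12}; "a" was never added, unlike the forced-add version.
```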
The next step is to figure out how to deliver this solution to the user, i.e., to make it easy for the user to transition from traditional code to code manipulating the callback object.
We may want to define three overloads of bulkUpdate:
A variant of this idea is discussed in https://github.com/chapel-lang/chapel/issues/12306#issuecomment-1092263968
This issue is part of a series of issues to design the interface for collections across serial, parallel, and distributed contexts. Our goal is to determine what we think is reasonable to support and what doesn't seem useful. Additional patterns welcome (but it would be best to put them in their own issue for discussion, likely)