Closed stonea closed 1 year ago
It makes sense to me to remove these and instead rely on other modules like Sort and Search to implement these.
I'm fine with having them no longer be defined on arrays themselves as long as their replacement is made available at the same time.
I would definitely deprecate these, and am not sure I agree with Lydia's sentiment that replacements for them are strictly necessary. Specifically:
So I'd be more inclined to simply deprecate them (I'm not convinced they're things we want to be in the business of supporting), potentially giving hints as to how to rewrite them, or inviting users to pester us if they want to advocate for retaining any of them.
find() is kind of goofy as currently written ("iterate over the array serially...") and would be better expressed as a reduction
That's interesting but I don't think we currently have a standard-module provided reduction that would do that, right? I guess you could write some combination of a forall expression over indices and minloc
/ maxloc
? It seems to me that if it were an operation we thought were common we might want to provide a reduction that does it.
It seems to me that if it were an operation we thought were common we might want to provide a reduction that does it.
That was my thinking as well. I mostly think that providing a find()
in a parallel language where the implementation is serial seems like a bad starting point, which is why I'm guessing nobody relies on it, and why I'm not keen on committing to interruption-free support of such a pattern. That's why I was suggesting deprecating it, inviting users to argue that it's important, and only implement it if anyone does.
I've made a branch where I deprecate these functions and tried running paratest on it. These are the tests we're going to have to update once we realize this change:
@stonea: That's more failures than I was guessing we'd see. Do you happen to have the .log
file around for this run so I could see the failures in more detail? (e.g., I'm wondering whether there's a common code path or two that accounts for a large number of the failures). For example, it appears that associative domains/arrays may rely on one of these routines and account for the lion's share, such that a small edit or two to that module could reduce the number of failures greatly.
sorted
I'm taking the approach now of commenting out one function at a time and then manually looking at the errors to see the patterns.
Starting out with sorted
here's the failing tests:
Seems like an "easy" way to update the tests would be to just add an iterator like this at the bottom of each one:
iter sortedArray(A) {
use Sort;
var copy = A;
sort(copy);
for ind in copy do yield ind;
}
And then do something like sed/s/sort\((.*)\)/sortedArray\(\1\))/g
to rewrite each one (I haven't verified if this is the right regexp, but you get the idea).
I started manually updating the first couple of tests and hit upon some problems:
Before proceeding I kind of want to get feedback about if this makes us reconsider removing this method (or if we think adding a sorted iterator to the sort module makes more sense).
reverse
For reverse
the test fallout isn't too bad:
arrays/diten/arrayReverse
arrays/userAPI/arrayOps2D
constrained-generics/basic/set2/resolve-implements-types
Reconstructing reverse()
manually isn't totally trivial as it works by updating the array in
place rather than cloning it, iterating through the original array backwards, and copying it in.
Here's what it looks like as a standalone function is
proc reverse(ref A) {
const lo = A.domain.low,
mid = A.domain.sizeAs(A.idxType) / 2,
hi = A.domain.high;
for i in 0..#mid {
A[lo + i] <=> A[hi - i];
}
Anyway I have a commit on my cloned repos where I update the tests here, I can turn this into a PR later:
https://github.com/stonea/chapel/commit/4a4310ced2b7d60dea177a9c09320a91afa09297
The only part I'm questioning has to do with resolve-implements-types.chpl
. I think the interface of this test just needs to use some arbitrary method on the array type so I changed it from reverse
to isEmpty
.
find
For find
there's a moderate number of failing tests:
@bradcray, when you mention replacing calls to find()
with a reduction were you envisioning a reduce intent in a forall loop, something like this:
proc find(ref A : [], const ref needle) {
var foundAtIdx = max(A.eltType);
forall i in A.domain with (min reduce foundAtIdx) do
if(A[i] == needle) then foundAtIdx = i;
return (foundAtIdx != max(int), foundAtIdx);
}
Or were you thinking we should create a new kind of reduction so that users could express it with something like this:
x = findloc(needle) reduce zip(A, A.domain);
I'm running into difficulties trying to figure out how I could do that. The closest I could come up with is something like this where I hard-code in the value we're searching for:
class findloc: ReduceScanOp {
type eltType;
var needle = 30; // How can I not hard code this?
var value = max(int);
proc identity return max(int);
proc accumulate(x) { accumulateOntoState(value, x); }
proc accumulateOntoState(ref state, x : eltType) {
if x[0] == needle then
state = x[1];
}
proc accumulateOntoState(ref state, x : int) {
state = min(x, state);
}
proc combine(x) { accumulateOntoState(value, x.value); }
proc generate() return value;
proc clone() return new unmanaged findloc(eltType=eltType);
}
var A : [1..5] int = [10, 20, 30, 40, 50];
writeln(findloc reduce zip(A, A.domain));
If I simply add an init method like this and update the clone method like this:
proc init(type eltType, needle : eltType) {
this.eltType = eltType;
this.needle = needle;
}
I'll get these errors:
foo.chpl:29: error: unresolved call 'findloc.init(eltType=type 2*int(64))'
foo.chpl:6: note: this candidate did not match: findloc.init(type eltType, needle: eltType)
foo.chpl:29: note: because call does not supply enough arguments
foo.chpl:6: note: it is missing a value for formal 'needle'
So I'm not sure what the Chapeltastic way to do this is.
For reverse the test fallout isn't too bad:
When I saw you started with the sorted case, my first reaction was "That sounds like the hardest one to start with!" :D
Focusing on find()
for the time being:
Looking at find()
with fresh eyes today, I don't find it as objectionable as the others. Specifically, sorted()
seems annoying because it brings Sort
into every compile that uses an array (which is all of them) and therefore more appropriate to have live in the Sort
module (as a tertiary method? or just a standalone routine?). reverse()
seems annoying in its 1D-ness. find()
and count()
merely feel superfluous to me, but not as inherently objectionable, and I admittedly have more parallel experience than many new users. It's also notable that they're not 1D-specific. It may be that the serial implementation was the most objectionable part to me, or that my unhappiness with the first few bled into find()
and count()
as well (?). And there is something to be said for supporting similar routines on arrays as strings when appropriate.
In suggesting a reduction be used instead, I didn't mean to imply that we needed to come up with that reduction ourselves or even necessarily give users a replacement for find()
(instead, if we were to get rid of it, I'd just deprecate it and then help anyone who comes asking for a find
).
I think the following reduction is pretty close to implementing the body of find()
:
...maxloc reduce zip(A == needle, A.domain)...
where the A == needle should result in a (conceptual) array of booleans and (I think?) our maxloc
guarantees returning the earliest index in the array? (so, the first true or "1"). If this were the body of find
, some minor work might need to be done to make the result of the reduction match the interface of find()
(?).
Again, looking at this again today, I'm wondering whether, if find() had been written this way (and count as + reduce (A == needle)
), I wouldn't have been so likely to throw it under the bus above. It might be worth asking @mppf since he seemed interested in getting rid of these before I weighed in.
If we were to remove find(), I think most of the tests are pretty easy to clean up:
Any userAPI
tests that I wrote (which I think is most / all of them) are testing the user API for arrays/domains/whatever, so the "fix" for them would be to remove the calls to these routines from them since the routines will no longer be part of the user API and to update their .good files . Similarly, arrays/diten/arrayFind
looks like a test added when find
was added, so should similarly be removed.
performance.chpl
seems to be using find()
as something of a heavyweight contains()
?
I'm guessing the find() in modules/dists/SparseBlockDist.chpl
is the cause of the last four failures and is commendable in its use of it since... checking what Block does, it has a little homegrown utility to do the same thing (though it probably also predated find()
:
proc Block.chpl__locToLocIdx(loc: locale) {
for locIdx in targetLocDom do
if (targetLocales[locIdx] == loc) then
return (true, locIdx);
return (false, targetLocDom.first);
}
This feels like a common idiom ("Where is my locale in the set of targetLocales?") so ideally something we'd write once and reuse everywhere rather than re-implementing for each distribution.
Of course, if we were to keep find(), we should just have Block
use it instead.
Back on sort(), rather than looking at just removing it, I'd see whether we can move it to the Sort
module somehow (which is a package module, so arguably just kicking the question down the road... but it addresses my main objection to the method by only bringing in Sort
when a user wants to do some sorting). Ideally that would mean not changing the tests, or maybe just not changing them much (?).
Gotta run, hope I edited this into a reasonable form before timing out.
Count's implementation is a simple reduction so it would be pretty easy to "manually inline" this wherever it's used:
proc count(val: this.eltType): int {
return + reduce (this == val);
}
We get a moderate number of failures if I just remove the method (mostly under userAPI
):
I'll have to wait until post code freeze to merge anything but I have PRs for these changes ready to go:
Seeing #20140 get merged makes me realize that my "Hmm, should we really be removing find()
(and possibly count()
) rather than just improving their implementation...?" musings above never really generated any discussion or response. I guess technically I didn't ask any questions, but if it wasn't clear... I gained some cold feet when thinking about what it would take to implement efficiently (both in the sense of "seems tricky for a user" and "if this is the trick, maybe we should just do it."
Ditto for count()
, though to a much lesser extent (in part because it's simpler to write, in part because I think it's less frequently used).
reverse()
and find()
I don't have any qualms about since they're so 1D-specific.
Hi Brad, what do you suggest we do at this point? Certainly I can add find
and count
back in, perhaps marking them as unstable.
My argument for keeping them removed would be:
Anyway here's what I think the most viable options are:
find
into the Array datatypefind
and count
into the Array datatypefind
(and I suppose count
too) in some other module, say ArrayUtils
If we reintroduce find
we have the option of:
If we do reintroduce any functions to the Array type then we'd have the option:
I'm not sure what we should do, mostly because I don't know whether the general sense on the team is "No, we should definitely be getting rid of all of these" or whether people were fond of some and not others, and I/we (by some definition of "we") railroaded them into rejecting them in the last module review where we discussed them.
I don't feel as though these routines need to be replaced or the PR reversed immediately or anything like that (deprecation isn't removal after all), but after our exchanges and fact-finding on this issue, I did come away feeling like it was less obvious what we should do, and therefore likely worthy of more discussion.
I don't think we should re-introduce find() with its old implementation in any case.
Part of me agrees with the "slim interface" concept, but part of me also thinks of arrays as being somewhat string-like at times and/or strings as being somewhat array-like at times, so when there are things like .find() that seem like they would make sense to support uniformly across types rather than using a built-in method in one case and a module-based standalone function in another. I also worry a bit about creating an ArrayUtils module if it had only 1-2 things in it.
As a starting point, @e-kayrakli works a lot with strings and arrays, so I'd be curious where he stood on the spectrum of arrays should definitely vs. definitely not support .find()
(just to pick a starting case).
I don't feel as though these routines need to be replaced or the PR reversed immediately or anything like that (deprecation isn't removal after all), but after our exchanges and fact-finding on this issue, I did come away feeling like it was less obvious what we should do, and therefore likely worthy of more discussion.
Fair enough, I'll bring this to discussion. My only worry would be if we're left without a consensus before the next release, things get left deprecated by default, and then later we decide to un-deprecate it -- I'd rather not make things so unpredictable to our end users. Anyway I suppose if that becomes imminent I can flip to unstable
but I expect we'll have converged on a decision by that time anyway.
My only worry would be if we're left without a consensus before the next release, things get left deprecated by default, and then later we decide to un-deprecate it -- I'd rather not make things so unpredictable to our end users.
That's why I brought it up as soon as I saw that this had been merged. :)
As a starting point, @e-kayrakli works a lot with strings and arrays, so I'd be curious where he stood on the spectrum of arrays should definitely vs. definitely not support .find() (just to pick a starting case).
I am on the "definitely not" camp.
I see your point about arrays and strings being closer to one another, but it is not enough to convince me that we need a find
on arrays as a method. This is somewhat subjective, but I don't view strings and arrays to be that close. In the past, we removed list-like functionality from arrays due to various reasons and arrays are more list-like than string-like to me.
I agree with Michael that we have better places for all of these functions (Search and Sort modules)... except for reverse
. I am not sure what to do with it. But I can comfortably call it "let's remove until someone asks for that functionality", without looking at tests for potential use cases.
One counter-argument to keeping them (I can't see it being brought up on a quick look) is that efficient implementation of some/all of these could be dependent on how the array is implemented. A method would be the most natural way of adding these functionalities without exposing too much about the array interface. But I think it is a good exercise for us as language designers to come up with strategies that can allow efficient implementations by third parties by means of nice array interface and efficient tools (like reduce
). There's always a next application that requires a more esoteric function to be run on arrays efficiently, after all. It is hard for me to gauge where we should draw the line if we start adding (read "keep supporting") methods like these to the array interface.
We discussed whether or not find
and count
should be deprecated on the Array
module in the module review meeting today (7/7/2022).
In attendance: @e-kayrakli, @daviditen, @brandonneth, @mstrout, @ronawho, @ShreyasKhandekar, @DanilaFe
Other people who come to mind that may be interested include: @bradcray, @mppf, @lydia-duncan
Here's my summary of what we discussed. Please speak up if you think I'm missing something important, misremembering something, or have other comments.
Michelle was concerned that these functions would require that datatypes stored in Arrays to support the ==
operator (since find and count rely on it). However, this concern was alleviated by the fact that failing to implement ==
(on a datatype stored in an array) would lead to errors at compile (not runtime) and would only occur if the user actually calls .find()
or count()
.
Others (I believe Daniel, Brandon, and Shreyas) voiced that the liked having the convenience of functions like this around (relating it to experiences in other languages). There's a concern that it might not be obvious how to write these reductions (especially to new users) and it might be easier to read code that calls out to named functions rather than using the reductions directly.
Daniel also voiced a concern that if we put this in another module it would be especially inconvenient if this module weren't part of the standard library (and was required to be installed, perhaps by package manager). Our current working proposal (I go into details about this later in this post) is to put find
and count
in the sort
module. sort
is a package module meaning it's possible in the future we might intend for users to have to explicitly install these package via Mason. If that happens it seems possible (likely?) that the search
module would be made a standard module, however.
Counter to this others (Engin and me) were concerned that this could lead to a "slippery slope" of what to include and not include in Arrays.
Elliot brought up the point that if the find
and count
function exist on other fundamental datatypes (such as map
or list
) we'd probably want to include them for consistency. If we did this we might also want these methods to have a consistent interface in terms of return values and arguments.
Looking at documentation I do see find
and count
on List
but not Heap
, Map
, or Set
. Count
is also on String
as is Find
although the interface is difference (Array's count returns a tuple whereas String returns the value -1 if it can't find). String also includes rfind
which is not on any of these other datatypes. @ronawho, given that I didn't look this up during the meeting would change your vote (if find
/count
continues to exist in the List
datatype)?
I held a vote to see people wanted to move it available a separate package. Here are the results:
find / count should stay on Array datatype (no votes)
find / count should be deprecated (and we don't need to supply them today) (2 votes)
find / count should be removed from array datatype but we need to have them elsewhere (6 votes)
One argument for moving them to a new package is that doing so would allow us to leave these functions unstable for Chapel 2.0 and thus could later change their interface (or bring them back to the Array datatype if we heard demand for it). This argument swayed the opinion of some of the folks who otherwise would have wanted them to stay on the Array datatype.
The two votes to deprecate said they would also be ok moving them to a separate module so I think this is the working consensus.
As far as where to put it:
Engin suggested adding this to the existing search
module. I didn't hear any objections to this. There was also a question of if these should be tertiary or free functions. I suggested we keep them as free functions (like others in the module) and didn't hear any objects to that.
The current working proposal is:
These functions should be reintroduced as free functions into the sort
module.
EDIT: These should go to the search
module (not sort
)
We should also:
Consider if we should remove find
and count
from List (pinging @e-kayrakli on this since you did the module review for List).
Note that count
shouldn't matter on Set
, since I believe Sets can only store each value once? Count would maybe be interesting on Map w.r.t. the values stored, but would probably not be at all cheap to computer
Our current working proposal (I go into details about this later in this post) is to put find and count in the sort module. sort is a package module meaning it's possible in the future we might intend for users to have to explicitly install these package via Mason. If that happens it seems possible (likely?) that the Sort module would be made a standard module, however.
Search
module, not Sort
.
Consider if we should remove find and count from List (pinging @e-kayrakli on this since you did the module review for List).
This is the part that can sway me towards having these in the array type a bit. I don't think we should remove them from the data structures. If anything, we should add them to others that have them missing if it has a sensible meaning for the specific structure. Even with that thinking, I am reluctant to keep these in the array interface.
Note that count shouldn't matter on Set, since I believe Sets can only store each value once? Count would maybe be interesting on Map w.r.t. the values stored, but would probably not be at all cheap to computer
std::set
has count
, AFAIK. And returns 1 or 0. I believe this has more to do with the ability to swap data structures (I doubt C++ STL does a good job of that), or having familiar interfaces.
String also includes rfind which is not on any of these other datatypes.
This is a good argument against trying to draw parallels between array and string/bytes interfaces. Another point is that find
/count
on string/bytes does substring counting and finding, as well. I think that justifies having them on those types, but I don't think any of us wants to do sub-array finding/counting on Chapel arrays as part of the built-in interface. And this also ties in to the "slippery slope" argument. If we have a find
for an element of the array, why not an overload that takes another array as an argument and do a sub-array find?
Search module, not Sort.
That's a relief. Reading the comments in order, I couldn't really think of what would cause me to look for these in Sort. :)
Assuming we keep these in some form, I think my preference continues to be to define them on the array type itself (as primary or tertiary methods) rather than standalone functions, even if there is an asymmetry with the details of what find()
does on string, bytes, or list, because I think it results in code that's more pleasing looking. And I'd probably keep them in ChapelArray from the perspective of convenience (no need to rely on another module) and a sense that these few more functions feel not so different from supporting .contains()
on ranges or domains, count
/find
on strings/bytes, or other simple membership-style queries.
Philosophically, I think arrays are a very powerful, core type in Chapel, and it's not obvious to me why they shouldn't support these queries. Or to turn that argument around, why wouldn't we put all count/find capabilities into Search?
All that said, I could live with what the majority voted on above, it's just not what I would choose.
One argument for moving them to a new package is that doing so would allow us to leave these functions unstable for Chapel 2.0
I don't think this quite follows. We can definitely mark things as unstable within the internal / standard modules for Chapel 2.0 as well. They don't have to be in a package module to be considered unstable.
I don't think this quite follows. We can definitely mark things as unstable within the internal / standard modules for Chapel 2.0 as well. They don't have to be in a package module to be considered unstable.
Good point, I might be misrepresenting / misremembering what the argument was.
I'd say despite that it's still worth minimizing the number of things marked unstable in a module we're trying to stabilize.
Rather than having individuals functions in a module unstable, it would be a lot easier if we could tell users that the Array interface is stabilized but certain modules (such as sort
) are not. Although it looks like we've already have Array.equals() marked unstable so I guess we're past that being a possibility.
(such as sort)
Just to make sure we're on the same page—are we really talking about the Sort
module, or is it Search
as Engin said above? (I still don't see anything about count/find that feels sort-like, but search-like makes sense to me).
I'd say despite that it's still worth minimizing the number of things marked unstable in a module we're trying to stabilize.
I sort of agree with this, but not completely. In part in this case because we don't really expose ChapelArray as a module to the user even though it's implemented as one. We could potentially substitute "everything in the Arrays chapter is stable" except that we've already agreed that sparse arrays are not on deck for stabilization for 2.0 due to lack of time, so even discounting equals() (which is admittedly a bit of an oddball), I think we're far from accomplishing that, at least without breaking sparse arrays into their own chapter (which we could also do).
But I mostly want to come back to a question that I didn't intend to be rhetorical. Restating it: Let's say that I read the notes from the meeting and thought "This is exactly the right path! In fact, we should move string.find()
, string.rfind()
, string.count()
to Search
as well, and make them standalone routines rather than methods given that they're all forms of searching and would make the string type slimmer (and maybe also startsWith
and endsWith
since they're search-like? and/or replace
, since find and replace are usually adjacent in application contexts?). Would your and Engin's reactions be "Yes, we're all in for that as well!" or not? And if not, why not?
I'll stop trying to equate strings and arrays since that didn't really fly for Engin, but I'm not seeing why we wouldn't either (a) put all find/count routines into a single module with a single call style or (b) define them as consistently as possible as methods on the types for which they make sense rather than defining some of them one way in one place and others in another way in another place.
I don't think the argument can be bloat: To my count strings have ~50 documented methods on them to arrays' ~30. And I think of arrays as being a far more core/crucial/unique type to Chapel than strings, so arguably deserving of appropriately designed features.
I also don't buy the slippery slope argument that if we had a single-element array.find(), we would need to have a multi-element array.find() for consistency. We can just decide not to do that if we don't find it compelling / part of our vision for arrays, or decide to add it if we did.
[Note that I probably sound far more zealous about this than I probably really am, but it stood out for me that the three newest people to Chapel in the meeting—Daniel, Brandon, and Shreyas—found the current placement the most intuitive and productive, which gave me significant pause since arguably we're trying to appeal to new users. And makes me want to argue their side to make sure they're not getting written off due to being newer to the project.]
Just to make sure we're on the same page—are we really talking about the Sort module, or is it Search as Engin said above?
Yes, sorry, don't know why I keep mixing those up (though I suppose sort
is also a package module.)
In fact, we should move string.find(), string.rfind(), string.count() to Search as well, and make them standalone routines rather than methods given that they're all forms of searching and would make the string type slimmer (and maybe also startsWith and endsWith since they're search-like? and/or replace, since find and replace are usually adjacent in application contexts?). Would your and Engin's reactions be "Yes, we're all in for that as well!" or not? And if not, why not?
Yes, to be consistent I would have to be for that: add find
, rfind
, and count
to the search
module and ensure they work for all our fundamental datatypes.
Am I enthusiastically saying I'm "all for that!". Well, to be frank, with most of these design decisions I don't care so much if we choose to "paint the car red" or "paint it blue" so much as ensuring whatever decision we make gets applied consistently. I care more about consistency / predictability than the specifics of our organization scheme.
So what should our vision be then?
If it's "we want fundamental datatypes to have every tool in the toolbox ready for the user to use, and not 'hidden' behind some module" then we should add rfind
to array and we need to start some effort to ensure these methods have consistent interfaces. I'd say we also remove the sort
module and add .sorted
back to array and add it to list
, string
, etc.?
If it's "we want fundamental datatypes to be lean and have extra functionality appropriately ' then we should start trying to make cuts to other datatypes like string
, list
, map
, etc.
If it's "we don't know what we want" then my inclination is to be as conservative as possible and try and minimize what we put in Chapel 2.0. I want to avoid having fundamental datatypes with a lot of functions marked unstable so separating things out into modules we can call "more experimental" makes sense.
I'll stop trying to equate strings and arrays since that didn't really fly for Engin, but I'm not seeing why we wouldn't either (a) put all find/count routines into a single module with a single call style or (b) define them as consistently as possible as methods on the types for which they make sense rather than defining some of them one way in one place and others in another way in another place.
Ok, sounds like we're saying the same thing. Forced to decide between (a) and (b) my vote for would be for (a). Why? My design sensibilities (at least of today) lean towards classes being small and simple.
Why? Well, if you want me to get really self reflective about it I'd probably say it has to do with a combination of past experience having to maintain code with classes that aren't like that and being influenced by books / blogs / code review comments / the like about that how classes should be simple and have single responsibilities.
I don't think the argument can be bloat: To my count strings have ~50 documented methods on them to arrays' ~30. And I think of arrays as being a far more core/crucial/unique type to Chapel than strings, so arguably deserving of appropriately designed features.
But maybe that means strings are too bloated? In terms of arrays deserving more "appropriately designed features", I'd agree with that, but I don't think "appropriately designed" necessarily means "available as methods on" unless we decide that's a design decision we want to consistently apply.
Note that I probably sound far more zealous about this than I probably really am, but it stood out for me that the three newest people to Chapel in the meeting—Daniel, Brandon, and Shreyas—found the current placement the most intuitive and productive, which gave me significant pause since arguably we're trying to appeal to new users. And makes me want to argue their side to make sure they're not getting written off due to being newer to the project
I think that's a fair point. Especially if our vision is that we want Chapel to be "as easy to use and with the same 'Batteries included' philosophy as Python" but with a focus on HPC.
I think their concerns had more to do with not deprecating and removing these functions more than their exact placement, but I don't want to be speaking on their behalf (so pinging @DanilaFe, @brandonneth, @ShreyasKhandekar in case they want to speak up about it).
--
If you'll excuse me getting a little ranty at the end here:
One thing I find frustrating with these design reviews is it seems like we don't have a clear vision for what we want. It's not that I think we need a top-down approach to defining that vision but rather I want something written down. I suppose we have: https://github.com/Cray/chapel-private/issues/1478 and https://github.com/chapel-lang/chapel/blob/main/doc/rst/developer/bestPractices/StandardModuleStyle.rst but they're not being actively updated and referenced and the "guidelines" document is more a list of questions / things to consider than prescriptive advice.
There's a lot of cliches about group-led design leading to inconsistent results (think: "too many chefs spoil the broth" or "a camel is a horse designed by committee"). I don't bring this up to be harsh or pessimistic or say that I think that Chapel is poorly designed (or that Camels are for that matter :-)) but rather to ask if in the absence of having some kind of codified guidelines / vision you feel comfortable that we're guarding against the sort of inconsistent design that these cliches are used to critique.
[Note that I probably sound far more zealous about this than I probably really am, but it stood out for me that the three newest people to Chapel in the meeting—Daniel, Brandon, and Shreyas—found the current placement the most intuitive and productive, which gave me significant pause since arguably we're trying to appeal to new users. And makes me want to argue their side to make sure they're not getting written off due to being newer to the project.]
First, I'll just second this to be sure. I am against keeping these in the array interface. But if my disagreement sounds or sounded like ignoring your thoughts, that's definitely not my intention. TBH, it is more valuable that you are new to the project and our bubble that some of us have been living in for years.
And before responding to Brad's other points let me re-state that Elliot's point about other data structures and find
/count
being primary methods on them and not on arrays (potentially) have been making me think harder lately. And I can admit that I am not thinking as strongly as I was before.
To the other points:
But I mostly want to come back to a question that I didn't intend to be rhetorical. Restating it: Let's say that I read the notes from the meeting and thought "This is exactly the right path! In fact, we should move string.find(), string.rfind(), string.count() to Search as well, and make them standalone routines rather than methods given that they're all forms of searching and would make the string type slimmer (and maybe also startsWith and endsWith since they're search-like? and/or replace, since find and replace are usually adjacent in application contexts?). Would your and Engin's reactions be "Yes, we're all in for that as well!" or not? And if not, why not?
I find find
et al to be a more crucial feature for strings than they are for arrays. However, trying to be more introspective, I think this is because I am more of a C/C++ programmer where arrays are just a blob of memory that you can index into, and not really a python, ruby, rust etc etc programmer, where data types have more functionality. Though, again, this is more introspection than a serious argument. If anything, we probably want to place ourselves closer to the second family of languages there.
Another point of introspection for me is this thought experiment: if we were to jam these methods in another internal module like ChapelArrayUtils
, which is public use
d from ChapelArray
, I'd be more comfortable with keeping them in the interface. This is clearly a question of code organization and not interface design. But, part of my objection is probably seeing these "high-level" functionality on a "low-level" module. Which, again, is clearly not an argument for this discussion.
I don't think the argument can be bloat: To my count strings have ~50 documented methods on them to arrays' ~30. And I think of arrays as being a far more core/crucial/unique type to Chapel than strings, so arguably deserving of appropriately designed features.
I also don't buy the slippery slope argument that if we had a single-element array.find(), we would need to have a multi-element array.find() for consistency. We can just decide not to do that if we don't find it compelling / part of our vision for arrays, or decide to add it if we did.
I think these are related. Your arguments don't elate my worries too much. If we have find
on arrays, arguably there might be a new method down the road that we might want to consider adding. In such scenarios, it is not that hard for me to imagine that this new method would be something that we want to have in our standard libraries instead of the array interface as a primary method because it doesn't feel as crucial as find
but yet important for multiple users etc. Then we'll have to draw a line somewhere, and my thinking is, the line is clearest if we call it "no algorithmic utilities on array interface".
As for the bloat, I don't think it's about number of methods. It is difficult to imagine the complexity it would take for implementing these methods on the array interface. Granted the methods we're discussing right now are one-liners, but will the next one be as concise?
To Andy's points (which came in as I was writing my comment above): I am probably not as aggressive as you are in terms of moving everything in other utility modules. As for the lack of guidelines: part of the 2.0 effort is (or to me, should be) to come up with those guidelines. I don't think we can come up with good guidelines without going through these sometimes painful discussions. But how much are we trying to draw general conclusions from these discussions to flesh out the guidelines more, I do not know. Probably something we can improve somehow.
[Sorry in advance that this is so long]
Let me start with Andy's ranty bit and turn it into a question: I find the design process about as frustrating as anybody, particularly since I tend to need to get involved in most of the design discussions whether or not I'm leading them, but it's also not clear to me how to improve it (and I hope I've been clear that I'm open to doing so if we have some good ideas). E.g., it's pretty easy for me to imagine establishing a convention like "class types in the standard libraries should start with capital letters" but much harder for me to imagine delineating guidelines that would answer a question like "should a given type support a given feature by default?" without the need for discussion. Since you have experience with other projects and groups that I don't, do you have insights into things we ought to be doing here to address your concern, but are not?
(I also agree with, and appreciate, Engin's point that part of the 2.0 effort is intended to establish better for ourselves what these guidelines are. But even then, I'm unclear on how we would codify them to cover future design questions that might come up).
And then to add my own rant: Part of my frustration on this issue is that I'd intended my comment on June 2 to be read pretty clearly as "Whoa, wait, I'm getting cold feet on this... are we going too far?" But then rather than generating further discussion or consensus-building, the next action I was aware of was the PR that did the deprecation. Maybe I just didn't voice my concerns clearly enough to make it clear that they were actually concerns and not just me sitting around playing devil's advocate? And I suppose it could be argued that I was being inefficient in the first place by second-guessing an intention that we'd discussed and agreed upon in a deep-dive, but I also think it's unrealistic to assume that we'll always make the right decision when something is discussed for the first time (to my memory) before investigation is done.
Returning to the technical discussion:
Trying to wrestle more with what makes me wonder whether we should be keeping rather than deprecating _array.find() and/or .count, here are some thoughts that occur to me, some of which are likely to be repeats:
I suppose these bullets could be interpreted as starting to develop a (personal) philosophy that could serve as future guidelines, but in many respects they seem pretty specific to arrays. For example, I'm not sure that from them we could make better decisions about what locales or bigints or heaps should support.
I appreciate that you agree that treating these methods as similarly as possible on arrays and strings makes sense [Andy, at least. I guess Engin didn't seem convinced]. If we agreed as a team to do that, it would make this decision much easier for me because I wouldn't want to have to use a module like that to get find() on strings, nor would I want find() on strings to turn into a standalone, rather than method-based routine. So it would break my personal uncertainty on _array.find() by raising the spectre of making the string case so much less attractive that I'd definitely vote hard for "Let's just agree to support .find on arrays and strings and lists and other appropriate collections by default then." Maybe this suggests a next step here: To see whether the team:
(another possible next step is to make it obvious that I'm in a distinct and tiny minority and wasting everyone's time by belaboring this conversation... but the readout from the discussion and Andy's reaction to the string suggestion suggests to me that it's not quite that bleak).
Having typed all this, I'm realizing that part of my reason for not wanting a Find
or Search
module that defines list.find
, array.find
, string.find
, etc. is that I feel as though for many of these types, an efficient implementation would likely want to interact with the type's internal storage/structure rather than being implemented through its public interfaces (I think it's definitely the case for strings, somewhat the case for maps, much less the case for arrays and possibly bytes). Ironically, it was the perceived simplicity of implementing these using public interfaces on arrays that motivated us to consider deprecating them in the first place, and I'm not quite sure what to make of that other than "Maybe arrays are so structurally simple and/or tuned via their public interfaces that it makes more sense of them than others?" So maybe a principle here is "If the routine is likely to want to access the type's internal structure to get an efficient/parallel implementation, it should be part of the type?"
There also seems to be a challenge to "put all find() methods into a Find/Search module" philosophy in that it's not extendsible to new collection types created by a user.
One other observation would be that if we had interfaces implemented, perhaps we'd think of these things in terms of being part of a "Findable()" interface and then think about whether a given type does or does not fulfill that interface. And then define the set of possible interfaces that our built-in and/or standard types will support and those that are definitely off the table (e.g., Sortable).
I also appreciate your introspection about why you're inclined to move them out. I think my rebuttal there would be that while, as Chapel implementers, we see these things as classes or objects because we're building them, to the user they're more like intrinsic types, so it gets into a question of whether it's better/nicer/more convenient to have feature-rich built-in types or feature-poor types. Put another way, for this question I think we should focus most on what's attractive/useful/productive to users rather than what's easiest for us when defining the built-in types from a SW engineering perspective. The fact that (I think) these routines have fairly straightforward implementations also makes the code creep issue seem less problematic to me (whereas sorted
required dragging in the whole Sort
module which seems like an obvious red flag).
if we were to jam these methods in another internal module like ChapelArrayUtils, which is public used from ChapelArray
I don't like this approach better than the extremes. As a user it means that I still get the feature by default, but don't see it documented alongside other array capabilities when I'm reading the language spec (or, if I do, then it completely doesn't matter to me as a user so seems no better than simply putting it in ChapelArray). And in either case, as a developer, I'd similarly rather see all of the "available by default" methods on arrays defined in a single place rather than learning which file to look in for which features. If we did unify the documentation, it's another hassle to wrestle with making happen; and if we didn't, it's a slightly clunky module in a list that I'd prefer not to have/start.
Your comments generally resonate with me and as I was mentioning before, I am much closer to the fence (not to imply that you should try harder to convince me, I just need to mull on it. And I wouldn't hold progress on this just for my sake.)
Having typed all this, I'm realizing that part of my reason for not wanting a Find or Search module that defines list.find, array.find, string.find, etc. is that I feel as though for many of these types, an efficient implementation would likely want to interact with the type's internal storage/structure rather than being implemented through its public interfaces (I think it's definitely the case for strings, somewhat the case for maps, much less the case for arrays and possibly bytes). Ironically, it was the perceived simplicity of implementing these using public interfaces on arrays that motivated us to consider deprecating them in the first place, and I'm not quite sure what to make of that other than "Maybe arrays are so structurally simple and/or tuned via their public interfaces that it makes more sense of them than others?" So maybe a principle here is "If the routine is likely to want to access the type's internal structure to get an efficient/parallel implementation, it should be part of the type?"
This implies that we should also extend the optional doi*
interface to include doiFind
, doiCount
and whatever we may want to add to the array interface. Which, when I think about it, is just a natural part of having these functions as part of the array interface.
Starting from the extremes, things that seem more structural seem like obvious inclusions: .size, .dim(), and friends feel like they should clearly belong on the array type rather than being available via a distinct module
Agreed. Querying size seems like asking a fundamental question about the array itself that almost any large Chapel program is going to have to do several times. I suppose one could make the argument that this is really a property of the domain but querying size is such a common thing to do that forcing users to go through an extra step of querying domain seems needlessly.
reverse() isn't quite as algorithmic, but seems very specific to 1D (and possibly dense?) arrays, which is a red flag for me in that I'd prefer most automatic features on arrays to work on most array types (where obvious exceptions would be cases that are specific to the structural nature of only a given array type, like a "number of nonzeroes" query on a sparse array).
That convinces me that in the short term we should deprecate reverse (or at least mark it unstable). I don't think reverse()
necessarily has to be a 1D operation: exactly what it should mean for N dimensions is questionable (I'd say whatever arrangement would cause the iteration order to reverse after applying it).
find() and count() are definitely in a grey-er area for me, as this discussion has proven. On one hand, they're not so intrinsically associated with the structure of the array that not having them impairs your use of the array. Yet they are arguably a bit related to the structure in the sense of "Does this collection store this value at all? Does it store multiple of them?" They also don't seem like particularly deep concepts and like the sort of thing you'd ideally want on any given collection type. They seem clearly rank-neutral which lets them pass that test.
I kind of like leaving count out just because + reduce (A == needle)
seems like such an elegant way of expressing the computation to me. I'd advocate whatever decision we make re: count for Arrays we do the same for Strings.
The reduction for find maxloc reduce zip(A == needle, A.domain)
seems a lot less obvious so I think providing an implementation somewhere makes sense. Our previous consensus was putting it in the search
module and if you asked me to decide that's still where I'd put it but if the winds have changed on this I'm willing to concede.
One advantage I see to making it a free function in the search
module rather than a method is that opens up the possibility for us to make this function generic in the sense that it would work on any new datatype a user creates (assuming it fulfills some concept such as being iterable). Of course we could have both a generic version in search
and have methods on our datatypes (which might just call out to the version in search
anyway) but if users get in the habit of using using a method version of .find
they'll be less likely do discover the free function version.
Maybe this suggests a next step here: To see whether the team: thinks these should be treated similarly across types or not (and if not, by what rationale) has a preference between keeping them on the types vs. moving them into a helper module if "similarly" wins the prior vote
Sure, I can try and drive this forward. My inclination at this point is to try and keep the conversation online rather than "in person" (so to speak), I can try and summarize do the polling thing.
Just to repeat my sense of where things stand (let me know if you disagree):
As I see it right now it seems like we have consensus that reverse
should be removed. It also seems like folks are happy with deprecating .sort()
so long as the sorted
iterator in the Sort
module.
Of course people are always free to comment, but I think the polls I'd bring up are:
If you were the only decision maker what would you do with find
: (1) make it a methods in Array, (2) make it free functions in the Search
module, or (3) remove it altogether
If you were the only decision maker what would you do with count
: (1) make it a method in Array, (2) make it free functions in the Search
module, or (3) remove it altogether
If we were to make find
a free function in the Search
module instead of a method on Array (for Chapel 2.0) would you feel that we were making a mistake? yes, no, or I'm not sure
If we were to make count
a free function in the Search
module instead of a method on Array (for Chapel 2.0) would you feel that we were making a mistake? yes, no, or I'm not sure
If we get everyone to vote the same on the last two polls I'd consider that consensus (not unanimity, but I don't think we aiming for unanimity, right?) Do you feel this is the right approach to take or do you think I'm being too leading with these last two poll questions?
There also seems to be a challenge to "put all find() methods into a Find/Search module" philosophy in that it's not extendsible to new collection types created by a user.
I guess I just made the opposite argument above. It seems like for things like find
and count
it's pretty easy to write a generic implementation that works on any iterable datatype. Sure that generic implementation may not be the most efficient but users can always override that generic function if/when efficiency becomes a concern.
I guess your argument is in order to write that efficient implementation users would need to expose the private members of their class? I don't have a good sense of if that would be the case more often than not. I suppose if Chapel were C++ I'd say: write a private findImpl()
method for your datatype, have the public find()
function friend
that method and call out to it. I'm not sure if we want to have a Chapel equivalent to the friend
keyword though.
Put another way, for this question I think we should focus most on what's attractive/useful/productive to users rather than what's easiest for us when defining the built-in types from a SW engineering perspective
I'm not sure these things are separable as you're suggesting. Of course I agree we should be user focused. But our users are programmers who are going to use and extend these datatypes too and given that it seems like whatever arguments one can make for what's right from a "SW engineering perspective" would be applicable to our users as well. Of course what's considered right form "SW engineering perspective" is a hugely debatable topic in itself.
While looking at the docs @arezaii noticed that my PR didn't include a workaround for reverse
in the deprecation message. Assuming we decide to keep this deprecated should probably update it to include a workaround (I think just add a loop through the array that goes in reverse order).
All the other removed functions have workarounds in their messages (as can be seen in this .good
file for this test: https://github.com/chapel-lang/chapel/blob/main/test/deprecated/deprecatedArrayFunctions.good)
I don't know that putting a workaround in the deprecation message is strictly required. If a user is relying on it and doesn't know where to turn, I'm sure we'll hear from them (and the deprecation message would disappear in a release or so anyway).
From Engin's last comment:
This implies that we should also extend the optional doi* interface to include doiFind, doiCount and whatever we may want to add to the array interface.
I think we definitely could do that, but don't know that we need to do it, or should do it before we do have a clear need. Specifically, I think find and count can be written plenty efficiently using the current dsi interface since the array is already telling us things like how to iterate over it optimally, and that's all that's required to do these ops well. So while I wouldn't say "never!", and the doi interface can always be extended, I also don't see a significant call for it at present.
Rethinking the comment that led to this, however, maybe I'm wrong that we could write a faster find/count on strings using the internal implementation than a user/standalone library routine code? We might just need to do a zippered iteration over the strings codepoints and 0..<size, whereas I may have only been thinking of using just the latter and random access when I wrote that (?).
Switching to an Andy comment I never replied to:
I kind of like leaving count out just because + reduce (A == needle) seems like such an elegant way of expressing the computation to me.
I agree it's elegant, and I definitely find it attractive, but it does require a slightly greater level of expertise in Chapel to write and know to write, relying on both promotion and reductions. While the deprecation warning is there, users will have their hand held, but I don't think we expect to keep the deprecation warning for all time. In contrast .count()
is something that any new-ish programmer might go looking for and know how to use.
I'd advocate whatever decision we make re: count for Arrays we do the same for Strings.
I like that, but strings don't have a similarly elegant and concise way of writing count, though, do they? This definitely relates to why I'm advocating for a count method.
Our previous consensus was putting it in the search module
"Our" here refers to a subgroup of people who met to discuss this, not the team as a whole, though, right?
Of course people are always free to comment, but I think the polls I'd bring up are:
I think your poll questions need to bring strings into the mix given that treating them symmetrically seemed to have resonance for both you and Engin. For example:
Q: should find()
be:
(a) a method on strings but a standalone routine in the Search
module for arrays ("the previous consensus")
(b) a method on strings but not supported for arrays at all ("the status quo")
(c) a method on both strings and arrays, defined where the types are
(d) a standalone routine for both strings and arrays defined in the Search
module
(e) something else
?
I also feel like in your list of four questions, the second pair are redundant with the first. I.e., if you want to see how strongly people feel about an option, have them rate each on a scale of 1-7 or use approval voting rather than asking which they want and then asking how much they do or don't like the one you're advocating for.
Sure that generic implementation may not be the most efficient but users can always override that generic function if/when efficiency becomes a concern. ... I guess your argument is in order to write that efficient implementation users would need to expose the private members of their class?
Right, after your first paragraph, I'd written: Users can't really override the generic function without adding a method to the type, can they? So type authors could add a method to their type to provide more efficient implementations, but clients of a type can't?
Hi @bradcray,
We're approaching our code freeze deadline and given that we don't have consensus on having these functions deprecated I'd like to move these functions to be 'unstable' instead.
Alternatively I could just reverse the deprecation as a stopgap measure if you feel that would be better.
This isn't to say I'll be closing this issue; I am planning on conducting a poll to see how people feel about this in relation to strings as you suggest. I just don't want to leave these deprecated as part of the release unless we're sure that's how they're going to stay.
I'd like to move these functions to be 'unstable' instead.
Alternatively I could just reverse the deprecation as a stopgap measure if you feel that would be better.
Either of those would be fine with me and preferable to deprecating them at this point.
I am planning on conducting a poll to see how people feel about this in relation to strings as you suggest.
:+1:
On 10/11/2022 we had a re-review meeting to discuss this issue. Coming out of it we agreed to the following:
We're going to deprecate: sorted
and reverse
on Array: nobody spoke up in favor of keeping these.
We're going to keep (as unstable): find
and count
in the Array datatype. We don't intend to remove or move these out of array but the exact interface may change.
There are inconsistencies in the interface for find
and count
across the array
, bytes
, string
, and list
datatypes. So we'll leave find
and count
as unstable for now with the goal of reevaluating these interfaces and seeing if we want to make them more consistent. The intent is not to have these left as unstable past 2.0: the functionality will be there, and will be accessible as a method on the array datatype, but the exact interface may change between now and then.
While working on my recent reverse-complement implementation in the evenings (#21329), I was shocked at how hard it was to write a Chapel-level find()
routine that could compete with C's memchr()
(for arrays of 8-bit values), and what a big impact this seemingly innocuous input post-processing step had on the overall execution time of what I think of as a fairly complex computation. For example, when piping the output to /dev/null
, Elliot saw improvements from ~1.03–1.09s for the best Chapel-based implementations to ~0.82s for a memchr-based version. And in nightly testing, when capturing the output, last night we saw a difference of ~1.94s to ~1.78s (comparing gcc8 transliteration to gcc8 translit (memchr) on this graph):
These experiences put me more firmly in the "array's should support find()
out of the box" camp (not that we're debating that anymore), as well as that their interfaces should be more similar to that of .find
on string
/bytes
(which is to say, should return the index, where I'd have it return an OOB index of some sort when the find fails).
When Engin originally mentioned the doiFind()
addition to the domain map interface, I was skeptical that it was necessary, but now I'm much more open to it, having failed to write an implementation that could compete with memchr()
(suggesting that a great implementation might use memchr()
on each locale for a distributed array, or in parallel on each locale to be even faster). As an implementation path, my thought is that we should:
It would be cool to add a revcomp version that uses array.find() and then see it improve over time (but it'd be more motivating to do this after stabilizing the interface).
After running into a desire for array.find() in my recent CLBG push, I prototyped a version of the new interface this weekend and kicked off a design issue about its interface here: https://github.com/chapel-lang/chapel/issues/21423
I think I can close this issue as we've implemented the decisions listed here:
https://github.com/chapel-lang/chapel/issues/18089#issuecomment-1276500706
As well as updated the interface of find
in this PR: https://github.com/chapel-lang/chapel/pull/21429
@stonea : Do we have an issue about what needs to be done to make count()
stable? (that's the main reason I didn't have #21429 close this issue)
@stonea : Do we have an issue about what needs to be done to make count() stable? (that's the main reason I didn't have https://github.com/chapel-lang/chapel/pull/21429 close this issue)
Good point. I've created an issue for that here: https://github.com/chapel-lang/chapel/issues/21730
Thank you!
This is a followup item for the Array module review.
One item we wanted further discussion on is:
These functions (remove sorted, reverse, find, and count) seem like nice utilities to have that operate on arrays but are not a fundamental part of the array type themselves. Should we remove these functions? Should we put them in some ArrayUtils or other module?