chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.79k stars 420 forks source link

Index-neutral programming: indices #15024

Closed bradcray closed 4 years ago

bradcray commented 4 years ago

[edited to remove .lowIdx and .highIdx from the proposal for simplicity. In the comments, it was argued that such queries can be made from .indices when applicable and that this keeps things simpler / more orthogonal; we can always reconsider later if compelling arguments for .lowIdx and .highIdx come up].

As a Chapel programmer, I'd like to see most standard collection types support the method~s~ .indices ~, .lowIdx, and .highIdx~ in order to support programming that's neutral w.r.t. whether 0- vs. 1-based vs. something-else-indexing is used by the type. For example, these would permit me to write

for i in data.indices do
  ...data(i)...  // or ...data[i]...

for many collection types.

As a starting point, it would be nice to support these for tuples, strings, bytes, lists, and arrays.

[edit: and enums?]

~Interestingly, I think the implementations and possible uses of these are likely to vary a bit between types.~ I would expect tuples, strings, bytes, and lists (edit: and enums) can return a range for indices whereas ~rectangular~ arrays would likely return their domains ~and associative arrays would implement indices as an iterator~.

~Similarly, lowIdx can be param for tuples, strings, bytes, and lists, but not arrays; and highIdx can be param only for tuples.~

bradcray commented 4 years ago

Note that I chose .lowIdx and .highIdx to avoid the potential ambiguity that .low and .high might refer to values rather than indices. Of course, for ordered collections .indices.low and .indices.high should also work.

e-kayrakli commented 4 years ago

The concept and the names suggested make sense to me.

As a half-baked idea, I am wondering whether firstIdx, lastIdx makes things more general instead of low/high. I am imagining (vaguely) a structure where indices can be in a funny order. In that scenario first/last might make more sense as "this is the index of the first element my default iterator would be yielding". It may also have more clarity for multidimensional indices maybe?

To be clear, I am not objecting to low/high, but more thinking out loud.

dlongnecke-cray commented 4 years ago

The concept and the names suggested make sense to me.

As a half-baked idea, I am wondering whether firstIdx, lastIdx makes things more general instead of low/high. I am imagining (vaguely) a structure where indices can be in a funny order. In that scenario first/last might make more sense as "this is the index of the first element my default iterator would be yielding". It may also have more clarity for multidimensional indices maybe?

To be clear, I am not objecting to low/high, but more thinking out loud.

I would prefer firstIdx and lastIdx instead of lo/hi because we already have first and last for arrays and list, which are both indexible containers.

ben-albrecht commented 4 years ago

If indices is always a range, we could get away with only supporting that, and having users rely on indices.low, indices.high, indices.first, and indices.last.

I don't mind the verbosity of this (and may actually prefer it over lowIdx / highIdx). More importantly, this puts less on the sequential data structure interfaces.

dlongnecke-cray commented 4 years ago

I actually really like that idea @ben-albrecht. I agree that we should avoid adding too many new methods if possible.

bradcray commented 4 years ago

I'm open to the proposal to just support .indices, particularly since it reduces the amount of work I have to do (though... I've already done it :'( ). I think the main case where it hurts a bit could be loops over heterogeneous tuples, which would have to be written:

for param i in myTuple.indices.low..myTuple.indices.high do
  ...myTuple(i)...

due to the need to unroll such loops to have their bodies make sense. But hopefully issue #14929 will simplify that case which will lessen the pain.

mppf commented 4 years ago

I feel fine with .indices, .lowIdx, and .highIdx in spirit but would do two things to it specifically:

That leads to .indices() .lowIndex and .highIndex.

ben-albrecht commented 4 years ago

I think the main case where it hurts a bit could be loops over heterogeneous tuples

I don't personally see this as a big downside, especially since it is a temporary.

bradcray commented 4 years ago

I don't personally see this as a big downside, especially since it is a temporary.

I agree; was just pointing out the only hesitation I could come up with.

bradcray commented 4 years ago

use parens on indices at least if it has the potential to be an iterator that makes sense to me)

When you say it, this rationale makes sense to me as well, though I think it makes code look regrettably a little uglier. I.e.:

for i in myTuple.indices() do

rather than:

for i in myTuple.indices do

maybe I just need to get used to it, though...

bradcray commented 4 years ago

I agree; was just pointing out the only hesitation I could come up with.

Oops, I forgot something a bit more crucial: This:

for param i in myTuple.indices.low..myTuple.indices.high do
  ...myTuple(i)...

doesn't actually work because we don't have param ranges (yet, and won't soon), and the bounds of the heterogeneous tuple's param loop have to be params. So without a .lowIdx or equivalent, loops over heterogeneous tuples would have to continue to hard-code a 0 or 1 into their code until #14929 became available. But I'm still OK leaving these calls behind for now.

bradcray commented 4 years ago

rectangular arrays would likely return their domains and associative arrays would implement indices as an iterator.

On rethinking this, I'm thinking that all arrays could just return their domain... I think I was thinking that .indices would need to be well-ordered in order to return a domain, but now I can't think why that would be...

ben-albrecht commented 4 years ago

use parens on indices at least if it has the potential to be an iterator that makes sense to me)

When you say it, this rationale makes sense to me as well, though I think it makes code look regrettably a little uglier.

I feel similarly, but think parens is the right call if it is an iterator.

What would we do about the map type? Will it have redundant methods map.keys() and map.indices()?

mppf commented 4 years ago

On rethinking this, I'm thinking that all arrays could just return their domain...

In which case I am again OK with .indices and I have a feeling that is what you are going for...

mppf commented 4 years ago

What would we do about the map type? Will it have redundant methods map.keys() and map.indices()?

One could argue that this method/term only makes sense for numeric indices (vs "keys"). Then it might not apply to associative domains/arrays either.

bradcray commented 4 years ago

I am again OK with .indices and I have a feeling that is what you are going for...

Not necessarily, if people prefer the parens. I was mostly just self-code-reviewing and realizing my mis-thought.

damianmoz commented 4 years ago

Are we talking the lower bound and upper bound of an index here?

mppf commented 4 years ago

@damianmoz we might not be using the same terminology but I'm going to try to clarify.

It's not about the index data type so much as about the collection it is addressing.

E.g. for a tuple, right now tuples are indexed starting from 1, but we want to change that to starting from 0 in this release. This issue is about giving programmers a way of writing code for tuples (and strings, bytes, lists, ...) that doesn't require them to make their code use the fact of tuples starting at 0 or 1.

var tup = (10, 20, 30);
writeln(tup(tup.lowIdx)); // always 10, no matter if tuple indexing starts at 0 or 1
writeln(tup(tup.highIdx)); // always 30, no matter if tuple indexing starts at 0 or 1
for i in tup.indices do writeln(tup(i)); // always outputs 10 20 30, no matter if tuple indexing starts at 0 or 1
}

This functionality will be useful for people trying to write Chapel code that works with both previous 1-based releases and the upcoming 0-based release. Additionally it will be useful for programmers that (for whatever reason) wish to avoid encoding whether tuples start from 0 in their programs.

Specifically for the case of tup.indices in the above example, that is a simpler and more elegant way of writing it than the current practice of in any case.

for i in 1..tup.size do writeln(tup(i)); // current practice
for i in tup.indices do writeln(tup(i)); // this issue requests

In that specific use case, the feature this issue requests allows the loop to be written in a lower-complexity manner.

bradcray commented 4 years ago

Just to summarize my current thinking on this issue: I was swayed by @ben-albrecht's well-received comment proposing that we only support indices and require .low/.high/.first/.last to be applied to the result of .indices rather than creating a short-circuit for it. This has the benefit of keeping the interface smaller and avoiding the need to make that naming decision. I also made progress on #14929 on Friday which was the main case that was calling for direct access to low and high. Also, we can always add more helpers later if/when people want and request them, whereas removing them is harder. Note that there may be reason to do so given that they return params in some cases which are not available via ranges (in part because we don't support param ranges today; in part because in cases like strings and bytes, the high generally cannot be a param, where the low can be).

I still feel on the fence about whether the indices method should take parens or not. I'm going to set up a few comments below for those who are following this to vote on.

bradcray commented 4 years ago

"Like" this comment to express support for .indices (you can also post a follow-up comment rationalizing your vote)

Arguments in favor:

bradcray commented 4 years ago

"Like" this comment to express support for .indices() (you can also post a follow-up comment rationalizing your vote)

Arguments in favor:

vasslitvinov commented 4 years ago

I voted for the no-paren version because it is akin to .domain on an array.

ben-albrecht commented 4 years ago

Still undecided here. If it is always a range, I am in favor of indices. If it is always an iterator, I am in favor of indices(). If it can be either (which it sounds like it can be), I am not sure.

lydia-duncan commented 4 years ago

I feel like it would be an easy logical leap to extend this method to include other arguments controlling what is returned, so it would make sense to me to have parentheses in case we follow that route later

bradcray commented 4 years ago

to include other arguments controlling what is returned

What kinds of arguments do you have in mind?

lydia-duncan commented 4 years ago

Dimensional limitations in the case of the array version, maybe an explicit stride?

bradcray commented 4 years ago

That doesn't feel in the spirit of "tell me what your indices are" to me, particularly since it doesn't extend to any of the 1D cases, which are the main motivator for this query. Seems like if someone wanted that, they should write something like myArray.indices.dim(1) by 4.

dlongnecke-cray commented 4 years ago

I voted for .indices.

for currently planned use-cases (tuples, strings, bytes, arrays, lists), it represents an O(1) operation that returns the indices in closed form (a range or domain), so is arguably "property-like" as opposed to "computation-like"

I see that "O(1)ness" as an important constraint. For everything else, there's an iterator...

e-kayrakli commented 4 years ago

I read this multiple times without a clear decision, but I am mildly leaning towards, and voted for indices(). The arguments that I can make about this:

bradcray commented 4 years ago

Keep in mind that (without caching the result), string.size is not an O(1) operation either, yet has no parens. I.e., I don't think there's anything that mandates that paren-less functions that are querying properties that could be stored as a field be O(1); it just makes it an easier choice when they are.

ronawho commented 4 years ago

I voted for .indices, but I don't have a strong preference between the two.

I also found myself wondering if .domain is a good term for this. I think it's accurate (from the spec: "A domain is a first-class representation of an index set") but I'm not sure that I'd want to pay the performance overheads of our domains (heap allocation, reference counting)

mppf commented 4 years ago

@ronawho - that's an intriguing idea. If we imagine for a moment that we had a different type for 1-D non-strided and anonymous domains... well that could just be basically the range type, right?

Separately though I think part of the proposal is that .indices return a range and if the type system distinguishes between domain and range and considers them different types, then I'm not sure I'd want to have a .domain method that returned a range (say). I think you're saying that tuple.domain could literally return a domain, if our implementation for that kind of domain was tight enough. But even then, AFAIK domains have some different methods than ranges that make them more complex to work with supposing you only cared about 1-D non-strided domains. But, that might be not a big problem in practice and might be addressable with the right methods on domains.

vasslitvinov commented 4 years ago

BTW there is also .shape

We could define .shape or .indices to be somewhat like .domain:

Right now .shape returns a tuple of integers, indicating the number of elements in each dimension a la NumPy's ndarray.shape. Proposed in #5178 by @ben-albrecht. How much are we attached to that?

ben-albrecht commented 4 years ago

If it can be either (which it sounds like it can be), I am not sure.

My indecision was based on the assumption that indices could sometimes be an iterator rather than always a range. However, reading back through the comments, I have not seen a concrete example of this.

Assuming that indices will only apply to ordered data structures with numeric indices, are there cases where indices must be an iterator rather than a range?

vasslitvinov commented 4 years ago

Assuming that indices will only apply to ordered data structures with numeric indices, are there cases where indices must be an iterator rather than a range?

If we apply it to an associative domain/array over integers, it would be an iterator.

ben-albrecht commented 4 years ago

One could argue that this method/term only makes sense for numeric indices (vs "keys"). Then it might not apply to associative domains/arrays either.

If we apply it to an associative domain/array over integers, it would be an iterator.

Would associative domain/arrays over integers support indices though? It would be strange for associative arrays and domains to support an indices method depending on the domain idxType.

bradcray commented 4 years ago

@ronawho: My thinking was similar, but the opposite: That maybe we could retire the somewhat funny .domain query on arrays (funny in that it relies on a reserved keyword as a method name) in favor of .indices over time (or now, if there was support). .indices has the benefit of being a more user-approachable concept while also not suggesting that we need to create domains for types that don't have them (i.e., I think it's reasonable to have .indices return a range for a tuple and a domain for an array).

@vasslitvinov: I don't think we should re-define shape for this purpose. We talk about rectangular arrays being assignable to one another if they have the same shape. If shape means "indices" then we have to be careful about that. Besides which, nothing about the word shape suggests "what specific indices can I use to access you" to me.

@ben-albrecht: If we attached .indices to things that were irregular and didn't have pre-stored set of indices associated with them, then it might be O(n)—either because one might implement it as an iterator (who'd know?) or because they might run that iterator internally to build up the set of indices and return it as... a collection? a domain? something. But you're right that none of the data structures that currently support .indices in my PR use an iterator for that purpose.

@vasslitvinov wrote:

If we apply it to an associative domain/array over integers, it would be an iterator.

But that's not right (I mis-spoke in the OP and corrected myself): it returns the pre-existing associative domain itself, identically to how .domain does today.

@ben-albrecht wrote:

Would associative domain/arrays over integers support indices though?

They do in my PR today (because it was easy / seemed orthogonal).

gbtitus commented 4 years ago

I voted for indices() because in an ideal start-from-scratch world I would want this to be the single best way to ask for the indices of a 1D sequence type. It would be applicable to tuples, strings, bytes, lists, arrays, and domains. It would take an optional param rank (default 1) and it would always return a range. D.dim() where D was a domain would go away and its uses would be replaced by D.indices(). Then you could write representation-agnostic generic code such as the following, without having to know whether x was an array or a string or whatever:

if x.type.rank == 1 && /* if we had it */ isSequenceType(x.type) {
  // something with x.indices()
}
bradcray commented 4 years ago

@gbtitus: I'm not clear on what the role of the optional param rank argument is in your proposal: What would my3DArr.indices(rank=2) do? What would my2DSparseArr(rank=2) do?

Then you could write representation-agnostic generic code such as the following, without having to know whether x was an array or a string or whatever

I don't see anything about the current proposal that would prevent you from writing this same code, simply without the parens.

mppf commented 4 years ago

@ronawho: My thinking was similar, but the opposite: That maybe we could retire the somewhat funny .domain query on arrays (funny in that it relies on a reserved keyword as a method name) in favor of .indices over time (or now, if there was support). .indices has the benefit of being a more user-approachable concept while also not suggesting that we need to create domains for types that don't have them (i.e., I think it's reasonable to have .indices return a range for a tuple and a domain for an array).

Wouldn't that present problems with forall and distribution of work? E.g. for a 1-D Block distirbuted array, if .indices returned a range, and we didn't have .domain, forall i in MyBlockArr.indices { } would not run across all the locales on which the array is distributed. How could .indices and .domain be unified given that?

bradcray commented 4 years ago

Wouldn't that present problems with forall and distribution of work?

Maybe I missed a proposal somewhere along the way. In my proposal and branch, .indices on an array always returns its domain, even if it's 1D.

mppf commented 4 years ago

Oh, sorry I misunderstood. You were saying we'd simply deprecate .domain but that .indices would do literally the same thing for arrays (i.e. return a domain). That sounds reasonable to me, although I don't have a strong feeling about it either way.

bradcray commented 4 years ago

Right. I'm not proposing that in this issue, but I think it would make sense in that it would remove functionality that's about to become accessible in two ways (.domain and .indices) while also removing the weirdness of supporting a method call that's a reserved word (and one that a user can't define themselves, at least without knowing the hidden secrets of our implementation). And if we were to either make all these types respect .domain or all of them respect .indices I think .indices is the obvious better answer (clearer meaning, doesn't rely on reserved word, doesn't imply things have domains when they don't).

ben-albrecht commented 4 years ago

Something bothers me about a paren-less method (x.indices) returning an iterator. Are there many other instances of this in the standard library?

Also, I wonder if will present a challenge with FCF/FCI (#15058) support, where a user would like to capture the callable iterator. If we used parens, they could do this by omitting the parens:

var method = data.indices;
for i in method() {
  ...
}

If we decide indices always return a range or domain then I am strongly in favor of no parens (indices).,

gbtitus commented 4 years ago

@bradcray The parens are so you can do:

if /* if we had it */ isSequenceType(x.type) {
  for param rank in 1..x.type.rank {
    // something with x.indices(rank)
  }
}

I don't see anything about the current proposal that would prevent you from writing this same code, simply without the parens.

I was worried about the fact that in the current proposal, for an array indices (with or without parens) returns a domain, not a range, and what that might imply about one's ability to write the same code no matter what kind of sequence type was involved.

bradcray commented 4 years ago

Something bothers me about a paren-less method (x.indices) returning an iterator. Are there many other instances of this in the standard library?

I think the first thing we need before being too bothered by this concern is for someone to come up with a compelling case of a type that would want .indices to return an iterator (or what I've been saying: have it be implemented as an iterator).

[I apologize if I opened up this Pandora's box myself by mistakenly stating that it would need to be an iterator in the cases of sparse or associative arrays.]

bradcray commented 4 years ago

for param rank in 1..x.type.rank { // something with x.indices(rank) }

The problem I see with this approach is that the indices of a multidimensional collection aren't always representable as the cross product of per-dimension index ranges, sparse arrays being a key example. It also has the downside of injecting numbering ("are dimensions 0-based or 1-based?") into an interface that's being designed to avoid the need for such numbering in many cases.

I was worried about the fact that in the current proposal, for an array indices (with or without parens) returns a domain, not a range, and what that might imply about one's ability to write the same code no matter what kind of sequence type was involved.

You're obviously correct in saying that if .indices returns different types in different situations then a given piece of code won't necessarily be generic across all types. But I also think that's OK because I don't view the primary goal of .indices as being to support generic programming. Rather, I view its goals as being (1) calling a property that's common across a number of types by a consistent name for familiarity, (2) removing hard-coded assumptions about what indices our built-in types use (e.g., "0-based or 1-based?")

By analogy, .size doesn't return the same type for all ranges, domains, files, etc. but it's still helpful to have a consistent name for it.

That said, like .size, I also believe there's also enough commonality across domains and ranges (the main types I expect .indices to return) that many simple patterns will translate across different scenarios even though the types returned may differ. By analogy, many forall loops over domains, arrays, ranges, or tuples can be written similarly in simple cases regardless of the fact that the index types may turn out to be very different. But it's also easy to write forall loops (or code near .size queries for that matter) that are sufficiently sensitive to the specific type being used that they'll break in other cases. But I don't see that as an indication that forall loops or .size queries are broken, just that when writing generic-ish code, the author needs to be mindful of what assumptions they're baking into it.

gbtitus commented 4 years ago

... the indices of a multidimensional collection aren't always representable as the cross product of per-dimension index ranges, ...

and

... has the downside of injecting numbering ("are dimensions 0-based or 1-based?") into an interface that's being designed to avoid the need for such numbering ...

Points taken, and I guess for the latter one would just have to go through x.domain.dims() to get to the dimensions, so indices or indices() wouldn't be involved anyway.

I've changed my vote now, because this query seems more data-like than computation-like to me.

ben-albrecht commented 4 years ago

Something bothers me about a paren-less method (x.indices) returning an iterator. Are there many other instances of this in the standard library?

I think the first thing we need before being too bothered by this concern is for someone to come up with a compelling case of a type that would want .indices to return an iterator (or what I've been saying: have it be implemented as an iterator).

[I apologize if I opened up this Pandora's box myself by mistakenly stating that it would need to be an iterator in the cases of sparse or associative arrays.]

OK. I think that clears things up for me.

To summarize: The current design is that indices will always be a domain or range (unless someone proposes a case for otherwise), therefore I am voting in favor of .indices.

bradcray commented 4 years ago

Thanks to everyone for stepping up to vote and for caucusing. Based on the votes and discussion, let's go with paren-less .indices (it may not have felt like it, but I was pretty on the fence when I posted the voting options, and the discussion definitely helped me make up my mind as well).