Should arrays have a different type when they cannot be resized?

mppf commented 10 months ago

I'm thinking about arrays for two reasons:

We have a long-standing desire to implement simpler arrays with a simpler data structure. E.g. we might like to implement some arrays with something more like c_array or a tuple (which stores elements "in place" rather than through a pointer).
In discussion with @ronawho we talked a bit about the overhead of random access into arrays (in the context of random updates to a Block arrays through aggregation). While that specific problem probably needs a specific solution, it got me wondering if it would be possible to reduce the overhead by reducing the levels of indirection.

Background

Chapel's array representation currently works like this:

the array value is represented by a record _array which contains a pointer to a (potentially privatized) class instance implementing the array, for example class DefaultRectangularArr
the class instance implementing the array in turn contains a pointer to the elements. E.g. class DefaultRectangular has a field shiftedData.

Interaction with Other Features

Could we consider changing the current structure of Chapel arrays? To do so, we need to be aware of these 3 interactions with other features:

Chapel has always had an emphasis on supporting module- or user-defined array types. The current implementation uses parent classes & something like pure virtual methods to provide a kind of interface for arrays.
When an array is privatized, the record _array effectively points to a privatized copy of the class. Note that non-distributed arrays (e.g. class DefaultRectangularArr) are not generally subject to privatization.
It makes resizing arrays by modifying domains possible. The domains have a list of pointers to array class instances over that domain. When a domain is resized, both the array metadata within that class as well as the elements pointer needs to change. Currently, that is handled by calling virtually-dispatched methods on the array instance.

Of these, (1) is arguably an implementation detail that can be better handled by future work on the interface feature. And (2) is also arguably an implementation detail that can be better handled by changing how privatization works (see also #23877). However, (3) seems fundamental to the language.

In #22770, we discussed changing the way arrays are resized through their domain and opted not to change that feature broadly. However, that issue did generate some examples where users need to be aware of whether or not an array is declared over a mutable domain (e.g. https://github.com/chapel-lang/chapel/issues/22770#issuecomment-1693835007 ). Additionally, that issue generated the proposal that perhaps users could write an array declaration in a way that opts out of the array being resizable (e.g. var A: [const D] int).

Other Implementation Possibilities

If we ignore the above concerns for a moment, we can consider implementing arrays in ways that do not have one or both of these levels of indirection:

If an array has a compile-time known size and it cannot be resized, the elements can be stored in-place a manner more similar to c_array or a tuple, which have neither of the pointers/levels of indirection.
If an array cannot be resized, the class DefaultRectangular object's fields could be stored in the record _array. Such an array could still have a runtime-varying size, as long as it is not resized after it is created.
Alternatively, if an array cannot be resized, the array elements could be stored in a single allocation at the end of the class DefaultRectangular instance. It's my understanding that Fortran (or some Fortran implementations?) use this kind of approach (where it stores the array metadata at the start of an allocation and then the elements after). As with the above approach, such an array could still have a runtime-varying size, as long as it is not resized after it is created.

Could the array type include resizability?

Let's suppose for a moment that we wish to implement fixed-sized arrays with c_array, or something like it, assuming it is built up to have all the usual array functionality. The question is, does the compiler consider such fixed-sized arrays to be of a different type than other arrays? If it does not, is it possible to implement some arrays with something like c_array?

If the compiler considers them different types, I expect the difference to be observable to users. For example, perhaps the following program would no longer work. (Perhaps there could be implicit conversion between these kinds of array types. But, we wouldn't want to copy the elements when doing so, and so it seems a bit fraught. And, even implicit conversion has limits: I'm pretty sure at some point the user will be able to know that they are different types.

use List;

var A: [1..4] int;
var lst: list(A.type);
lst.pushBack(A);

var D = {1..4};
var B: [D] int;
lst.pushBack(B);

writeln(lst);

But, if the compiler considers them to be the same type, I don't understand how we could implement the improvement at all. I think we would be limited to doing things like hoisting the pointer to the elements in loops. I'm not seeing how it would be practical to implement an optimization that makes such a radical change to the data structure after all the types are already set. It might be possible but I'm not seeing it.

bradcray commented 10 months ago

I haven't had time to digest this issue yet, but wanted to mention that one of the spare-time activities I've been hoping to have time to try in the coming month is creating a "10-element, in-place" domain map using a record value instead of a class value and a c_array to store the elements to understand the extent to which we assume the current class-based approach in the array/domain machinery and what it would take to relax it.

mppf commented 10 months ago

In conversation with @benharsh, he pointed out that even with an in-place array, we could wrap it in an array view (or even possibly a DefaultRectangular). The trouble is, with this approach, we will still have the double-indirection in code working with the array.

I think what I'm wondering here is if we can avoid the double-indirection overhead for programs that don't rely on domain-assignment-resizes-array.

mppf commented 10 months ago

In talking with @e-kayrakli, I asked about why we didn't make the "const domain" part a param in earlier work. Engin pointed out that it makes challenges when that's part of a domain, e.g. can't assign a non-const domain to a const one, because the types differ. He also brought up a second issue where in some case domain creation was gradual & we needed to be able to set const-ness at runtime & he can try to find more about this case by looking for cases where definedConst is assigned to.

However, these cases don't seem to directly impact the idea proposed above, which is that the array (and not the domain) would know at compile-time if the domain was immutable / if the array is subject to resizing. The challenge with simply infering if the domain is declared const is that the compiler will not always be able to infer if the domain variable itself is immutable (e.g. if you have an array declared over a const ref domain, there could be an arbitrary amount of code between a const / var domain variable and the use of the const ref). But, it's certainly the case that we can know the domain is immutable in a case like var A:[1..10] int (and we would probably require such syntax for in-place arrays anyway since the bounds will have to be param).

mppf commented 10 months ago

I created issue #23890 about a long-term strategy to remove the indirection. This strategy does not require a different type for arrays that cannot be resized. Of course, we might decide that having a different type is still worthwhile (for type checking reasons or for implementation flexibility).

mppf commented 10 months ago

I'm currently thinking that the strategy in #23890 is unworkable due to the potential for a domain being assigned while an array is being moved: AFAIK, it would require locking for all of the array moves.

That brings it back to this issue, which asks if we can simplify and optimize the representation only for arrays that do not need to support resizing.

damianmoz commented 9 months ago

I think an array which allows itself to be resized is a luxury. So having the base array type not be resizable, i.e. stay at the size where it was first instantiated, seems reasonable.

Even if I wanted an array which was able to be resized, I would try and lock down its size as soon as possible into an array which was not resizable. The number of times I have to resize an array over the last few decades is very few. The last time of now was programming a matrix profile minimization routine. I then spent a bit more time on the task and eventually managed to avoid the resize.

But, that is based on my own experience and the type of problems I have to address Others may have a different need.

My 2c.

mppf commented 9 months ago

I'm currently thinking that the strategy in #23890 is unworkable due to the potential for a domain being assigned while an array is being moved: AFAIK, it would require locking for all of the array moves.

That brings it back to this issue, which asks if we can simplify and optimize the representation only for arrays that do not need to support resizing.

Elaborating a bit here: because of https://chapel-lang.org/docs/language/spec/domains.html#parallel-safety-with-respect-to-domains-and-arrays we can arguably outlaw assigning a domain while an array over that domain is being operated on in any way. However, we have to still support parallel tasks creating and destroying arrays over a given domain, and so the array move would still require locking / atomic updates. (See also https://github.com/chapel-lang/chapel/issues/23890#issuecomment-1874258924 ).

mppf commented 8 months ago

Of course, we can have a different distribution (aka layout or domain map) for different arrays, and different distributions can not allow domain resize. The issue is that I'm not sure we can automatically do that from something like var A: [1..100] int without a breaking change.

chapel-lang / chapel