Optimization to convert heap allocated Chapel arrays to stack Allocated arrays

stonea commented 1 year ago

I'm throwing this out here as more of a brainstorming idea than anything I have an immediate need for. Chapel arrays are heap allocated; if the user wants something stack allocated they can use tuples or c_array (and we may have some other type to do this explicitly in the future).

I think having an explicit way to do this makes sense but in practice I imagine most Chapel programmers are going to use Chapel arrays without giving a second thought to "is this something that would better be stack allocated".

Perhaps we should have an optimization that should automatically convert a heap allocated Chapel array into a stack allocated array when the bounds of the array are statically known and the array is small. We have a similar optimization where if the domain is constant we don't track arrays that use that domain because we'll never need to relocate them.

I'm creating this issue as a reaction to some conversation here (https://github.com/chapel-lang/chapel/issues/20290#issuecomment-1487950125), where the underlying Github issue there is more generally with allocating arrays from within GPU kernels.

bradcray commented 1 year ago

I'm generally on-board with this direction, particularly if we provide layouts (local domain maps) that let a user specify that they explicitly want stack-allocated or heap-allocated memory to avoid being subject to the compiler's whims.

I'm not immediately seeing why we'd only want to do this if the array was small and statically known (in the sense that alloca should be able to handle it for large arrays). Is the idea "if it's large, we could overrun the stack, and should never do that" combined with "and we need to make this decision statically to ensure that the array's type statically known?" If so, I guess I buy that.

One other factor might be whether we access the array within forall/coforall loops (or begin/cobegins) or just for/foreach loops. Specifically, if we apply a forall, multiple tasks are going to want to access the array, so putting it on one task's stack may be odd. Whereas if we were to use for/foreach loops, it'd be fine for it to be on the current task's stack.

Finally, Engin noted elsewhere that with our current implementation of arrays, the metadata for the array is stored in a class, which is heap-allocated and seems unfortunate if the goal is for the array data itself to live on the stack. There've been a few cases where we've wanted to explore what would break if this metadata were stored as a record rather than a class (the array wrapper class is completely generic, so shouldn't care, but almost certainly something else would break); and what it would take to improve our domain map infrastructure to support record-based metadata, which would keep it on the stack.

[There was also some discussion about this topic in yesterday's deep-dive chat channel, for those within HPE]

vasslitvinov commented 1 year ago

Here are several improvement vectors and other questions that are orthogonal:

How many objects need to be created for a single Chapel array? Ideally, a single "object" that contains both the storage for the array elements and any metadata. We need a separate domain object if the user extracts it from the array, ex. var B: [A.domain] real;
Where should those objects be allocated? Ideally, on the stack when possible. See Brad's analysis of stack allocatability above. Another case where we want heap allocation is if it is a large array that is returned by value.
How much can we optimize if the array size is known at compile time? What if we replace such an array with a homogeneous tuple?
How much trouble should the user go into to have these optimizations?
Should we encourage users to use c_array when there is no interop?

mppf commented 1 year ago

Perhaps we should have an optimization that should automatically convert a heap allocated Chapel array into a stack allocated array when the bounds of the array are statically known and the array is small.

We need the compiler to decide if the array is potentially stack-allocate-able (stack-able?), but it is easy enough to fall back on a malloc/free pattern if it is not.

The size does not need to be known at compile time. We can have a runtime check that the size is small enough for stack allocation, and allocate on the heap otherwise.

I'm not immediately seeing why we'd only want to do this if the array was small and statically known (in the sense that alloca should be able to handle it for large arrays). Is the idea "if it's large, we could overrun the stack, and should never do that" combined with "and we need to make this decision statically to ensure that the array's type statically known?" If so, I guess I buy that.

To handle dynamic sizes, we can figure out the array type as "stackable array" but that type can have a runtime option of using a heap allocation.

And, indeed, if somebody wants to create a 16 MB array, that would overflow the typical stack which is 8 MB. So we need some sort of threshold. I'd start with a limit around 64 KB or so (because as the stack grows due to additional function calls, there might potentially be many such allocations; so clearly something like 4 MB or even 1 MB would be too big).

See also #10732 which is a more general request in this area. #10732 is asking an analysis that would put certain class instances on the stack -- it could potentially do this for array allocations as well but we might also benefit from an array-specific solution here.

chapel-lang / chapel

Optimization to convert heap allocated Chapel arrays to stack Allocated arrays #22004