chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Improve the performance of slices and rank change operations #24343

Open mppf opened 8 months ago

mppf commented 8 months ago

The performance of array slice and rank-change operations is not what users expect. Users coming from other languages (at least Fortran, and I think NumPy) expect these slice operations to be fast, but they have high overhead in Chapel today. I think we need to prioritize improving this situation, as it has been a show-stopper for some potential users. I'm creating this issue to help us track and prioritize this effort.

Related issues / posts:

Implementation ideas & workarounds:

bradcray commented 8 months ago

FWIW, in recent performance team discussions, we've been referring to this as one of our top-two known user performance issues (the other one being multidimensional zippering).

damianmoz commented 8 months ago

Sorry. Please note corrections and updated attached files.

To test the overhead that slicing an array introduces, I sort a target 1D array 't' using either a customized TimSort or Chapel's standard RadixSort. The program will (if asked) verify the TimSort results against RadixSort. Both sort methods are parallelized. TimSort first sorts the array 't' into little chunks with insertion sort, and then merges the little chunks with an optimized MergeSort. The array is not distributed; only a single locale is used.

A user can (optionally) select a target 1D array 't' to be declared as:

t[L .. #M]
t[L .. by 1 #M]
t[L .. by 2 #M]
t[L .. by S #M]

i.e. an array of size M with either an implicit stride of 1, or an explicit stride of 1, 2, or some constant S, the last being provided as a runtime option. The arrays fit into the CPU cache, so cache misses are not an issue. An option is provided that allows the sort routine to work on a reindex'ed reference to the original array, where that reindex'ing is done to match the first case above, i.e. a 1D array with an implicit stride of 1.
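For readers unfamiliar with reindexing, a minimal sketch of the idea (the names t, L, S, and M follow the option list below; the literal values are illustrative only, and this is not the attached program's code):

```chapel
config const L = 1, S = 2, M = 27777;

// strided target array: M elements starting at L with stride S
var t : [L.. by S # M] real;

// unit-stride view over the same elements, indexed 1..M;
// sorting through this view is what the reindex option selects
ref r = t.reindex(1..#M);
```

Binding the result of `reindex` to a `ref` keeps it an alias of the original elements; assigning it to a `var` would instead copy them.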

The target array is sorted 'as-is' by default. An option allows the program to instead sort the reindex'ed reference to the target array. Using TimSort on an 'as-is' array with a stride other than 1 (known at compile time) runs much slower than sorting its reindex'ed ref.

NOTE reindex'ing does not affect the performance of RadixSort.

At every pass during the MergeSort, a custom binary search routine is called in one of two ways (selectable by option):

* binarySearch(t, trightlow, _lo, cut)
* binarySearch(t[_lo .. cut by s], trightlow)

The former passes the complete target array 't' but only searches for 'trightlow' within the range _lo .. cut. The latter passes the slice of 't' to be searched, and searches across the entirety of that slice, implicitly the same range. The binarySearch routine visits the same elements of 't' and does the same comparisons in both cases. While the arguments differ between the two cases, both are references; no copy is requested.

The sliced case runs much slower than the unsliced case. See below.
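As a concrete illustration of the two calling conventions, here is a minimal, hypothetical lower-bound binary search written both ways. This is not the attached program's routine; the names, signatures, and the assumption of a positive stride are all mine:

```chapel
// explicit-bounds form: pass the whole array plus the subrange to search;
// assumes 't' is sorted ascending over its strided indices
proc binarySearch(const ref t : [] real, key : real,
                  in lo : int, in hi : int) : int {
  const s = t.domain.stride;                   // assumed positive here
  while lo <= hi {
    const mid = lo + ((hi - lo) / (2*s)) * s;  // stride-aligned midpoint
    if t[mid] < key {
      lo = mid + s;
    } else {
      hi = mid - s;
    }
  }
  return lo;   // first index whose element is >= key
}

// slice form: the search bounds are implicit in the slice's own domain,
// so the caller writes binarySearch(t[_lo .. cut by s], trightlow)
proc binarySearch(const ref t : [] real, key : real) : int {
  return binarySearch(t, key, t.domain.low, t.domain.high);
}
```

Both overloads touch exactly the same elements; any performance difference between them comes from constructing the slice expression at the call site, not from the search itself.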

The program, and a shell script to drive it, are attached: m.chpl.txt testm.txt

Here are the options to the program which sorts an array 't' of data.

// simple options

config const N = 10000; // number of times the program is run

config const V = 'n';   // verify TimSort results match RadixSort
config const U = 'y';   // pass unsliced or sliced 't' to binary search

config const M = 301;   // number of elements in target array
config const S =   1;   // stride of 't'
config const L =   1;   // lowerbound of 't'
config const B =   6;   // TimSort blocksize is 2^B

// mode is one of
//
// 't'  TimSort
// 'r'  RadixSort

config const mode = 't';

// I specifies that either:
//
// 'y' : use a reindex'ed reference to 't' i.e. as 1..#M
// 'n' : use original 't'

config const I = 'n';

//  kind specifies the domain of the array to be sorted:
//
//  1   t[L .. by 1 #M]
//  2   t[L .. by 2 #M]
//  x   t[L .. by S #M] - specify S above
//  u   t[L .. #M]
//
config const kind = 'x';

Here are some results using the shell script provided:

Here ..tim.. (and .radix.) indicate that TimSort (respectively RadixSort) was used.

UNSLICED

S=2:
 6.474 .radix. by S as-is     1..55554 by 2  Size = 27777
12.200 .radix. by S Reindexed 1..55554 by 2  Size = 27777

19.397 ..tim.. by S as-is     1..55554 by 2  Size = 27777
 4.863 ..tim.. by S Reindexed 1..55554 by 2  Size = 27777
19.245 ..tim.. by 2 as-is     1..55554 by 2  Size = 27777
 4.817 ..tim.. by 2 Reindexed 1..55554 by 2  Size = 27777
 4.912 ..tim.. unit as-is     1..27777       Size = 27777

19.256 ..tim.. by S as-is     1..55554 by 2  Size = 27777
 4.863 ..tim.. by S Reindexed 1..55554 by 2  Size = 27777
19.271 ..tim.. by 2 as-is     1..55554 by 2  Size = 27777
 4.943 ..tim.. by 2 Reindexed 1..55554 by 2  Size = 27777
 4.872 ..tim.. unit as-is     1..27777       Size = 27777

S=1:
15.802 .radix. by S as-is     1..27777       Size = 27777
12.134 .radix. by S Reindexed 1..27777       Size = 27777

18.778 ..tim.. by S as-is     1..27777       Size = 27777
 4.882 ..tim.. by S Reindexed 1..27777       Size = 27777
 4.962 ..tim.. by 1 as-is     1..27777       Size = 27777
 4.905 ..tim.. by 1 Reindexed 1..27777       Size = 27777
 4.834 ..tim.. unit as-is     1..27777       Size = 27777

18.906 ..tim.. by S as-is     1..27777       Size = 27777
 4.890 ..tim.. by S Reindexed 1..27777       Size = 27777
 4.835 ..tim.. by 1 as-is     1..27777       Size = 27777
 4.849 ..tim.. by 1 Reindexed 1..27777       Size = 27777
 4.845 ..tim.. unit as-is     1..27777       Size = 27777

SLICED

S=2:
 6.293 .radix. by S as-is     1..55554 by 2  Size = 27777
12.200 .radix. by S Reindexed 1..55554 by 2  Size = 27777

19.555 ..tim.. by S as-is     1..55554 by 2  Size = 27777
 7.244 ..tim.. by S Reindexed 1..55554 by 2  Size = 27777
19.259 ..tim.. by 2 as-is     1..55554 by 2  Size = 27777
 7.325 ..tim.. by 2 Reindexed 1..55554 by 2  Size = 27777
 4.988 ..tim.. unit as-is     1..27777       Size = 27777

19.414 ..tim.. by S as-is     1..55554 by 2  Size = 27777
 7.253 ..tim.. by S Reindexed 1..55554 by 2  Size = 27777
19.364 ..tim.. by 2 as-is     1..55554 by 2  Size = 27777
 7.869 ..tim.. by 2 Reindexed 1..55554 by 2  Size = 27777
 5.313 ..tim.. unit as-is     1..27777       Size = 27777

S=1:
15.169 .radix. by S as-is     1..27777       Size = 27777
12.083 .radix. by S Reindexed 1..27777       Size = 27777

18.975 ..tim.. by S as-is     1..27777       Size = 27777
 7.362 ..tim.. by S Reindexed 1..27777       Size = 27777
 5.023 ..tim.. by 1 as-is     1..27777       Size = 27777
 7.289 ..tim.. by 1 Reindexed 1..27777       Size = 27777
 5.046 ..tim.. unit as-is     1..27777       Size = 27777

19.135 ..tim.. by S as-is     1..27777       Size = 27777
 7.336 ..tim.. by S Reindexed 1..27777       Size = 27777
 5.025 ..tim.. by 1 as-is     1..27777       Size = 27777
 7.392 ..tim.. by 1 Reindexed 1..27777       Size = 27777
 5.089 ..tim.. unit as-is     1..27777       Size = 27777

damianmoz commented 8 months ago

Please note updated text, results and attached files (in the previous post).

damianmoz commented 8 months ago

If you disable every statement except the return in my custom binarySearch routine, it computes only an approximation to the correct result. But it is still useful enough, on an array of fewer than 50,000 elements, for testing purposes. It means that what is being measured is the overhead of the call with the array 't' sliced to the indices relevant to the task being performed at this pass, where

binarySearch(t[_lo .. _hi by s], trightlow);

is the calling sequence.

The slowdown seen is about 50%, which I think means that a calling sequence that includes a slice costs about 50% of the work involved in merging two sorted halves of an array, i.e. the copying of the array back into itself, which is not an insignificant amount of work in its own right. I do not know how slicing works internally, but as Brad noted in another post which I cannot find, maybe it needs to use some sort of dope-vector approach, which is the way Algol worked all those years ago.
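For what it's worth, the dope-vector idea can be sketched as a plain record: a slice becomes a small descriptor over the parent's storage, created in O(1). This is purely illustrative; the names are invented and it is not how Chapel's array views are actually implemented:

```chapel
// hypothetical dope vector for a 1D slice: three integers against
// shared storage, instead of a full domain/array-view object
record dopeVector {
  var lowIdx : int;    // parent index of the slice's first element
  var size   : int;    // number of elements in the slice
  var stride : int;    // distance between consecutive elements
}

// element i (0-based) of the slice lives at this parent index
proc dopeVector.parentIndex(i : int) : int {
  return lowIdx + i*stride;
}

// re-slicing a descriptor is itself O(1): just arithmetic on the fields
proc dopeVector.slice(first : int, n : int, step : int) : dopeVector {
  return new dopeVector(parentIndex(first), n, stride*step);
}
```

Under such a scheme, passing `t[_lo .. _hi by s]` to a routine would cost a few integer operations rather than constructing a full array view.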

e-kayrakli commented 3 months ago

https://github.com/chapel-lang/chapel/pull/24390 will address some of the most common slice and rank-change operations in the context of whole-array operations.

As a more future-looking way of expanding what that PR does, I'll capture some thoughts here. That PR introduces the type chpl__protoSlice, which is hidden from the user and used only when the compiler can determine that a statement is suitable for proto-slices. In an ideal world (and not thinking too deeply about how one might implement something like that), we could make all slicing/rank-changing expressions create such a type instead of an array view, as they do today. Then, instances of that type could be "upgraded" to a full-blown array view depending on the context in which they are used. Now, I don't think the objects themselves can make that decision, but something like that could be implemented as a coercion of sorts in the compiler.

Note that the current compiler implementation after that PR is proactive: the compiler introduces values of that type only where it matches very specific patterns. What I am suggesting is more reactive: values of that type would be created in all cases, and then coerced to full array views in the scenarios that require them.

That's because, while I think there's a ton of room to improve upon that PR to make whole-array operations involving array views faster, expanding it to some of the cases mentioned in this issue takes a different compiler approach.