mppf opened 8 months ago
FWIW, in recent performance team discussions, we've been referring to this as one of our top-two known user performance issues (the other being multidimensional zippering).
Sorry. Please note corrections and updated attached files.
To test the overhead that slicing an array introduces, I sort a target 1D array 't' using either a customized TimSort or Chapel's standard RadixSort. The program will (if asked) verify the TimSort results against RadixSort. Both sort methods are parallelized. TimSort first sorts the array 't' into small chunks with insertion sort, and then merges those chunks with an optimized MergeSort. The array is not distributed; only a single locale is used.
A user can (optionally) select a target 1D array 't' to be declared as:
t[L .. #M]
t[L .. by 1 #M]
t[L .. by 2 #M]
t[L .. by S #M]
i.e. an array of size M with either an implicit stride of 1, or an explicit stride of 1, 2, or some constant S, the last being provided as a runtime option. The arrays fit into the CPU cache, so cache misses are not an issue. An option allows the sort routine to work on a reindex'ed reference to the original array, where that reindex'ing is done to match the first case above, i.e. a 1D array with an implicit stride of 1.
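For concreteness, here is a minimal sketch of what such a strided array and its reindex'ed reference might look like in Chapel. The variable names are illustrative assumptions, not taken from the attached program:

```chapel
// Illustrative sketch: a strided 1D array and a unit-stride reindex'ed view.
config const L = 1, M = 27777, S = 2;

// strided domain: M elements starting at L, with stride S
var D = {L.. by S # M};
var t: [D] int;

// reindex to a unit-stride, 1-based view over the same elements;
// sorting through tRef is the "Reindexed" case in the results below
ref tRef = t.reindex(1..M);
```

The reindex'ed view aliases the same memory; only the index set presented to the sort routine changes.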
The target array is sorted 'as-is' by default. An option allows the program to instead sort the reindex'ed reference to the target array. Using TimSort on an 'as-is' array with a stride other than 1 (known at compile time) runs much slower than sorting its reindex'ed reference. NOTE: reindex'ing does not affect the performance of RadixSort.
At every pass during the MergeSort, a custom binary search routine is called as either (option-selectable):
* binarySearch(t, trightlow, _lo, cut)  // or
* binarySearch(t[_lo .. cut by s], trightlow)
The former passes the complete target array 't' but only searches for 'trightlow' in the range _lo .. cut. The latter passes the slice of 't' to be searched and searches across the entirety of that slice for 'trightlow', implicitly the same range. The binarySearch routine visits the same elements of 't' and does the same comparisons in both cases. While the arguments differ between the two cases, both are passed by reference; no copy is requested.
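The two calling conventions can be sketched as follows. This is a minimal, unit-stride sketch; the signatures, overload structure, and dummy values are my assumptions, not taken from the attached m.chpl:

```chapel
// minimal sketch (unit stride assumed; names and values are illustrative)
var t: [1..27777] int;
const trightlow = 42, _lo = 10, cut = 2000;

proc binarySearch(const ref a: [] int, key: int, lo: int, hi: int): int {
  var l = lo, h = hi;
  while l <= h {
    const mid = (l + h) / 2;
    if a[mid] < key then l = mid + 1;
    else h = mid - 1;
  }
  return l;              // first position whose element is >= key
}

// overload that searches a (possibly sliced) array over its full domain
proc binarySearch(const ref a: [] int, key: int): int do
  return binarySearch(a, key, a.domain.low, a.domain.high);

// form 1: whole array plus explicit bounds -- no array view is built
const p1 = binarySearch(t, trightlow, _lo, cut);

// form 2: t[_lo..cut] materializes an array view on every call,
// which is the suspected source of the slicing overhead
const p2 = binarySearch(t[_lo .. cut], trightlow);
```

Both forms inspect the same elements; the difference is whether a slice expression (and hence an array view) is constructed at the call site.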
The sliced case runs much slower than the unsliced case. See below.
The program, and a shell script to drive it, are attached: m.chpl.txt testm.txt
Here are the options to the program, which sorts an array 't' of data.
// simple options
config const N = 10000; // number of times the program is run
config const V = 'n'; // verify TimSort results match RadixSort
config const U = 'y'; // pass unsliced or sliced 't' to binary search
config const M = 301; // number of elements in target array
config const S = 1; // stride of 't'
config const L = 1; // lowerbound of 't'
config const B = 6; // TimSort blocksize is 2^B
// mode is one of
//
// 't' TimSort
// 'r' RadixSort
config const mode = 't';
// I specifies that either:
//
// 'y' : use a reindex'ed reference to 't' i.e. as 1..#L
// 'n' : use original 't'
config const I = 'n';
// kind specifies the domain of the array to be sorted:
//
// 1 t[L .. by 1 #M]
// 2 t[L .. by 2 #M]
// x t[L .. by S #M] - specify S above
// u t[L .. #M]
//
config const kind = 'x';
Here are some results using the shell script provided:
Here ..tim.. (and .radix.) imply that TimSort (respectively RadixSort) was used.
UNSLICED

S=2:
 6.474 .radix. by S as-is     1..55554 by 2 Size = 27777
12.200 .radix. by S Reindexed 1..55554 by 2 Size = 27777
19.397 ..tim.. by S as-is     1..55554 by 2 Size = 27777
 4.863 ..tim.. by S Reindexed 1..55554 by 2 Size = 27777
19.245 ..tim.. by 2 as-is     1..55554 by 2 Size = 27777
 4.817 ..tim.. by 2 Reindexed 1..55554 by 2 Size = 27777
 4.912 ..tim.. unit as-is     1..27777      Size = 27777
19.256 ..tim.. by S as-is     1..55554 by 2 Size = 27777
 4.863 ..tim.. by S Reindexed 1..55554 by 2 Size = 27777
19.271 ..tim.. by 2 as-is     1..55554 by 2 Size = 27777
 4.943 ..tim.. by 2 Reindexed 1..55554 by 2 Size = 27777
 4.872 ..tim.. unit as-is     1..27777      Size = 27777

S=1:
15.802 .radix. by S as-is     1..27777      Size = 27777
12.134 .radix. by S Reindexed 1..27777      Size = 27777
18.778 ..tim.. by S as-is     1..27777      Size = 27777
 4.882 ..tim.. by S Reindexed 1..27777      Size = 27777
 4.962 ..tim.. by 1 as-is     1..27777      Size = 27777
 4.905 ..tim.. by 1 Reindexed 1..27777      Size = 27777
 4.834 ..tim.. unit as-is     1..27777      Size = 27777
18.906 ..tim.. by S as-is     1..27777      Size = 27777
 4.890 ..tim.. by S Reindexed 1..27777      Size = 27777
 4.835 ..tim.. by 1 as-is     1..27777      Size = 27777
 4.849 ..tim.. by 1 Reindexed 1..27777      Size = 27777
 4.845 ..tim.. unit as-is     1..27777      Size = 27777

SLICED

S=2:
 6.293 .radix. by S as-is     1..55554 by 2 Size = 27777
12.200 .radix. by S Reindexed 1..55554 by 2 Size = 27777
19.555 ..tim.. by S as-is     1..55554 by 2 Size = 27777
 7.244 ..tim.. by S Reindexed 1..55554 by 2 Size = 27777
19.259 ..tim.. by 2 as-is     1..55554 by 2 Size = 27777
 7.325 ..tim.. by 2 Reindexed 1..55554 by 2 Size = 27777
 4.988 ..tim.. unit as-is     1..27777      Size = 27777
19.414 ..tim.. by S as-is     1..55554 by 2 Size = 27777
 7.253 ..tim.. by S Reindexed 1..55554 by 2 Size = 27777
19.364 ..tim.. by 2 as-is     1..55554 by 2 Size = 27777
 7.869 ..tim.. by 2 Reindexed 1..55554 by 2 Size = 27777
 5.313 ..tim.. unit as-is     1..27777      Size = 27777

S=1:
15.169 .radix. by S as-is     1..27777      Size = 27777
12.083 .radix. by S Reindexed 1..27777      Size = 27777
18.975 ..tim.. by S as-is     1..27777      Size = 27777
 7.362 ..tim.. by S Reindexed 1..27777      Size = 27777
 5.023 ..tim.. by 1 as-is     1..27777      Size = 27777
 7.289 ..tim.. by 1 Reindexed 1..27777      Size = 27777
 5.046 ..tim.. unit as-is     1..27777      Size = 27777
19.135 ..tim.. by S as-is     1..27777      Size = 27777
 7.336 ..tim.. by S Reindexed 1..27777      Size = 27777
 5.025 ..tim.. by 1 as-is     1..27777      Size = 27777
 7.392 ..tim.. by 1 Reindexed 1..27777      Size = 27777
 5.089 ..tim.. unit as-is     1..27777      Size = 27777
Please note updated text, results and attached files (in the previous post).
If you disable every statement except the return in my custom binarySearch routine, it becomes only an approximation to the correct result, but it is still useful enough on an array of fewer than 50,000 elements for testing purposes. This shows that the overhead lies in the calling sequence itself when the array 't' is sliced to the indices relevant to the task being performed at this pass, i.e. when

binarySearch(t[_lo .. _hi by s], trightlow);

is the calling sequence.
The slowdown seen is about 50%, which I think means that the calling sequence, when it includes a slice, accounts for about 50% of the work involved in merging two sorted halves of an array, i.e. comparable to copying the array back into itself, which is not an insignificant amount of work in its own right. I do not know how slicing works internally, but as Brad noted in another post (which I cannot find), it may need some sort of dope-vector approach, which is the way Algol handled arrays all those years ago.
https://github.com/chapel-lang/chapel/pull/24390 will address some of the most common slice and rank-change operations in the context of whole-array operations.
As a more future-looking way of expanding on what that PR does, I'll capture some thoughts here. That PR introduces the type chpl__protoSlice, which is hidden from the user and only used by the compiler when it can determine that a statement is suitable for using proto-slices. In an ideal world (and without thinking too deeply about how one might implement something like this), we could make all slicing/rank-changing expressions create such a type instead of an array view, as they do today. Then, instances of that type could determine whether they need to be "upgraded" to a full-blown array view depending on the context in which they are used. Now, I don't think the objects themselves can make that decision, but something like it could be implemented as a coercion of sorts in the compiler.
Note that the current compiler implementation after that PR is more proactive: the compiler introduces values of that type based on very specific patterns. What I am suggesting is more reactive: values of that type would be created in all cases, but could then be coerced into full array views in some scenarios.
That's because, while I think there's a ton of room for improving upon that PR to make whole-array operations involving array views faster, expanding it to some of the cases mentioned in this issue takes a different compiler approach.
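To make the distinction concrete, here is a hedged example of the two kinds of slice uses involved; this reflects my reading of the PR's scope, not its actual test cases:

```chapel
var A, B: [1..100000] real;

// whole-array statement: a pattern the proto-slice optimization in
// #24390 can recognize, avoiding the cost of building two array views
A[1..50000] = B[50001..100000];

// slice passed to a user routine: the slice expression still
// materializes a full array view today, so calls like this
// (e.g. the sliced binarySearch above) keep their overhead
proc f(const ref x: [] real) { /* ... */ }
f(A[1..50000]);
```

The first form is a single statement the compiler can pattern-match; the second escapes into arbitrary code, which is why a more general, "reactive" mechanism would need a different approach.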
The performance of array slice and rank-change operations is not what users expect. Users coming from other languages (at least Fortran, and I think NumPy) expect these slice operations to be fast, but they have high overhead in Chapel today. I think we need to prioritize improving this situation, as it has been a show-stopper for some potential users. I'm creating this issue to help us track and prioritize this effort.
Related issues / posts:
#8203
#10763
#22756
Implementation ideas & workarounds:
#5368
#14530
#16133
#20749
#23890