STEllAR-GROUP / hpx

The C++ Standard Library for Parallelism and Concurrency
https://hpx.stellar-group.org
Boost Software License 1.0

The worst-case time complexity of parallel::sort seems to be O(N^2). #2836

Closed taeguk closed 4 years ago

taeguk commented 7 years ago

Currently, the worst-case time complexity of parallel::sort seems to be O(N^2).

https://github.com/STEllAR-GROUP/hpx/blob/d55dc1b220b6ba78a715a38779180f995b632ae2/hpx/parallel/algorithms/sort.hpp#L80-L89
https://github.com/STEllAR-GROUP/hpx/blob/d55dc1b220b6ba78a715a38779180f995b632ae2/hpx/parallel/algorithms/sort.hpp#L95-L97

Admittedly, thanks to the two code blocks above, the worst case is very unlikely in practice. But theoretically the algorithm can still be O(N^2) in the worst case. The usual way to guarantee O(N log N) is to switch to heap sort once the recursion depth exceeds a certain threshold. If we want to guarantee O(N log N), we should modify line 80 to

if (std::size_t(N) <= sort_limit_per_task || depth > recursion_depth_limit) 
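
To make the proposal concrete, here is a minimal self-contained sketch of the guarded dispatch (the names sort_limit_per_task and depth_limit are illustrative, not the actual HPX identifiers, and std::async stands in for HPX's task spawning):

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <iterator>

    template <typename Iter, typename Comp>
    void sort_task(Iter first, Iter last, Comp comp,
                   std::size_t depth, std::size_t depth_limit)
    {
        std::size_t const sort_limit_per_task = 65536;  // assumed threshold
        std::size_t n = std::distance(first, last);

        // Sequential fallback: the task is small enough OR the recursion
        // has degenerated. std::sort is an introsort, so the fallback
        // itself is O(n log n).
        if (n <= sort_limit_per_task || depth > depth_limit)
        {
            std::sort(first, last, comp);
            return;
        }

        // Simple middle pivot; the real code picks a median of several values.
        auto pivot = *(first + n / 2);
        Iter split = std::partition(first, last,
            [&](auto const& v) { return comp(v, pivot); });

        // Even if the split is maximally unbalanced, depth + 1 eventually
        // trips the guard above, bounding the total work at O(N log N).
        auto left = std::async(std::launch::async,
            [=] { sort_task(first, split, comp, depth + 1, depth_limit); });
        sort_task(split, last, comp, depth + 1, depth_limit);
        left.wait();
    }

With depth_limit set to about 2 * log2(N), the recursion can never go more than logarithmically deep before falling back to a sequential O(n log n) sort.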

But I'm worried about the performance impact of such a modification.

I want to hear your thoughts about this.

taeguk commented 7 years ago

@biddisco @hkaiser Since you originally developed this algorithm, I'd like to hear your thoughts.

hkaiser commented 7 years ago

@taeguk the algorithm was developed by Francisco (@fjtapia). John and I have just adapted it to HPX.

biddisco commented 7 years ago

@taeguk - Tell me: what are the rare circumstances where std::size_t(N) <= sort_limit_per_task could be false, but depth > recursion_depth_limit could be true?

Since the list is subdivided on each recursion - surely the list will always get shortened (by roughly 1/2) - are you implying that there are cases where the list might be split on recursion, but by some factor that is way off 1/2 - and (say) only one element goes to one half of the tree and N-1 go to the other?

If that is the case, then yes, changing the switchover point by adding the recursion depth would help.

taeguk commented 7 years ago

and (say) only one element goes to one half of the tree and N-1 go to the other? If that is the case, then yes, changing the switchover point by adding the recursion depth would help.

@biddisco Yes, that is the case I meant. To guarantee O(N log N) time complexity, we have to limit the recursion depth. But what concerns me is the performance impact of that limit. In your view, is there a chance performance will decrease because of it? What should we do - leave parallel::sort as it is, or limit the recursion depth to guarantee the time complexity?

biddisco commented 7 years ago

My suggestion would be to run the algorithm over a bunch of test datasets and see if the recursion depth ever significantly exceeds the nominal threshold required for N log N behaviour. If it happens frequently - then yes, add a limit. In my own tests I have not yet encountered problems of that kind (that doesn't mean it can't happen though).

biddisco commented 7 years ago

Alternatively - add the extra check and verify that normal tests don't run any more slowly.

fjtapia commented 7 years ago

Hi,

I have been reading the code of parallel_sort, and it does have a worst case of O(N²).

The way to prevent this is to add a counter holding the permitted number of subdivisions. On each division, decrement the counter, and when it reaches zero, sort the range with introsort.

Introsort itself has such a counter internally: it begins filtering and dividing with quicksort, and when the counter reaches zero it switches to heapsort, which has a worst case of O(N log N) but is slower than quicksort.
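
As a rough sketch of that pattern (a hand-written illustration, not the HPX or Boost sources):

    #include <algorithm>
    #include <cstddef>

    template <typename Iter, typename Comp>
    void introsort_loop(Iter first, Iter last, Comp comp, std::size_t level)
    {
        while (last - first > 16)          // tiny ranges: insertion sort below
        {
            if (level == 0)                // quicksort degenerated: heapsort
            {
                std::make_heap(first, last, comp);
                std::sort_heap(first, last, comp);
                return;
            }
            --level;

            // median-of-three pivot selection
            Iter mid  = first + (last - first) / 2;
            Iter back = last - 1;
            if (comp(*mid, *first)) std::iter_swap(mid, first);
            if (comp(*back, *mid))  std::iter_swap(back, mid);
            if (comp(*mid, *first)) std::iter_swap(mid, first);
            auto pivot = *mid;

            Iter cut = std::partition(first, last,
                [&](auto const& v) { return comp(v, pivot); });

            introsort_loop(cut, last, comp, level);  // recurse on one half
            last = cut;                              // loop on the other
        }
        // finish the remaining small range with insertion sort
        for (Iter i = first; i != last; ++i)
            for (Iter j = i; j != first && comp(*j, *(j - 1)); --j)
                std::iter_swap(j, j - 1);
    }

The caller initializes level to about 2 * log2(last - first); even if every partition is maximally unbalanced, the heapsort branch caps the total cost at O(N log N).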

Two years ago, when I wrote the sorting code, only a few lines of it were used in the current HPX software.

But a smart mindset looks for the solution, and the solution has existed for a year now, as John Biddiscombe knows. In my repository

https://github.com/fjtapia/HPX_sort you can find the code of all the sorting algorithms and their documentation. The same code was recently accepted into the Boost library.

To use it, add #include <hpx/parallel/sort/sort.hpp>

This version of parallel sort has a worst case of O(N log N) and resolves minor problems of the previous versions. I have been testing it tonight, before showing it to you.

You can find a benchmark of these algorithms in the file attached to this message. The parallel sort is notably faster than Intel Threading Building Blocks.

Another solution is to change the current code and add a subdivision counter. In the file include/hpx/parallel/sort/detail/bis/parallel_sort.hpp you can see a simple and clear implementation of the subdivision counter.

If you are interested in using my code, I would appreciate it if someone helped me with the development and optimization of the algorithm.

In the current algorithms, I need threads which use previously initialized thread-local memory, but this feature is not in the C++ standard. Because of this, the implementation creates its own threads and stores the work items in a stack, where the threads insert and extract work until the stack is empty.

This model runs well on a machine where all the threads share memory, but I am not sure how it behaves on a group of machines linked by a network. If someone could run several benchmarks on HPX with a small number of machines connected by a network, that would be very useful.

I would also like to know some internal details of HPX. When you spawn a new thread, is there any preference for running it close to the thread which created it? This can be important for minimizing data traffic between machines and improving cache performance.

Sometimes I feel blind, as with the previous version of the algorithms: trying to do this with HPX threads, I generated a pool of initialized buffers in an atomic queue, and each thread, on starting, extracted one from it. There were as many buffers as threads, but sometimes a thread made more than 1000 attempts before obtaining a buffer. On a single machine the performance was about 10% slower, but running on a group of machines the penalty was around 50%, as John Biddiscombe told me.

If there is no interest in my code, please tell me; I will put it aside and remain an admirer of the project, applauding your work.

Yours

Francisco Tapia

****************************************************************
**                                                            **
**         H P X : : P A R A L L E L : : S O R T              **
**                                                            **
**                   B E N C H M A R K                        **
**                                                            **
****************************************************************

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
Stepping:              2
CPU MHz:               1397.625
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              6600.02
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-11
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts

  100000000 uint64_t elements already sorted
=================================================
GCC std::sort                : 2.65679 secs
HPX sort                     : 0.0856454 secs

GCC std::stable_sort         : 5.24768 secs
HPX  stable_sort             : 0.0713636 secs

TBB parallel_sort            : 0.0458935 secs
HPX parallel_sort            : 0.0876892 secs

TBB parallel stable sort     : 0.670789 secs
HPX parallel stable sort     : 0.066981 secs
HPX sample sort              : 0.0838082 secs

  100000000 uint64_t elements randomly filled
=================================================
GCC std::sort                : 9.27454 secs
HPX sort                     : 8.63995 secs

GCC std::stable_sort         : 8.94908 secs
HPX  stable_sort             : 9.06915 secs

TBB parallel_sort            : 1.8426 secs
HPX parallel_sort            : 1.10332 secs

TBB parallel stable sort     : 2.25538 secs
HPX parallel stable sort     : 2.0783 secs
HPX sample sort              : 2.02486 secs

  10000000 strings randomly filled
===============================================
GCC std::sort                : 7.55374 secs
HPX sort                     : 8.04349 secs

GCC std::stable_sort         : 13.5451 secs
HPX  stable_sort             : 9.32374 secs

TBB parallel_sort            : 2.172 secs
HPX parallel_sort            : 1.37218 secs

TBB parallel stable sort     : 2.46657 secs
HPX parallel stable sort     : 3.15335 secs
HPX sample sort              : 2.4512 secs

================================================================
=                OBJECT COMPARISON                             =
=              ---------------------                           =
=                                                              =
= The objects are arrays of 64 bits numbers                    =
= They are compared in two ways :                              =
=    (H) Heavy : The comparison is the sum of all the numbers  =
=                of the array                                  =
=    (L) Light : The comparison is with the first element of   =
=                the array, as a key                           =
=                                                              =
================================================================

100000000 elements of size 8 randomly filled 
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 9.24636 secs
HPX sort                     : 9.04937 secs

GCC std::stable_sort         : 10.9146 secs
HPX  stable_sort             : 9.34338 secs

TBB parallel_sort            : 1.83879 secs
HPX parallel_sort            : 1.17504 secs

TBB parallel stable sort     : 2.35981 secs
HPX parallel stable sort     : 2.54364 secs
HPX sample sort              : 2.07547 secs

  L I G H T   C O M P A R I S O N 
=======================================
GCC std::sort                : 9.61008 secs
HPX sort                     : 9.24874 secs

GCC std::stable_sort         : 11.2254 secs
HPX  stable_sort             : 9.33069 secs

TBB parallel_sort            : 1.89402 secs
HPX parallel_sort            : 1.5102 secs

TBB parallel stable sort     : 2.2112 secs
HPX parallel stable sort     : 2.54365 secs
HPX sample sort              : 2.0241 secs

50000000 elements of size 16 randomly filled 
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 5.09317 secs
HPX sort                     : 4.8348 secs

GCC std::stable_sort         : 6.1044 secs
HPX  stable_sort             : 5.6755 secs

TBB parallel_sort            : 1.04078 secs
HPX parallel_sort            : 0.696541 secs

TBB parallel stable sort     : 1.45137 secs
HPX parallel stable sort     : 1.53507 secs
HPX sample sort              : 1.34131 secs

  L I G H T   C O M P A R I S O N 
=======================================
GCC std::sort                : 5.04977 secs
HPX sort                     : 4.76086 secs

GCC std::stable_sort         : 5.80362 secs
HPX  stable_sort             : 5.05566 secs

TBB parallel_sort            : 0.963915 secs
HPX parallel_sort            : 0.653442 secs

TBB parallel stable sort     : 1.34845 secs
HPX parallel stable sort     : 1.31352 secs
HPX sample sort              : 1.20746 secs

25000000 elements of size 32 randomly filled 
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 3.15829 secs
HPX sort                     : 2.95424 secs

GCC std::stable_sort         : 4.0104 secs
HPX  stable_sort             : 3.3834 secs

TBB parallel_sort            : 0.615975 secs
HPX parallel_sort            : 0.697791 secs

TBB parallel stable sort     : 0.962468 secs
HPX parallel stable sort     : 1.07537 secs
HPX sample sort              : 0.891413 secs

  L I G H T   C O M P A R I S O N 
=======================================
GCC std::sort                : 2.85442 secs
HPX sort                     : 2.61642 secs

GCC std::stable_sort         : 3.58202 secs
HPX  stable_sort             : 2.82125 secs

TBB parallel_sort            : 0.645143 secs
HPX parallel_sort            : 0.48376 secs

TBB parallel stable sort     : 0.919681 secs
HPX parallel stable sort     : 0.925216 secs
HPX sample sort              : 0.776533 secs

12500000 elements of size 64 randomly filled 
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 2.47338 secs
HPX sort                     : 2.21567 secs

GCC std::stable_sort         : 3.28297 secs
HPX  stable_sort             : 2.74118 secs

TBB parallel_sort            : 0.589234 secs
HPX parallel_sort            : 0.465219 secs

TBB parallel stable sort     : 0.901908 secs
HPX parallel stable sort     : 0.914365 secs
HPX sample sort              : 0.772351 secs

  L I G H T   C O M P A R I S O N 
=======================================
GCC std::sort                : 1.77783 secs
HPX sort                     : 1.67127 secs

GCC std::stable_sort         : 2.81426 secs
HPX  stable_sort             : 2.17918 secs

TBB parallel_sort            : 0.482603 secs
HPX parallel_sort            : 0.418259 secs

TBB parallel stable sort     : 0.792557 secs
HPX parallel stable sort     : 0.802242 secs
HPX sample sort              : 0.689103 secs

6250000 elements of size 128 randomly filled 
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 2.09116 secs
HPX sort                     : 1.84758 secs

GCC std::stable_sort         : 2.86589 secs
HPX  stable_sort             : 2.27605 secs

TBB parallel_sort            : 0.518487 secs
HPX parallel_sort            : 0.434963 secs

TBB parallel stable sort     : 0.839052 secs
HPX parallel stable sort     : 0.853302 secs
HPX sample sort              : 0.745378 secs

  L I G H T   C O M P A R I S O N 
=======================================
GCC std::sort                : 1.47552 secs
HPX sort                     : 1.41978 secs

GCC std::stable_sort         : 2.64545 secs
HPX  stable_sort             : 1.84815 secs

TBB parallel_sort            : 0.471656 secs
HPX parallel_sort            : 0.404039 secs

TBB parallel stable sort     : 0.739945 secs
HPX parallel stable sort     : 0.776035 secs
HPX sample sort              : 0.705195 secs

3125000 elements of size 256 randomly filled 
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 2.09095 secs
HPX sort                     : 2.09735 secs

GCC std::stable_sort         : 2.85211 secs
HPX  stable_sort             : 2.44499 secs

TBB parallel_sort            : 0.567121 secs
HPX parallel_sort            : 0.472408 secs

TBB parallel stable sort     : 0.834263 secs
HPX parallel stable sort     : 0.900596 secs
HPX sample sort              : 0.780577 secs

  L I G H T   C O M P A R I S O N 
=======================================
GCC std::sort                : 1.72776 secs
HPX sort                     : 1.74781 secs

GCC std::stable_sort         : 2.49251 secs
HPX  stable_sort             : 1.99043 secs

TBB parallel_sort            : 0.447485 secs
HPX parallel_sort            : 0.446629 secs

TBB parallel stable sort     : 0.773182 secs
HPX parallel stable sort     : 0.810118 secs
HPX sample sort              : 0.689863 secs

1562500 elements of size 512 randomly filled 
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 2.67612 secs
HPX sort                     : 2.62147 secs

GCC std::stable_sort         : 2.99261 secs
HPX  stable_sort             : 2.66403 secs

TBB parallel_sort            : 0.616895 secs
HPX parallel_sort            : 0.561857 secs

TBB parallel stable sort     : 0.859935 secs
HPX parallel stable sort     : 0.997386 secs
HPX sample sort              : 0.779245 secs

  L I G H T   C O M P A R I S O N 
=======================================
GCC std::sort                : 1.55995 secs
HPX sort                     : 1.52954 secs

GCC std::stable_sort         : 2.43486 secs
HPX  stable_sort             : 1.87303 secs

TBB parallel_sort            : 0.453259 secs
HPX parallel_sort            : 0.430468 secs

TBB parallel stable sort     : 0.759556 secs
HPX parallel stable sort     : 0.779795 secs
HPX sample sort              : 0.631015 secs
taeguk commented 7 years ago

@fjtapia Thank you for your reply, and sorry for my late response. I read https://github.com/fjtapia/HPX_sort/blob/master/block_indirect_sort_en.pdf. Your idea looks awesome! I want to adopt your improved parallel sorting algorithms to resolve this issue and get better performance, though I would have to re-implement them in an HPX-style way. But before I consider your idea further, I have to ask: may I use your idea for HPX?

fjtapia commented 7 years ago

Hi Taeguk,

Thanks for your interest and your comments.

The problem of the O(N²) worst case is resolved. I sent the modified file to Hartmut Kaiser some weeks ago; if you want to take a look, it is the first file attached to this message.

As for replacing the current code with block_indirect_sort, it is not as easy as it appears.

First, there must be interest in the change, and if there is, there is a long way to go and hard work to do.

Block_indirect_sort uses threads which reuse initialized memory. This memory is initialized and destroyed only once, and reused many times. The C++ standard does not support this, so we implement our own thread pool with reused memory.

In a thread pool, when you submit new work, the function to execute and its parameters are encapsulated and stored until a thread of the pool is free; that thread then runs the encapsulated work until it finishes and becomes free again.

In the thread pool of block_indirect_sort, when the HPX threads of the pool are created, the memory buffers are initialized in a free state. Each thread then executes the stored encapsulated work; the main difference is that this work can access the buffer created during initialization. When an HPX thread of the pool is destroyed, its buffer is destroyed too.
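
A very rough sketch of this scheme, using standard threads instead of HPX threads (all names here are invented for the illustration): each worker allocates its buffer once, then reuses it for every work item it pops from the shared stack.

    #include <condition_variable>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <stack>
    #include <thread>
    #include <vector>

    class buffered_pool
    {
        // a work item receives the worker's pre-initialized scratch buffer
        using task = std::function<void(std::vector<char>&)>;

        std::stack<task> work_;
        std::mutex mtx_;
        std::condition_variable cv_;
        bool done_ = false;
        std::vector<std::thread> workers_;

    public:
        buffered_pool(std::size_t nthreads, std::size_t buf_size)
        {
            for (std::size_t i = 0; i != nthreads; ++i)
                workers_.emplace_back([this, buf_size] {
                    std::vector<char> buffer(buf_size);   // initialized once
                    for (;;)
                    {
                        task t;
                        {
                            std::unique_lock<std::mutex> lk(mtx_);
                            cv_.wait(lk,
                                [this] { return done_ || !work_.empty(); });
                            if (work_.empty()) return;    // done and drained
                            t = std::move(work_.top());
                            work_.pop();
                        }
                        t(buffer);   // the same buffer is reused every time
                    }
                });
        }

        void submit(task t)
        {
            { std::lock_guard<std::mutex> lk(mtx_); work_.push(std::move(t)); }
            cv_.notify_one();
        }

        ~buffered_pool()
        {
            { std::lock_guard<std::mutex> lk(mtx_); done_ = true; }
            cv_.notify_all();
            for (auto& w : workers_) w.join();
        }
    };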

Doing this with a thread pool presents several problems and several alternatives.

A year and a half ago, I did an implementation with a pool of initialized buffers: each thread, on starting, took a buffer from the pool, and on finishing returned it. There were as many buffers as threads. I don't know the internal mechanism of thread creation in HPX, but some threads needed more than 1000 attempts to obtain a buffer.

Running on a single machine, this had a penalty of around 10% over the previous version. But when John Biddiscombe ran it over a network of machines, the penalty was greater than 40%.

If there is interest in the change, I have in mind a version where each thread creates and initializes its own buffer and destroys it when it finishes. This inserts a penalty in each thread, which I expect to be lower than 10%.

But these things only hide the big problem. The algorithms contain two kinds of work: computation and data movement.

When the algorithm runs on a single machine, the data movement happens in the machine's internal memory, and the time consumed is very low. But when you run over a network of machines, data movement can ruin the performance of the algorithm. The configuration makes a big difference: machines connected by a triple 40G InfiniBand bus are not the same as machines connected by a single 1G Ethernet.

We need to make tests and time measurements on different machines. If this problem appears, there are internal parameters and strategies of the algorithm which can be used to mitigate it.

In any case, we will need the collaboration of HPX users to run a predefined test on their machines and to compare and evaluate the results.

Then, and only if the results justify it, will it be the moment to replace the current internal parallel sort algorithm with block_indirect_sort.

The goal is to provide HPX with the best algorithms possible. Many parts of this are new and unexplored, but HPX can have them thanks to the hard work going into the development and debugging of the algorithm.

In a few weeks I will update the repository with a new version that corrects some failures and runs all the algorithms in the namespace hpx::parallel::tr1, so they can run alongside the current code for comparison. The second file attached to this message contains the results of this new version on a single benchmark, comparing the proposed code (HPX tr1::parallel_sort) with the current code (HPX parallel_sort).

Yours

Francisco




****************************************************************
**                                                            **
**         H P X : : P A R A L L E L : : S O R T              **
**                                                            **
**                   B E N C H M A R K                        **
**                                                            **
****************************************************************

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
Stepping:              2
CPU MHz:               1392.791
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              6599.92
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-11
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts

  100000000 uint64_t elements already sorted
=================================================
GCC std::sort                : 2.60241 secs
HPX sort                     : 0.087891 secs

GCC std::stable_sort         : 5.28991 secs
HPX stable_sort              : 0.0691075 secs

TBB parallel_sort            : 0.037605 secs
HPX tr1::parallel_sort       : 0.0749623 secs
HPX parallel_sort            : 0.0647351 secs

TBB parallel stable sort     : 0.706968 secs
HPX parallel stable sort     : 0.0838772 secs
HPX sample sort              : 0.0702617 secs

  100000000 uint64_t elements randomly filled
=================================================
GCC std::sort                : 9.02531 secs
HPX sort                     : 8.68727 secs

GCC std::stable_sort         : 8.98621 secs
HPX stable_sort              : 9.02838 secs

TBB parallel_sort            : 1.763 secs
HPX tr1::parallel_sort       : 1.09071 secs
HPX parallel_sort            : 1.55907 secs

TBB parallel stable sort     : 2.07174 secs
HPX parallel stable sort     : 2.41496 secs
HPX sample sort              : 1.97174 secs

  10000000 strings randomly filled
===============================================
GCC std::sort                : 7.44621 secs
HPX sort                     : 7.92181 secs

GCC std::stable_sort         : 13.3771 secs
HPX stable_sort              : 9.19583 secs

TBB parallel_sort            : 2.13058 secs
HPX tr1::parallel_sort       : 1.33244 secs
HPX parallel_sort            : 2.59292 secs

TBB parallel stable sort     : 2.34349 secs
HPX parallel stable sort     : 3.12607 secs
HPX sample sort              : 2.48383 secs

================================================================
=                OBJECT COMPARISON                             =
=              ---------------------                           =
=                                                              =
= The objects are arrays of 64 bits numbers                    =
= They are compared in two ways :                              =
=    (H) Heavy : The comparison is the sum of all the numbers  =
=                of the array                                  =
=    (L) Light : The comparison is with the first element of   =
=                the array, as a key                           =
=                                                              =
================================================================

100000000 elements of size 8 randomly filled
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 9.05719 secs
HPX sort                     : 8.91211 secs

GCC std::stable_sort         : 10.9941 secs
HPX stable_sort              : 9.12832 secs

TBB parallel_sort            : 1.84943 secs
HPX tr1::parallel_sort       : 1.49937 secs
HPX parallel_sort            : 1.5884 secs

TBB parallel stable sort     : 2.26178 secs
HPX parallel stable sort     : 2.52649 secs
HPX sample sort              : 2.0419 secs

  L I G H T   C O M P A R I S O N
=======================================
GCC std::sort                : 9.60662 secs
HPX sort                     : 8.95858 secs

GCC std::stable_sort         : 11.1955 secs
HPX stable_sort              : 9.2155 secs

TBB parallel_sort            : 1.82679 secs
HPX tr1::parallel_sort       : 1.11486 secs
HPX parallel_sort            : 1.65892 secs

TBB parallel stable sort     : 2.17865 secs
HPX parallel stable sort     : 2.52737 secs
HPX sample sort              : 2.02748 secs

50000000 elements of size 16 randomly filled
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 5.16509 secs
HPX sort                     : 4.88129 secs

GCC std::stable_sort         : 6.08995 secs
HPX stable_sort              : 5.5444 secs

TBB parallel_sort            : 0.973186 secs
HPX tr1::parallel_sort       : 0.686343 secs
HPX parallel_sort            : 0.984494 secs

TBB parallel stable sort     : 1.4443 secs
HPX parallel stable sort     : 1.53522 secs
HPX sample sort              : 1.39987 secs

  L I G H T   C O M P A R I S O N
=======================================
GCC std::sort                : 5.01757 secs
HPX sort                     : 4.6719 secs

GCC std::stable_sort         : 5.72705 secs
HPX stable_sort              : 4.97378 secs

TBB parallel_sort            : 0.927114 secs
HPX tr1::parallel_sort       : 0.644089 secs
HPX parallel_sort            : 0.823175 secs

TBB parallel stable sort     : 1.32537 secs
HPX parallel stable sort     : 1.30084 secs
HPX sample sort              : 1.14904 secs

25000000 elements of size 32 randomly filled
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 3.23151 secs
HPX sort                     : 2.96121 secs

GCC std::stable_sort         : 4.05332 secs
HPX stable_sort              : 3.4457 secs

TBB parallel_sort            : 0.661461 secs
HPX tr1::parallel_sort       : 0.515677 secs
HPX parallel_sort            : 0.618428 secs

TBB parallel stable sort     : 1.00107 secs
HPX parallel stable sort     : 1.09945 secs
HPX sample sort              : 0.901703 secs

  L I G H T   C O M P A R I S O N
=======================================
GCC std::sort                : 2.81681 secs
HPX sort                     : 2.58426 secs

GCC std::stable_sort         : 3.60017 secs
HPX stable_sort              : 2.7965 secs

TBB parallel_sort            : 0.533744 secs
HPX tr1::parallel_sort       : 0.494393 secs
HPX parallel_sort            : 0.583845 secs

TBB parallel stable sort     : 0.874329 secs
HPX parallel stable sort     : 0.920968 secs
HPX sample sort              : 0.7335 secs

12500000 elements of size 64 randomly filled
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 2.47829 secs
HPX sort                     : 2.22867 secs

GCC std::stable_sort         : 3.27704 secs
HPX stable_sort              : 2.72488 secs

TBB parallel_sort            : 0.619403 secs
HPX tr1::parallel_sort       : 0.469134 secs
HPX parallel_sort            : 0.555556 secs

TBB parallel stable sort     : 0.845072 secs
HPX parallel stable sort     : 0.919527 secs
HPX sample sort              : 0.792887 secs

  L I G H T   C O M P A R I S O N
=======================================
GCC std::sort                : 1.79291 secs
HPX sort                     : 1.65653 secs

GCC std::stable_sort         : 2.81221 secs
HPX stable_sort              : 2.165 secs

TBB parallel_sort            : 0.499499 secs
HPX tr1::parallel_sort       : 0.431813 secs
HPX parallel_sort            : 0.474852 secs

TBB parallel stable sort     : 0.778591 secs
HPX parallel stable sort     : 0.830544 secs
HPX sample sort              : 0.685677 secs

6250000 elements of size 128 randomly filled
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 2.09258 secs
HPX sort                     : 1.83383 secs

GCC std::stable_sort         : 2.83864 secs
HPX stable_sort              : 2.22892 secs

TBB parallel_sort            : 0.534752 secs
HPX tr1::parallel_sort       : 0.433769 secs
HPX parallel_sort            : 0.487857 secs

TBB parallel stable sort     : 0.817399 secs
HPX parallel stable sort     : 0.849965 secs
HPX sample sort              : 0.736171 secs

  L I G H T   C O M P A R I S O N
=======================================
GCC std::sort                : 1.45487 secs
HPX sort                     : 1.42989 secs

GCC std::stable_sort         : 2.61524 secs
HPX stable_sort              : 1.8462 secs

TBB parallel_sort            : 0.471152 secs
HPX tr1::parallel_sort       : 0.3896 secs
HPX parallel_sort            : 0.435696 secs

TBB parallel stable sort     : 0.73162 secs
HPX parallel stable sort     : 0.787419 secs
HPX sample sort              : 0.63329 secs

3125000 elements of size 256 randomly filled
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 1.47963 secs
HPX sort                     : 1.46861 secs

GCC std::stable_sort         : 2.49858 secs
HPX stable_sort              : 1.90743 secs

TBB parallel_sort            : 0.439718 secs
HPX tr1::parallel_sort       : 0.394152 secs
HPX parallel_sort            : 0.450641 secs

TBB parallel stable sort     : 0.760943 secs
HPX parallel stable sort     : 0.799024 secs
HPX sample sort              : 0.64447 secs

  L I G H T   C O M P A R I S O N
=======================================
GCC std::sort                : 1.04243 secs
HPX sort                     : 1.05095 secs

GCC std::stable_sort         : 2.4425 secs
HPX stable_sort              : 1.6778 secs

TBB parallel_sort            : 0.370537 secs
HPX tr1::parallel_sort       : 0.359533 secs
HPX parallel_sort            : 0.417489 secs

TBB parallel stable sort     : 0.734163 secs
HPX parallel stable sort     : 0.725782 secs
HPX sample sort              : 0.664722 secs

1562500 elements of size 512 randomly filled
=============================================

  H E A V Y   C O M P A R I S O N
====================================
GCC std::sort                : 2.25914 secs
HPX sort                     : 2.21181 secs

GCC std::stable_sort         : 2.69855 secs
HPX stable_sort              : 2.29027 secs

TBB parallel_sort            : 0.547796 secs
HPX tr1::parallel_sort       : 0.444853 secs
HPX parallel_sort            : 0.600206 secs

TBB parallel stable sort     : 0.801792 secs
HPX parallel stable sort     : 0.882202 secs
HPX sample sort              : 0.724445 secs

  L I G H T   C O M P A R I S O N
=======================================
GCC std::sort                : 1.01165 secs
HPX sort                     : 0.940796 secs

GCC std::stable_sort         : 2.50147 secs
HPX stable_sort              : 1.68217 secs

TBB parallel_sort            : 0.364049 secs
HPX tr1::parallel_sort       : 0.343097 secs
HPX parallel_sort            : 0.391673 secs

TBB parallel stable sort     : 0.684558 secs
HPX parallel stable sort     : 0.744336 secs
HPX sample sort              : 0.611427 secs

taeguk commented 7 years ago

@fjtapia

Running on a single machine, this had a penalty of around 10% over the previous version.

1. Does this penalty mean the overhead incurred by porting your original block_indirect_sort to HPX? That is, is there no such penalty in the original block_indirect_sort?

2. I think the code at https://github.com/fjtapia/HPX_sort/blob/8c00d770f3e831f4914196c9386ff930e9c0b1de/include/hpx/parallel/sort/detail/bis/backbone.hpp#L167-L170 causes frequent thread switching through std::this_thread::yield(), which is very harmful for performance. What do you think about it? (See the sketch after this list.)

3. HPX has separate implementations of the parallel algorithms for a single machine and for a network (/hpx/parallel/algorithms/ for a single machine, /hpx/parallel/segmented_algorithms for a network). So I think it is better to first implement the algorithm running on a single machine only, and then consider running over a network.

4. How can I view the attached file? I'm reading your reply on GitHub, and I can't find any attached file there.
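
For reference on point 2, the pattern in question is a spin-wait of roughly this shape (a paraphrase for illustration, not the exact HPX_sort code):

    #include <atomic>
    #include <thread>

    // Every probe that finds no pending work gives up the rest of its time
    // slice; under contention this turns into a stream of context switches.
    void wait_for_work(std::atomic<bool> const& work_available)
    {
        while (!work_available.load(std::memory_order_acquire))
            std::this_thread::yield();
    }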

shantanudwvd commented 6 years ago

I want to work on this. Can anyone guide me, as I'm new to this?

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed. Please re-open if necessary.

shantanudwvd commented 5 years ago

Hi, how was this issue resolved?


hkaiser commented 5 years ago

How was this issue resolved?

@shantanudwvd it was resolved as 'Won't Fix' as there was no activity for more than 6 months. If you are interested in helping to resolve the issue (or it is otherwise dear to you), please re-open it (as the comment above says).

fjtapia commented 5 years ago

Currently the worst case is still O(N²).

The problem is solved, but the fix is not in the current version. In this thread, on September 16, 2017, I answered this question and attached a file with the code resolving the problem.

If you have any other question, please contact me.

Francisco Tapia

hkaiser commented 5 years ago

@fjtapia Francisco, would you mind re-posting the reasoning you shared back then here, please?

fjtapia commented 5 years ago

I am out of Madrid for a few days. Next week I will prepare and send you the corrections.

fjtapia commented 5 years ago

The HPX parallel sort is an implementation of a parallel introsort, which internally is quicksort plus heapsort. Quicksort works by finding an element, called the pivot, and filtering the elements against it: one part holds the elements smaller than the pivot, the other the greater ones. In this way it divides the elements to sort into two parts. Ideally, the two parts should have the same or very similar sizes.

We can represent the filtering process as a tree. If all the partitions are perfect, we have a tree with nlevels levels, where 2^nlevels >= N and N is the number of elements to sort. With perfect partitions, the algorithm is O(N log N).

But not everything is perfect, and the worst case can occur: a partition yields one part of N - 1 elements and another of 1 element. If this happens in every partition we are in the worst case, and the algorithm is O(N²).

To prevent this, we take nlevels, the number of levels a tree of perfect partitions would have, and if the recursion level reaches double that, we change to heapsort. Heapsort is O(N log N) in all cases, but in practice it is 3 or 4 times slower than quicksort.

In the parallel part of the algorithm, we make partitions and assign them to threads. When the number of elements is smaller than 2^16 = 65536, we sort with 1 thread. Accordingly, the parallel part has nlevels - 16 levels, and if the recursion reaches double this value, we change to single-threaded introsort.
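
In numbers, a sketch of those limits (the helper names below are mine, for illustration):

    #include <cstddef>

    // smallest L such that 2^L >= N: the depth of a perfectly balanced
    // partition tree over N elements
    inline std::size_t nlevels(std::size_t N)
    {
        std::size_t L = 0;
        while ((std::size_t(1) << L) < N)
            ++L;
        return L;
    }

    // depth at which the parallel part falls back to single-threaded
    // introsort: twice the number of "perfect" parallel levels remaining
    // above the 2^16-element sequential cutoff
    inline std::size_t parallel_depth_limit(std::size_t N)
    {
        std::size_t levels = nlevels(N);
        std::size_t parallel_levels = levels > 16 ? levels - 16 : 1;
        return 2 * parallel_levels;
    }

For example, with N = 100000000 we get nlevels = 27, so the parallel part has 27 - 16 = 11 levels and falls back to introsort at depth 22; the introsort in turn applies its own doubled-nlevels bound before switching to heapsort.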

In the implementation of introsort we select 3 values, sort them, and take the middle one as the pivot. In this parallel version we select the pivot from among 9 values, to improve the pivot selection and the partition.

In the zip file you can find hpx/parallel/algorithm/sort.hpp and a folder with several test programs, some of them included in HPX, plus the file test_hpx_sort.cpp, which tests the algorithm against several data sets.

If you see any error or anything missing, please tell me and I will correct it as soon as possible.

Yours

Francisco Tapia

hkaiser commented 5 years ago

Thanks Francisco! Would you be able to post the link to the file you are referring to somewhere?

fjtapia commented 5 years ago

Unzip the file attached to my message, and you will find the sort.hpp file and a folder with the test programs used to check the correctness of the modifications.

hkaiser commented 5 years ago

@fjtapia Francisco, unfortunately, GitHub does not post attached files to the comments for a ticket. Would you mind uploading the files somewhere (gist?) and post the link here instead? Thanks!

fjtapia commented 5 years ago

Sorry, I didn't know that about GitHub. I have put it in my Google Drive.

The link is https://drive.google.com/drive/folders/1v-I9I9c0GKB5ruml8rTKFKpkTsi3FGCE?usp=sharing

If you have any problem, tell me and I will try another option.

hkaiser commented 5 years ago

Got it! Thanks a lot!

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed. Please re-open if necessary.