Open defuse opened 8 years ago
The current code uses:
int64_t pivotIndex = lowerIdx + (upperIdx-lowerIdx)/2; /* TODO: Median method. But: This is good enough since the data is either sorted or randomly distributed. */
struct IndexEntry pivotValue = sortBuffer[pivotIndex];
That's non-optimal if there are a lot of repeated values, since it's likely they'll all get put into the second half.
The options seem to be 3-split quicksort:
https://stackoverflow.com/questions/5126586/quicksort-complexity-when-all-the-elements-are-same
Or randomize the choice of whether things equal to the pivot go left or right. This works because after the two halves are recursively sorted, all the pivot-equal values will be rightmost, and in the sorted right side all the pivot-equal values will be leftmost.
I took the randomized approach... It's running on the NTLM index now so we'll see if it's fast and produces the correct result. (Oops, I forgot to recompile it!)
taylor@defuse:~$ time ./crackstation-hashdb/sortidx -r 16384 cs_indexes/NTLM.idx
Index sort complete.
real 38m21.737s
user 9m38.260s
sys 22m34.124s
damn that's fast!
And... it's sorted according to checksort! It works!
The sorting of an NTLM index is taking forever. I'm guessing that's because there's tons of passwords with the same 7-character prefix in a row, and it's causing quicksort to run in n^2 time instead of nlogn.