defuse / crackstation-hashdb

CrackStation.net's Lookup Table Implementation.
GNU General Public License v3.0
368 stars 107 forks source link

NTLM sort takes forever #6

Open defuse opened 8 years ago

defuse commented 8 years ago

The sorting of an NTLM index is taking forever. I'm guessing that's because there's tons of passwords with the same 7-character prefix in a row, and it's causing quicksort to run in n^2 time instead of nlogn.

defuse commented 8 years ago

The current code uses:

int64_t pivotIndex = lowerIdx + (upperIdx-lowerIdx)/2; /* TODO: Median method. But: This is good enough since the data is either sorted or randomly distributed. */
    struct IndexEntry pivotValue = sortBuffer[pivotIndex];

That's non-optimal if there are a lot of repeated values, since it's likely they'll all get put into the second half.

defuse commented 8 years ago

The options seem to be 3-split quicksort:

https://stackoverflow.com/questions/5126586/quicksort-complexity-when-all-the-elements-are-same

Or randomize the choice of whether things equal to the pivot go left or right. This works because after the two halves are recursively sorted, all the pivot-equal values will be rightmost, and in the sorted right side all the pivot-equal values will be leftmost.

defuse commented 8 years ago

I took the randomized approach... It's running on the NTLM index now so we'll see if it's fast and produces the correct result. (Oops, I forgot to recompile it!)

defuse commented 8 years ago
taylor@defuse:~$ time ./crackstation-hashdb/sortidx -r 16384 cs_indexes/NTLM.idx
Index sort complete.

real    38m21.737s
user    9m38.260s
sys     22m34.124s

damn that's fast!

defuse commented 8 years ago

And... it's sorted according to checksort! It works!