BonzaiThePenguin / WikiSort

Fast and stable sort algorithm that uses O(1) memory. Public domain.
The Unlicense
1.27k stars 95 forks source link

Gah, why is the Java version so slow? #22

Open BonzaiThePenguin opened 10 years ago

BonzaiThePenguin commented 10 years ago

This has been bothering me for a while, but the Java version is nowhere near the performance of a standard merge sort. I spent most of yesterday removing all class allocations (no Range, no Pull, and no Iterator), then when that didn't help I started ripping out chunks of code bit by bit, and eventually I got it down to a standard bottom-up merge sort and realized it was still a lot slower than a standard recursive merge sort. This doesn't make any sense.

There has to be something weird about this Java VM's internal design that causes it to not like the way the items in the array are being accessed, or something. The recursive version always accesses elements in the same areas, but the iterative version iterates over the array log(n) times. The number of accesses are no different, but the order of the accesses changes quite a bit. In the C version the iterative merge sort is much faster than the recursive one, which is what I would expect.

Maybe Java has its own caching system running behind the scenes, separately from any bare metal hardware caching...? Or maybe it isn't communicating its access intentions to the system properly, resulting in a ton of cache misses?

Anyway, I asked about it on Stack Overflow here: http://stackoverflow.com/questions/23121831/why-is-my-bottom-up-merge-sort-so-slow-in-java

artob commented 10 years ago

Not sure if it might help in this case, but the Mechanical Sympathy blog (from the creators of the LMAX Disruptor) has lots of great material on how to make JVM code observe hardware effects (cache line alignment, false sharing, etc) where possible. For example, a couple of representative posts from the archives:

BonzaiThePenguin commented 10 years ago

Thanks for the help. One thing that article made me realize was that the Java version is unique in that the Test class is allocated on the heap rather than inline within the array itself. The difference in access patterns for the parent array could easily change the amount of cache misses there.

I'm also going to see if a post-order bottom-up traversal is faster than the standard bottom-up design, as that merges the same areas of the array multiple times before moving on to the next chunk, and only traverses the entire array once. This isn't something that could be adapted to WikiSort easily, unfortunately, but it'd at least help explain where the slowdown is coming from.

BonzaiThePenguin commented 10 years ago

That write combining one is interesting. It's possible the C and C++ compilers are transforming the code behind-the-scenes to match the hardware's features, while the Java version makes no such attempt.

artob commented 10 years ago

Yeah, the tricks you have to pull in Java to get to parity with C++ are pretty interesting and sometimes counterintuitive. That blog's fascinating reading, and the performance they've achieved with the Disruptor is impressive. Lots of relevant material (and updates on the posts) in the comments as well.

BonzaiThePenguin commented 10 years ago

I switched the C++ version over to an array of pointers to Test classes, but it didn't suffer the same slowdown the Java version experienced. Argh... not sure I'm willing to redesign the algorithm just for the Java version, since I doubt TimSort is going anywhere anytime soon.

artob commented 10 years ago

Pinging @mjpt777, in case he might be willing to lend some of his insight on this question. (And a fast WikiSort implementation in Java might conceivably well benefit him, too.)