It seems like the bucket sort, requiring random memory access, is taking a lot of time. Here are two options:
Instead of using bucket sort, fill memory in one pass, then run quicksort (same protocol, potentially-faster implementation, because the sort is a more local operation).
Explore an entirely new protocol, where we just sort the (hash, preimage) pairs by the hash, divide that memory into chunks, and use those chunks as input to the outer PoW.
The generalized birthday PoW paper requires at least one full sort of the hash values, so maybe we could be even faster for proving if we only had one full sort as well.
It seems like the bucket sort, requiring random memory access, is taking a lot of time. Here are two options: