Implement random greedy descent Ising solver

jackraymond commented 2 years ago

Randomized greedy descent means that spins are addressed at random. This could mean several different things:

[1] Spins are selected uniformly at random on each update. This means that in a single sweep of N updates, less than N spins are typically updated (typically exp(-1)N are left untouched), this will typically lead to inferior performance as an optimization heuristic, but allows certain theoretical guarantees on behaviour (in particular, some pathologies like oscillation at low temperature can be avoided). This is also the mode that best matches the 'Glauber dynamics' notion from physics, the mode we should use for emulation of classical physical dynamics.
Spins are selected uniformly at random without replacement (per sweep).I.e. on each sweep we update all spins in a permuted order. For practical purposes this is very close to the above process, and can be more performant in optimization, particularly for small numbers of sweep.
Per anneal process (per sample, or per batch of samples) the ordering is determined by a random permutation. I.e. we have a fixed ordering of update but determined randomly.

By contrast one might consider two non-random orderings of high interest:

[2] 'Sequential ordering'. I.e we update all spins in the order 1:N (or lexicographically - e.g. default bqm ordering). Assuming people order their variables in a sensible manner (so that local variables have local labels, i.e. this is often achieved thoughtlessly) this method can (perhaps surprisingly) often outperform randomized ordering, since variables can flip sequentially in a correlated manner. If people know something about how to systematically search the space, they can exploit such a mode (by manually permuting their Hamiltonian). Such a mode is also useful for analysis, debugging and proof of concept.
'Colored ordering'. When a graph coloring is provided (heuristic or optimal doesn't matter), variables of equal color allow for parallel updating. If parallelization is not exploited at the hardware level, then this is basically equivalent to a special case of sequential ordering. Understanding colored ordering is important since alternative hardware platforms (like GPU) can make use of the conditional independence to improve latency.

In summary, the two most important generalizations of steepest greedy are probably [1] and [2] IMO. But accommodating other cases might be useful.

jackraymond commented 2 years ago

Note that the above comments also apply to variations upon dwave-neal. This might be considered a seperate feature request. dwave-neal currently uses sequential ordering, but could be generalized to accept random ordering.

randomir commented 2 years ago

Thanks @jackraymond. I considered some of these (along with a few other), but it's great to have a specific use case in mind when thinking which one to implement first (and how).

jackraymond commented 2 years ago

It's not an exhaustive list. I could definitely make regular use of [1,2] though.

CatherineCMcGeoch commented 2 years ago

I believe we should aim for a highly efficient implementation that doesn't completely trash the guarantees of randomness. This would puts the kibosh on [1] (a call to the system RNG inside a tight innermost loop is out of the question efficiency-wise, if that's what you really meant) and [2] (fixed iteration order is most vulnerable to worst-case instead of average-case performance that randomness is supposed to guarantee). I would also rule out the colored-node version due to the overhead costs of computing the coloring. If you want to spend time finding a good-quality coloring, don't waste it on a greedy approach that is going to stop at the first local minimum it finds.

CatherineCMcGeoch commented 2 years ago

To my mind the option of generating a random permutation of nodes, during initialization, then stepping through it in the innermost loop, offers the best combination of efficiency and randomized robustness. For variety, each new sample could start at a random location in the permutation, and/or could incorporate random step sizes each time. This would provide the desired variation among samples, and choosing the ``next'' node in the greedy iteration would take a couple nanoseconds for an array access instead of, idk, hundreds of microseconds for a system call?

randomir commented 2 years ago

Re: random number generator speed -- we don't have to hard-code it to a slow(er) one. We can use xorshift128+ like dwave-neal does.

CatherineCMcGeoch commented 2 years ago

Sure fast RNG is fine with me, as long as there is decent variation from sample to sample.

It might be worth checking whether it really is faster than building single permutation array and re-using it with only the overhead cost of array access.

CatherineCMcGeoch commented 2 years ago

I guess given Jack's preferences maybe this should be a user parameter?

jackraymond commented 2 years ago

I believe we should aim for a highly efficient implementation that doesn't completely trash the guarantees of randomness. This would puts the kibosh on [1] (a call to the system RNG inside a tight innermost loop is out of the question efficiency-wise, if that's what you really meant) and [2] (fixed iteration order is most vulnerable to worst-case instead of average-case performance that randomness is supposed to guarantee). I would also rule out the colored-node version due to the overhead costs of computing the coloring. If you want to spend time finding a good-quality coloring, don't waste it on a greedy approach that is going to stop at the first local minimum it finds.

I don't see why slower speed would be a killer objection to [1], but it could be an argument against it being defaulted or being implemented as a first priority, particularly for optimization use cases. I think that following best-practice, common and simple conventions for randomness is more important. Particularly for comparison of classical dynamics and quantum dynamics, or for evaluation of pathological cases (those where randomization matters most). Obtaining a heuristic coloring can be achieved in O(N) operations, and for some common special cases (like coloring bipartite graphs, e.g. Chimera, RBM) the colorings are optimal. So I don't see that as a killer objection either.

CatherineCMcGeoch commented 2 years ago

Fair enough. I want to use this code for benchmarking against the QPU. I want to claim truthfully that this is a state-of-the art implementation that runs as efficiently as we (i.e. Rad) could make go. I want to defuse any suspicion that D-Wave deliberately made it run slower than it should. In the larger field of classical solvers, Greedy is not going to win prizes for best solution quality'' --- all it has going for it is speed at banging out samples. So that should be optimized as much as possible. At the same time, code that is advertised as implementing randomized greedy should be recognizably random, and return an appropriately diverse distribution of solutions. The closer totheoretical'' randomness the better, unless the code produces massive slowdowns violating the first goal. Using a fixed node traversal order in a solver called ``random greedy'' seems like false advertising to me.

CatherineCMcGeoch commented 2 years ago

I don't believe parallel updating based on pre-computing a coloring would lead to any real speedup if the code itself is not actually parallelized. But I'm assuming that the updating is done on the fly so there can be multiple greedy bit flips per sweep -- maybe that assumption is wrong?

jackraymond commented 2 years ago

In my experience, on lattices and bicliques for a variety of models, a colored ordering is typically worse for heuristic optimization than a random permutation (because correlated block moves are obstructed) when evaluated on a per sweep basis. Using a fixed ordering, I can achieve a colored ordering by relabeling my variables, if I wanted to. I am not particularly interested in that use case, unless it is exploited at hardware level. In matlab, I have exploited heuristic+exact colored orderings for very large speedups in the past.

randomir commented 2 years ago

We do one bit flip per sweep. At each step (sweep) the best direction/dimension/variable is picked (by energy contribution) and flipped.

Parallel updates based on coloring sound nice, but I doubt we would see a significant speedup on graphs of our typical scale (~10k nodes).

jackraymond commented 2 years ago

One other place we need to consider the role of randomness is the case of tied energies in SGD. How is a tie broken when steepness is tied, we should as a minima document how this is done given that we often study models with degenerate energy levels (the source of ties). Several possibilities exist, probably the most natural random and systematic choices are : broken uniformly at random, broken systematically towards the highest (numeric rank) variable.

jackraymond commented 2 years ago

@mhlr expressed an interest in the case of select uniformly at random amongst spin flips conditioned on them lowering energy.

In my enumeration above, I was implicitely assuming that a selected spin is only flipped if it lowers energy.

[1] is equal to this case. Although, the notion of 'time' becomes a bit different. The way I wrote everything falls into a 'sweep-based' notion of time, whereas thinking about small sets of moves is more like the 'n-fold way' detached from physical notions of time.

This brings up a question of how we judge convergence in [1], and also whether we allow zero-energy change flips. If only downward energy flips are performed, then once we have touched every spin without a change of energy we are converged. Equivalently once the set of downward-energy spin flips is empty we are converged. In case [1] we need to perform more than 1 sweep to meet this criteria - since some variables with potential to reduce energy may not be touched.

Allowing equal-energy flips is more consistent with the physical dynamics interpretation - (i.e. limit of zero temperature metropolis or gibbs sampling). However, in this case we have no guarantees of convergence, so we might need to think about this more. The set of zero or downward energy spin flips may not become empty when degenerate energy levels are present. Perhaps the best we can do is exit once the set no longer contains downward energy moves (although, this is not a guarantee against the energy going down if we ran for longer), or perhaps once we go once sweep with no decrease. Another natural option might be to break the symmetry (equiv to applying an infinitesimal external field), to alleviate the problem of zero energy, and then apply the standard convergence criteria.

mhlr commented 2 years ago

I was thinking of only non-zero flips since I would like the process to self terminate.

randomir commented 2 years ago

@jackraymond, our current SGD implementation is detached from any physical interpretation (as long as we consider the Hamiltonian to be nothing more than a function we're minimizing). At each step, we pick the first dimension (in BQM variable order) that improves objective the most. We stop when we reach a (local) minimum.

If you think we need something more physics-inspired, we can consider generalizing simulated annealing sampler as well.

jackraymond commented 2 years ago

My presentation of methods was more suitable for the sweep based methods (simulated annealing). greedy routines can be considered a 'quenching' limit of an SA routine, but it was pretty unhelpful to do it that way. My preferred implementation matches that of @mhlr [and matches [1] under a particular interpretation]. Keep a list of downward energy flips and select from it uniformly at random (adjusting set after each update).

CatherineCMcGeoch commented 2 years ago

Hi Rad, so is the only randomness in the SGD you describe due to initialization? I would prefer a version with a random tie-breaker (based on sweep order perhaps) rather than always preferring the lowest-index node --- more robust against worst-case scenarios. Any objections to doing that?

jackraymond commented 2 years ago

Whether systematic or random tie-breaking applies, most important thing is to document. If only randomized were available, it is possible in general to recover the rank ordering behaviour by adding a perturbation on the bqm linear terms of the form eps*range(num_var), (eps small compared to all gaps) so the randomized ordering is sufficient for me.

randomir commented 2 years ago

@CatherineCMcGeoch, I think I'd rather keep the current SGD deterministic, and implement a new random greedy descent solver (as this issue was originally about). Makes sense?

CatherineCMcGeoch commented 2 years ago

Sure, whatever you like. Are you going to keep the loop that sweeps through all n nodes? To my mind this is a weird coding quirk that physicists enjoy. In CS (greedy, SA, whatever) there is one loop that samples nodes in random order, flipping and updating on the fly. I assume that the extra loop makes it easier to simulate natural processes, but it doesn't seem like a good idea efficiency-wise.

CatherineCMcGeoch commented 2 years ago

And I think the concept of implementing separate solvers rather than one highly parameterized version that pleases everybody, is a good idea too. A broad-ranging benchmark test would want to try them all, anyway.

randomir commented 2 years ago

No, I agree, a random sampler does not need the sweeping loop.

I mean, we actually have a little known version of the SGD that uses a red-black tree to keep energy gains per variable flip sorted, so the selection of the steepest flip is O(1). But due to larger constant factor, this version is faster only for very large sparse problems.

CatherineCMcGeoch commented 2 years ago

Red black trees! I used to teach them! Or sometimes splay trees, just to mix things up! Though I did see a talk a several years ago suggesting that plain vanilla BSTs typically work just fine in most applications.

dwavesystems / dwave-greedy

Implement random greedy descent Ising solver #1