Qiskit / qiskit

Qiskit is an open-source SDK for working with quantum computers at the level of extended quantum circuits, operators, and primitives.
https://www.ibm.com/quantum/qiskit
Apache License 2.0

Transpiler yields less optimal circuits on larger topologies #10160

Open nonhermitian opened 1 year ago

nonhermitian commented 1 year ago

Environment

I have empirically noticed for some time that compiling circuits against smaller device topologies, e.g. a 16Q fake IBM backend versus larger ones, yields improved output circuits, as measured by lower swap counts / fewer added 2Q gates. Intuitively this does not make much sense, as the smaller systems can be viewed as sub-graphs of the larger ones, and therefore the same optimal solutions found on the smaller graphs should also be found within the larger topologies.

This is quite problematic because we are in a situation where we have large numbers of qubits, but the number of effectively usable qubits is less than the full device; that is exactly where this issue pops up.

What is happening?

To see this, I have plotted the number of added CX gates when compiling all-ones BV circuits, an EfficientSU2 ansatz with circular entanglement, and QFT circuits as a function of circuit size against various backends. Here 250 transpilations were performed using optimization_level=3. Bars show one standard deviation of the resulting distribution of added 2Q gates. It is clear that a smaller device topology yields better circuits, or at least does no worse.

[figure: added CX gates vs. circuit width for BV, EfficientSU2, and QFT circuits on backends of different sizes]

How can we reproduce the issue?

Repeat the above, or use other circuits.
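
A minimal sketch of this kind of benchmark is below. It is only illustrative of the setup described above, assuming FakeGuadalupeV2 (16Q) and FakeWashingtonV2 (127Q) from qiskit.providers.fake_provider as stand-in backends, a reduced trial count, and num_nonlocal_gates() as a rough proxy for added 2Q gates:

import numpy as np
from qiskit import QuantumCircuit, transpile
from qiskit.providers.fake_provider import FakeGuadalupeV2, FakeWashingtonV2

def bv_all_ones(n):
    # Canonical all-ones Bernstein-Vazirani circuit: n data qubits plus one target
    qc = QuantumCircuit(n + 1, n)
    qc.x(n)
    qc.h(range(n + 1))
    for i in range(n):
        qc.cx(i, n)
    qc.h(range(n))
    qc.measure(range(n), range(n))
    return qc

def added_2q_gates(circ, backend, trials=50):
    # Mean/std of 2Q gates added by layout+routing, relative to the input circuit
    base = circ.num_nonlocal_gates()
    counts = [
        transpile(circ, backend, optimization_level=3).num_nonlocal_gates() - base
        for _ in range(trials)
    ]
    return np.mean(counts), np.std(counts)

for backend in (FakeGuadalupeV2(), FakeWashingtonV2()):  # 16Q vs 127Q
    print(backend.name, added_2q_gates(bv_all_ones(10), backend))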

What should happen?

An optimal circuit found on a smaller machine should be discoverable on a larger device since they are sub-graphs.

Any suggestions?

Given that better solutions are directly linked to topology size, my guess is that this is an issue with search space size, where somehow the transpiler is not finding the better results that exist in the larger space. @ashsaki has looked into this a bit with respect to the initial layout.

jakelishman commented 1 year ago

With these plots, did you have an initial_layout set? If not, I'm curious what the results are if the initial layout on a 127q device is set to match the topology of the initial_layout found by the transpiler for a 16q device (or whatever). Knowing that might help determine whether we should focus efforts on smarter ways to search the initial_layout space as part of the layout passes, or smarter ways to solve the routing problem afterwards.
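
As a reference for the kind of experiment being asked about here, a minimal sketch that pins the run on a 127Q device to an explicit set of physical qubits is below; FakeGuadalupeV2/FakeWashingtonV2 and the placeholder qubit indices are assumptions for illustration, not a measured configuration:

import math
from qiskit import transpile
from qiskit.circuit.library import EfficientSU2
from qiskit.providers.fake_provider import FakeGuadalupeV2, FakeWashingtonV2

qc = EfficientSU2(16, entanglement='circular', reps=6, flatten=True)
qc.assign_parameters([math.pi / 2] * len(qc.parameters), inplace=True)
qc.measure_all()

# Baseline: let the transpiler pick the layout on the small device
small = transpile(qc, FakeGuadalupeV2(), optimization_level=3)

# On the 127Q device, pin the virtual qubits to a hand-chosen connected region;
# the indices below are placeholders, not an actual good region on the chip
pinned = list(range(16))
large = transpile(qc, FakeWashingtonV2(), optimization_level=3, initial_layout=pinned)

print(small.num_nonlocal_gates(), large.num_nonlocal_gates())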

nonhermitian commented 1 year ago

No initial layout was set.

nonhermitian commented 1 year ago

Following up on this a bit, I compared results against PyTket for the case of circular entanglement. Because Tket is deterministic, but the Qiskit methods are not, I transpiled circuits of widths from 2 to 20 qubits 50 times each, for 27Q, 127Q, and 433Q devices, and looked at the mean and spread of the results vs Tket.

On 27Q, Qiskit on average performs quite well versus Tket, and in the vast majority of cases the min value found after 50 trials was better than the deterministic Tket result. Every Qiskit transpilation is also much faster than Tket, so in this case the time to transpile the circuit 50 times in Qiskit is about the same as 1 circuit in Tket.

[figure: Qiskit vs. Tket added 2Q gates on the 27Q device]

However, because of this issue, as the topology grows the mean number of additional gates due to swaps increases, and it becomes harder to out-perform Tket.

[figures: Qiskit vs. Tket added 2Q gates on the 127Q and 433Q devices]

What we see is that, for identical circuits, the mean number of additional gates increases with topology size, and on average Tket begins to pull ahead. Basically, as the gate counts and mean values increase, we have a harder and harder time finding a better routing because we need to hit the tail end of a distribution that is moving further away. So on average, for this circuit at least, using the Qiskit transpiler is no better than Tket (but is faster). For 433Q, Tket is by far better on average in most cases.

So in situations like the Runtime, where this repeated transpilation is not carried out, the end user is getting subpar performance relative to other tools, and it becomes more and more imperative that the user transpiles locally. There is of course a cost to transpiling multiple times, and that cost grows with topology size for Qiskit, whereas Tket's seems to remain relatively constant.

[figure: transpile time vs. circuit width for Qiskit and Tket]
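
As an aside, the "transpile repeatedly and keep the best" workflow described above can be sketched as follows; the circuit, fake backend, and trial count are placeholders for whatever is actually being benchmarked:

import math
from qiskit import transpile
from qiskit.circuit.library import EfficientSU2
from qiskit.providers.fake_provider import FakeWashingtonV2

qc = EfficientSU2(16, entanglement='circular', reps=6, flatten=True)
qc.assign_parameters([math.pi / 2] * len(qc.parameters), inplace=True)
qc.measure_all()
backend = FakeWashingtonV2()

# Run many seeded transpilations and keep the output with the fewest 2Q gates
candidates = (
    transpile(qc, backend, optimization_level=3, seed_transpiler=seed)
    for seed in range(50)
)
best = min(candidates, key=lambda circ: circ.num_nonlocal_gates())
print(best.num_nonlocal_gates())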

jakelishman commented 1 year ago

Can you share the circuits and scripts that you're benchmarking here? Something seems odd about our transpile times being hundreds of seconds, because I've successfully routed circuits of 1,000+q in a fraction of that time.

jakelishman commented 1 year ago

Fwiw, I have a pretty good guess at what's going on here: our random sampling of start locations on entry to SabreLayout is much less likely to be something decent when we've got more qubits in the map (more chance of qubits starting miles away from each other), which means we spend most of our forwards/backwards passes of routing just bringing the "initial" qubits closer together, before we even start on the actual meat of polishing that guess.

Our heuristic for choosing initial layouts to try at the moment is a uniform random sampling over the "n_total permute n_virtual" mappings, with no regard to initial distance, symmetries of the coupling map, or anything, whereas tket has deterministic routines to choose something decent, but they are generally much slower (which is in part why I'm really wondering about your compile times).
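
To illustrate that intuition, here is a small self-contained sketch (an editor's illustration, not from the discussion) that estimates the average pairwise distance between 16 uniformly sampled physical qubits on heavy-hex coupling maps of roughly the 27Q/127Q/433Q sizes; the larger the map, the further apart a random initial guess tends to start:

import random
from qiskit.transpiler import CouplingMap

def mean_pairwise_distance(cmap, n_virtual=16, samples=100, seed=1234):
    # Average graph distance between randomly chosen physical qubits,
    # mimicking a uniform random initial layout guess for n_virtual qubits
    rng = random.Random(seed)
    total = 0.0
    count = 0
    for _ in range(samples):
        chosen = rng.sample(range(cmap.size()), n_virtual)
        for i in range(n_virtual):
            for j in range(i + 1, n_virtual):
                total += cmap.distance(chosen[i], chosen[j])
                count += 1
    return total / count

# Heavy-hex lattices of code distance 3, 7, 13 have 19, 115, and 409 qubits,
# roughly comparable to the 27Q/127Q/433Q devices discussed above
for d in (3, 7, 13):
    cmap = CouplingMap.from_heavy_hex(d)
    print(cmap.size(), round(mean_pairwise_distance(cmap), 2))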

alexanderivrii commented 1 year ago

Jake, wouldn't this be mitigated by Sabre's forward-backward approach of updating layouts?

jakelishman commented 1 year ago

My strong hunch is that it'll be somewhat mitigated, but not sufficiently - we only do a finite number of forward/backward passes, and there's nothing that stops Sabre routing towards the end from routing gates "away" from the "centre" of the populated qubits, which would need more passes to correct. Much of the back-and-forth is meant to help with settling on the best initial permutation from within a given set of initial physical qubits, but in the case of a coupling map that's much larger than the number of virtual qubits, we have to use up some of our back-and-forth passes just trying to get a decent combination from the initial set, before we can start worrying about the best permutation within that.

There's also no guarantee that Sabre routing will bring all the qubits "close" to each other at the start/end - only those that interact nearby - and that potentially stymies routing's choices later down the line, since one swap may only ever affect one gate, whereas if all the qubits are packed into a dense subspace of the coupling graph, there's the opportunity for one swap to help several gates in the lookahead layers.

This is all super heuristic and a bit of me just speaking from my intuition about how Sabre achieves its results rather than a formal study, but it sounds pretty plausible to me.

mtreinish commented 1 year ago

I've been playing some more with #10169 and I'm not sure a "better" initial layout guess necessarily leads to better results. That PR sets one layout trial to use a partial isomorphic subgraph (it's built by traversing the DAG and returns the mapping from the deepest node in the DAG), and then, if extra qubits are needed for a full layout (which isn't always the case; for example, for a circular-entanglement EfficientSU2 a partial isomorphic subgraph is full width, it just doesn't include the last cx that makes it circular), it adds the nearest neighbors first. In testing the PR so far with a 16 qubit EfficientSU2 it rarely selects the starting partial layout as the one with the fewest SWAPs, and when it does select the partial layout it's not always the best possible result (based on other attempts which had lower 2q gate counts). This specific example might be a bad test for the approach in #10169 because the partial layout is the full circuit width and doesn't account for the last interaction, so I still need to do more exploration (also the code in the PR could have bugs, it's still a WIP). When I was testing the standalone PartialVF2Layout pass in https://github.com/mtreinish/vf2_partial_layout, and also trying the hybrid-with-sabre approach last year in #9174, what I found is that the search space ends up being too constrained when limiting the starting point to partial mappings - or at least that was my intuition at the time - and the early test results with #10169 seem to indicate the same thing.

I'm wondering if there is more of a middle ground where we can influence the selected layout towards a better result without constraining the search space too much. For #10169 the difference between that and my earlier attempt in #9174 is that the newer PR only uses one partial layout trial with nearest neighbors for the free qubits and the other trials are all fully random, while the earlier attempt used a partial layout for all trials, with random permutations of the free qubits outside the partial layout. The other thing I want to try in this vein is to use DenseLayout instead of a partial VF2 layout as the starting point: give Sabre one trial seeded with the densest subgraph in the coupling map and see how that performs relative to fully random trials and also the partial isomorphic case.

That all being said, I'm wondering if maybe it makes more sense in the meantime to just dial up the heuristic effort knobs on SabreLayout for optimization level 3. So instead of running 20 layout trials, 4 layout iterations (back and forth routing calls), and 20 routing trials, we do something like multiply that by 5 and run 100 layout trials, 20 layout iterations, and 100 routing trials. We probably have sufficient time budget for O3 to spend a bit more time on layout with fully random trials, especially given the pass's performance has improved since we first set the number of trials we run.
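
For concreteness, a sketch of what that scaled-up configuration might look like when constructing the layout pass directly; the keyword arguments max_iterations, swap_trials, and layout_trials exist on SabreLayout in recent releases, but the specific numbers here are just the 5x proposal above, not a committed default:

import math
from qiskit.circuit.library import EfficientSU2
from qiskit.providers.fake_provider import FakeWashingtonV2
from qiskit.transpiler import PassManager
from qiskit.transpiler.passes import SabreLayout

qc = EfficientSU2(16, entanglement='circular', reps=6, flatten=True)
qc.assign_parameters([math.pi / 2] * len(qc.parameters), inplace=True)

cmap = FakeWashingtonV2().coupling_map

# 5x the current O3 settings (20 layout trials, 4 iterations, 20 routing trials)
layout_and_route = SabreLayout(
    cmap,
    max_iterations=20,
    swap_trials=100,
    layout_trials=100,
    seed=42,
)

# In its default mode SabreLayout both selects the layout and routes the circuit
routed = PassManager([layout_and_route]).run(qc)
print(routed.num_nonlocal_gates())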

mtreinish commented 1 year ago

I ran a quick test to see what the impact of extended set size is on the output here. After doing more testing with different types of input layout seeding (using various forms of VF2PartialLayout, DenseLayout, and ordering by betweenness centrality), I was finding none made a difference, but also that the swap counts had wide swings between different seed trials. My thinking is that the extra degrees of freedom in the larger coupling map are also impacting the routing search in addition to the layout, and we're just not searching deeply enough in the lookahead portion of the heuristic to weigh against going the "wrong way" with swap insertion. To test this I ran a transpile() of:

import math
from qiskit.circuit.library import EfficientSU2

qc = EfficientSU2(16, entanglement='circular', reps=6, flatten=True)
qc.assign_parameters([math.pi/2] * len(qc.parameters), inplace=True)
qc.measure_all()

against IBM Seattle with opt level 3 and a fixed seed, then modified the Sabre code to take the extended set size from an environment variable. The results are:

[figure: added 2q gates vs. extended set size]

The best case result found on IBM Guadalupe in my previous testing of this circuit was 168 2q gates, which matches the graph when the extended set size was 72. There was also no measurable impact on runtime from this change, as even with the larger sizes the runtime of transpile() was always < 5 sec and was realistically bound by the optimization loop time, which becomes a function of the output size from routing. I'm thinking the answer here is to use a larger extended set size (in the short term, while working on a more thorough lookahead heuristic) and also to increase the trial counts for level 3.

alexanderivrii commented 1 year ago

If we are specifically interested in transpiling all-ones BV circuits, then we can do a much better job by implementing the CX-logic of these circuits so that it adheres to linear-nearest-neighbor connectivity.

More precisely, the two circuits below are equivalent, while the second circuit has a perfect layout whenever the connectivity map of the device contains a line of the right size (a construction sketch follows the drawings):

     ┌───┐      ░                           ░ ┌───┐┌─┐            
q_0: ┤ H ├──────░───■───────────────────────░─┤ H ├┤M├────────────
     ├───┤      ░   │                       ░ ├───┤└╥┘┌─┐         
q_1: ┤ H ├──────░───┼────■──────────────────░─┤ H ├─╫─┤M├─────────
     ├───┤      ░   │    │                  ░ ├───┤ ║ └╥┘┌─┐      
q_2: ┤ H ├──────░───┼────┼────■─────────────░─┤ H ├─╫──╫─┤M├──────
     ├───┤      ░   │    │    │             ░ ├───┤ ║  ║ └╥┘┌─┐   
q_3: ┤ H ├──────░───┼────┼────┼────■────────░─┤ H ├─╫──╫──╫─┤M├───
     ├───┤      ░   │    │    │    │        ░ ├───┤ ║  ║  ║ └╥┘┌─┐
q_4: ┤ H ├──────░───┼────┼────┼────┼────■───░─┤ H ├─╫──╫──╫──╫─┤M├
     ├───┤┌───┐ ░ ┌─┴─┐┌─┴─┐┌─┴─┐┌─┴─┐┌─┴─┐ ░ └───┘ ║  ║  ║  ║ └╥┘
q_5: ┤ X ├┤ H ├─░─┤ X ├┤ X ├┤ X ├┤ X ├┤ X ├─░───────╫──╫──╫──╫──╫─
     └───┘└───┘ ░ └───┘└───┘└───┘└───┘└───┘ ░       ║  ║  ║  ║  ║ 
c: 5/═══════════════════════════════════════════════╩══╩══╩══╩══╩═
                                                    0  1  2  3  4 
     ┌───┐      ░                                               ░ ┌───┐┌─┐            
q_0: ┤ H ├──────░───■───────────────────────────────────────■───░─┤ H ├┤M├────────────
     ├───┤      ░ ┌─┴─┐                                   ┌─┴─┐ ░ ├───┤└╥┘┌─┐         
q_1: ┤ H ├──────░─┤ X ├──■─────────────────────────────■──┤ X ├─░─┤ H ├─╫─┤M├─────────
     ├───┤      ░ └───┘┌─┴─┐                         ┌─┴─┐└───┘ ░ ├───┤ ║ └╥┘┌─┐      
q_2: ┤ H ├──────░──────┤ X ├──■───────────────────■──┤ X ├──────░─┤ H ├─╫──╫─┤M├──────
     ├───┤      ░      └───┘┌─┴─┐               ┌─┴─┐└───┘      ░ ├───┤ ║  ║ └╥┘┌─┐   
q_3: ┤ H ├──────░───────────┤ X ├──■─────────■──┤ X ├───────────░─┤ H ├─╫──╫──╫─┤M├───
     ├───┤      ░           └───┘┌─┴─┐     ┌─┴─┐└───┘           ░ ├───┤ ║  ║  ║ └╥┘┌─┐
q_4: ┤ H ├──────░────────────────┤ X ├──■──┤ X ├────────────────░─┤ H ├─╫──╫──╫──╫─┤M├
     ├───┤┌───┐ ░                └───┘┌─┴─┐└───┘                ░ └───┘ ║  ║  ║  ║ └╥┘
q_5: ┤ X ├┤ H ├─░─────────────────────┤ X ├─────────────────────░───────╫──╫──╫──╫──╫─
     └───┘└───┘ ░                     └───┘                     ░       ║  ║  ║  ║  ║ 
c: 5/═══════════════════════════════════════════════════════════════════╩══╩══╩══╩══╩═
                                                                        0  1  2  3  4 
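A sketch of how the second, line-friendly form might be generated programmatically for any width (an illustration of the CX staircase drawn above, not code from an existing pass):

from qiskit import QuantumCircuit

def bv_all_ones_lnn(n):
    # All-ones BV rewritten as a CX staircase that only touches nearest
    # neighbours on a line of n+1 qubits (qubit n is the |-> target)
    qc = QuantumCircuit(n + 1, n)
    qc.h(range(n))
    qc.x(n)
    qc.h(n)
    qc.barrier()
    # staircase down towards the target...
    for i in range(n):
        qc.cx(i, i + 1)
    # ...and back up, uncomputing the intermediate parities
    for i in reversed(range(n - 1)):
        qc.cx(i, i + 1)
    qc.barrier()
    qc.h(range(n))
    qc.measure(range(n), range(n))
    return qc

print(bv_all_ones_lnn(5))
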
nonhermitian commented 1 year ago

BV is just an example that requires lots of swaps, if done in the canonical manner; it can be rewritten using resets as a 2Q problem (a sketch of that rewrite is below). The issue here is that the swap mappers seem to be having a harder time on circuits requiring swaps as the search space grows.
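
For reference, a minimal sketch of the reset-based 2Q rewrite mentioned above, reusing a single data qubit via mid-circuit reset while qubit 1 stays in |-> as the target (an illustration, not a pass in Qiskit):

from qiskit import QuantumCircuit

def bv_all_ones_two_qubits(n):
    # All-ones BV on n bits using only 2 qubits: qubit 0 is reused for every
    # data bit via mid-circuit reset, qubit 1 is the |-> phase-kickback target
    qc = QuantumCircuit(2, n)
    qc.x(1)
    qc.h(1)
    for i in range(n):
        qc.h(0)
        qc.cx(0, 1)      # every secret bit is 1, so each round applies a CX
        qc.h(0)
        qc.measure(0, i)
        if i < n - 1:
            qc.reset(0)
    return qc

print(bv_all_ones_two_qubits(5))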