sabre algorithm improvements in transpilation at level 3 not in effect

johannesgreiner commented 2 years ago

Environment

Qiskit Terra version: 0.22.2 (vs. 0.21.1)
Python version: 3.8.2
Operating system: Windows

What is happening?

transpilation at optimization_level=3 does not (significantly) perform better in terra 0.22.2 than 0.21.1 despite improvements in sabre algorithm.

How can we reproduce the issue?

both in terra 0.22.2 and 0.21.1 run:

qc.x(5)
qc.h(range(6))
qc.cx(range(5),5)
qc.h(range(5))
qc.measure(range(5), range(5))
qc.draw('mpl')

# Transpile the circuit 'qc' 1000 times
trans_circ_list = transpile([qc]*1000, backend, optimization_level=3)

# get the number of cnot gates
cx_counts = np.asarray([circ.count_ops().get('cx', 0) for circ in trans_circ_list])

# create violin plot
import seaborn as sns
ax = sns.violinplot(data=[cx_counts])

What should happen?

Acc to level 3 preset , there should be 20 swap_trials performed for each iteration of transpile. So, the performance should be improved by these 20 iterations. Instead, there is no visible difference, but doing 20 iterations of transpile manually in qiskit 0.21.1 performs much better (will take about 60mins on a 4core-CPU):

global_circ_list=[]

for i in range(1000):
    # Transpile the circuit 'qc' 20 times
    trans_circ_list_loop = transpile([qc]*20, backend, optimization_level=3)

    # get the number of cnot gates
    cx_counts_loop = np.asarray([circ.count_ops().get('cx', 0) for circ in trans_circ_list_loop])
    #print('Number of cnots:', cx_counts_loop)

    # select the circuit with the lowest number
    best_idx_loop = np.argmin(cx_counts_loop)
    best_trans_qc_loop = trans_circ_list_loop[best_idx_loop]
    global_circ_list.append(best_trans_qc_loop)

cx_counts_global = np.asarray([circ.count_ops().get('cx', 0) for circ in global_circ_list])

combining all in the plot shows that manually pre-transpiling in the older qiskit version performs much better

pretransp violin plot qiskit cnots comparison

Any suggestions?

there might be a minimization step missing, or a mixup of max_iterations and swap_trials in https://github.com/Qiskit/qiskit-terra/blob/90b158c7e02432db957762e58c4c2ed75d89a1ca/qiskit/transpiler/preset_passmanagers/level3.py#L140

jakelishman commented 2 years ago

This isn't a fair comparison, which is likely where the difference is coming in. swap_trials modifies how many different attempts SabreSwap makes each time it's called, but when you rerun transpile, you get different randomisation for the initial layout chosen in SabreLayout (which isn't subject to repeats), before SabreSwap even runs for the first time. The final output of the transpilation is more susceptible to changes in that layout, but right now we're not doing multiple attempts of that component (only the internal routing done within SabreLayout).

jakelishman commented 2 years ago

I would expect to see a similar improvement for your circuit if you tried the same repeated transpile on 0.22 as you did on 0.21, for the same reason.

nonhermitian commented 2 years ago

I think the point is that the parallel swap trials should be picking the lowest swap count circuit, when it is clear that it is currently not. The goal was to eliminate (to 1st order) the need for users to have to do this themselves now that the swap mapper is much faster

jakelishman commented 2 years ago

We're not finding a lower swap count and then throwing it away. The trick is that a full transpile runs the entire layout pass with new seeds, whereas we're just reseeding the internal routing passes used within the layout pass. It was much easier to get that to run in parallel than repeating the entire layout pass lots of times, because it happens in Rust space, not Python space.

We are looking to revise the defaults, and re-running the whole layout is something we're interested in, for sure. I'm just saying that this is a bug report suggesting there's an actual mistake in SabreLayout, and I don't think that's the case - what's happened here was by design.

nonhermitian commented 2 years ago

Ok, then it is a bit of a misunderstanding then. We will just go back to telling people to transpile multiple times again for best results. I think this can probably be closed since this is currently expected behavior.

jakelishman commented 2 years ago

We can leave it open to track - this is something we want to improve on.

johannesgreiner commented 2 years ago

Thanks for clarifying. More of a feature request that you already have in mind then

mtreinish commented 2 years ago

I pushed up https://github.com/Qiskit/qiskit-terra/pull/9116 to do multiple seed trials for the combination of layout and routing here. I still need to tune the PR and characterize the performance (both in quality and runtime). But if you'd like to give that a try and see how it works for your use case that would be useful feedback while it's still under review/being developed.

mtreinish commented 2 years ago

Playing a bit locally with the bv example you were using, for the bernstein vazirani circuit the output quality is definitely primarily a function of the layout. I did a nested sweep to 250 trials of both layout and routing trials with #9116 applied. The script is below (which will only work if you do pip install git+https://github.com/mtreinish/qiskit-core@sabre-layout-rust and have a rust compiler available).

import csv
import statistics
import time

import numpy as np
import retworkx as rx

from qiskit import QuantumCircuit
from qiskit.converters import circuit_to_dag, dag_to_circuit
from qiskit.transpiler import CouplingMap
from qiskit.circuit.library import QuantumVolume, QFT
from qiskit.transpiler.passes import SabreLayout, Unroll3qOrMore
from qiskit.providers.fake_provider import FakeMumbaiV2
from qiskit.compiler import transpile

def bench_bv():
    rng = np.random.default_rng(42_11_11_2022)
    with open("bv.csv", "w", newline="") as csvfile:
        times_writer = csv.writer(csvfile)
        times_writer.writerow(["swap_trials", "layout_trials", "run_time", "swap_count", "depth"])
        basis_gates = ["cx", "x", "sx", "rz", "reset", "delay"]
        backend = FakeMumbaiV2()
        cmap = backend.coupling_map

        qc = QuantumCircuit(6, 5)
        qc.x(5)
        qc.h(range(6))
        qc.cx(range(5),5)
        qc.h(range(5))
        qc.measure(range(5), range(5))
        for seed in rng.integers(0, 4294967295, size=1, dtype=int):
            for layout_trials in sorted(set(np.linspace(1, 250, dtype=int))):
                for swap_trials in sorted(set(np.linspace(1, 250, dtype=int))):
                    print(f"Layout trials: {layout_trials}, Swap Trials: {swap_trials}")
                    super_pass = SabreLayout(coupling_map=cmap, seed=seed, swap_trials=swap_trials, layout_trials=layout_trials)
                    dag = circuit_to_dag(qc)
                    start = time.perf_counter()
                    res = super_pass.run(dag)
                    stop = time.perf_counter()
                    run_time = stop - start
                    swap_count = res.count_ops().get('swap', 0)
                    out_depth = res.depth()
                    times_writer.writerow([swap_trials, layout_trials, run_time, swap_count, out_depth])
                    print(f"Run time: {run_time}, non_local gates: {swap_count}, depth: {out_depth}")

if __name__ == "__main__":
    bench_bv()

which yielded this result:

bv.csv

The swap count and depth decrease only as we increase the number of layout trials. But I think this is something that doesn't hold true for all circuits. Like I know with qft we can get improved output quality when running with just higher swap trials

nonhermitian commented 2 years ago

This is interesting. Let me play with the notebooks that I have looking at some of the other circuits.

nonhermitian commented 2 years ago

One thing of interest is 5Q tests on Quito (Fake version). It seems that I consistently get a larger cx gate count then a year ago where the max was ~85 or so: https://quantum-enablement.org/posts/2021/2021-10-31-best_swap_mapper_qiskit.html#efficient-su2-full

Now I consistently get >100 (on the branch above) using:


import numpy as np
from qiskit import QuantumCircuit
from qiskit.providers.fake_provider import FakeMumbaiV2, FakeQuitoV2
from qiskit.compiler import transpile

from qiskit.circuit.library import QuantumVolume, EfficientSU2, TwoLocal, QFT

backend = FakeQuitoV2()
qc = EfficientSU2(5, entanglement='full')
trans_circs = transpile([qc]*20, backend, optimization_level=3)
[circ.count_ops().get('cx',0) for circ in trans_circs]

E.g.

I need to decouple the notebook from Tket before running everything, but what is given here is the first thing I tried

mtreinish commented 2 years ago

Hmm, yeah I was seeing bad performance with that branch in other benchmarks too: https://github.com/Qiskit/qiskit-terra/pull/9116#issuecomment-1311955595 I'm still not super confident with it which is why it's [WIP] still, I probably broke something or am doing something else wrong in the PR.

At least running the EfficientSU2 example on main and 0.22.2 returns 68 for all the circuits.

mtreinish commented 2 years ago

I think that I fixed the issues in #9116 that was causing poor results before. The output of running the EfficientSU2 compilation with the PR applied is now [72, 68, 68, 68, 68, 72, 68, 68, 68, 68, 72, 68, 68, 72, 68, 68, 72, 72, 68, 68] (the behavior with the BV circuits remains unchanged).

Qiskit / qiskit