Performance difference between circuit synthesis and unitary synthesis when synthesizing a unitary with size greater than 3

HaoTy commented 2 months ago

When max_synthesis_size is 4 or above, directly synthesizing a unitary takes significantly longer than wrapping the unitary in a circuit. Reproducible example:

import numpy as np
import qiskit
from bqskit.ir.circuit import Circuit
from bqskit.compiler import Compiler, MachineModel, compile
from bqskit.passes import (
    LEAPSynthesisPass,
    SetModelPass,
    SetTargetPass,
)
from bqskit.qis import CouplingGraph, UnitaryMatrix

n = 4
circuit = qiskit.circuit.library.QFT(n)
target_unitary = qiskit.quantum_info.Operator(circuit).to_matrix()
compiled_circuit = compile(
    Circuit.from_unitary(target_unitary), # fast
    # target_unitary, # extremely slow
    MachineModel(n, CouplingGraph.linear(n)),
    optimization_level=1,
    max_synthesis_size=n,
)

In addition, the above compiled circuit is only correct for the all-zero initial state:

print(1 - np.abs(
    np.trace(
        compiled_circuit.get_unitary()
        @ target_unitary.conj().T
    )
) / 2**n)

which does not seem to be by design judging from the workflow implementation.

Directly running LEAP as follows correctly synthesizes the unitary:

with Compiler() as compiler:
    workflow = [
        SetModelPass(MachineModel(n, CouplingGraph.linear(n))),
        SetTargetPass(UnitaryMatrix(target_unitary)),
        LEAPSynthesisPass(),
    ]
    compiled_circuit = compiler.compile(
        Circuit.from_unitary(target_unitary),
        workflow,
    )

edyounis commented 2 months ago

Thanks for the detailed code and questions in the post. It makes it easy to understand your situation and help. There are two issues here:

1) Why does compile give something seemingly incorrect? 2) Why does the direct synthesis workflow take longer than compile?

I have been able to reproduce both of these, but neither is a bug. The first is because compile performs mapping which does not respect input and output qudit orderings. If you set with_mapping=True in the compile call, you will get two additional return values describing the input and output qudit orderings:

compiled_circuit, pre, post = compile(
    Circuit.from_unitary(target_unitary), # fast
    # target_unitary, # extremely slow
    MachineModel(n, CouplingGraph.linear(n)),
    optimization_level=1,
    max_synthesis_size=n,
    with_mapping=True,
)

Which can then be used to verify the problem:

from bqskit.qis import PermutationMatrix
Pi = PermutationMatrix.from_qubit_location(n, pre)
Pf = PermutationMatrix.from_qubit_location(n, post)
print(1 - np.abs(
    np.trace(
        compiled_circuit.get_unitary()
        @ (Pf.T @ target_unitary @ Pi).conj().T
    )
) / 2**n)

This gives 0.0, but I wouldn't be surprised if you get something with an e-15 on some runs. The verification procedure is also equivalent in math to:

print(compiled_circuit.get_unitary().get_distance_from(Pf.T @ target_unitary @ Pi, 1))

For (2), the different workflows take varying amounts of time because they do different things. The "Off-the-Shelf" compile workflow at optimization level 1 will synthesize the circuit with all-to-all connectivity and then map using SABRE. The direct synthesis call will directly synthesize for linear connectivity. This will take (usually) longer as the search procedure will need to go deeper into the search tree to find a solution for less connected architectures. These differences lead to different results in the quality of output:

print(compiled_circuit.gate_counts)

yields

{CNOTGate: 23, U3Gate: 53}

for compile, and

{RYGate: 38, CNOTGate: 19, U3Gate: 4, RXGate: 19, RZGate: 19}

for the direct synthesis call. So, even though the direct synthesis call took a little bit longer, it gave you better results. These will definitely vary from run to run unless you fix the seed. Both of them are not the best we can do with BQSKit. If really want the best result and are okay trading some more time for quality, then you should use PAS or PAM for the QFT. This can be done with optimization_level=4 or by building the workflow yourself using build_seqpam_mapping_optimization_workflow

The real issue here may be documentation. If you feel that this information should be included somewhere in the documentation, I will gladly accept a PR. I will leave this comment open. Feel free to follow up with more questions or close it.

HaoTy commented 2 months ago

Thanks for your fast response and detailed explanation! I now understand bqskit better! However, the second issue is not

Why does the direct synthesis workflow take longer than compile?

but the opposite. The speed of the default compile unitary synthesis workflow was drastically slower for me than directly specifying

workflow = [
    SetModelPass(MachineModel(n, CouplingGraph.linear(n))),
    SetTargetPass(UnitaryMatrix(target_unitary)),
    LEAPSynthesisPass(),
]

even though they are almost the same (the differences are some IfThenElsePasses and a single qudit retarget pass for optimization level 1). I couldn't follow the debugging tracing very well due to the worker mechanism, but the extra time seemed to be spent somewhere else than on LEAP.

BQSKit / bqskit

Performance difference between circuit synthesis and unitary synthesis when synthesizing a unitary with size greater than 3 #241