NVIDIA / cuda-quantum

C++ and Python support for the CUDA Quantum programming model for heterogeneous quantum-classical workflows
https://nvidia.github.io/cuda-quantum/

[RFC] Mapping pass #669

Closed boschmitt closed 5 months ago

boschmitt commented 1 year ago

The following PRs are related to this mapping pass:

  1. #666
  2. #667
  3. #668
  4. #761
  5. #610
  6. #837
  7. #838

Background

The physical qubits in quantum devices might not be fully connected, which means that not every pair can participate in the same two-qubit operation. These connectivity restrictions are known as coupling constraints. In our compiler, the IR has no mechanism to enforce them, i.e., we can apply an operation between any pair of qubits. Therefore, our quantum programs typically cannot be directly executed on these restricted quantum devices. We refer to them as unmapped programs and define virtual (or logical) qubits and operations as those present in them. The task of finding a mapping of virtual operations to allowed physical ones is known as quantum circuit mapping. Completing this task is not always possible without applying additional operations---typically, swaps.

We can model the connectivity requirements of an unmapped program using an undirected graph $G_r = (V, E_r)$, where $V = \{v_0, v_1, \ldots, v_{n-1}\}$ is the set of virtual qubits and $E_r$ is the set of qubit pairs $\{v_i, v_j\}$ required by the operations in the program. (Note that one-qubit operations can be safely ignored since the coupling constraints do not affect their mapping.) We can model the coupling constraints in a similar way: an undirected graph $G_p = (P, E_p)$, where $P = \{p_0, p_1, \ldots, p_{m-1}\}$ is the set of physical (device) qubits, and an edge $\{p_u, p_w\} \in E_p$ means that a two-qubit operation can be executed using the physical qubits $p_u$ and $p_w$.
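For illustration only (this structure is not part of the prototype), both graphs can be represented as plain sets of undirected edges:

```cpp
#include <algorithm>
#include <set>
#include <utility>

// Illustrative only: an undirected graph over qubits 0..n-1, stored as a set
// of edges in canonical (min, max) order. The prototype uses its own Device
// representation instead of this ad-hoc structure.
struct QubitGraph {
  unsigned numQubits = 0;
  std::set<std::pair<unsigned, unsigned>> edges;

  void addEdge(unsigned a, unsigned b) {
    edges.emplace(std::min(a, b), std::max(a, b));
  }
  bool hasEdge(unsigned a, unsigned b) const {
    return edges.count({std::min(a, b), std::max(a, b)}) != 0;
  }
};

int main() {
  // G_p for a three-qubit path device: (p0) --- (p1) --- (p2).
  QubitGraph gp;
  gp.numQubits = 3;
  gp.addEdge(0, 1);
  gp.addEdge(1, 2);
  // p0 and p2 are not adjacent, so a two-qubit operation acting on them
  // cannot be executed directly; it first needs routing.
  bool adjacent = gp.hasEdge(0, 2); // false
  (void)adjacent;
}
```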

Placement. The goal is to find a subgraph monomorphism $\pi : V \to P$ that respects $\{v_i, v_j\} \in E_r \implies \{\pi(v_i), \pi(v_j)\} \in E_p$. If such a monomorphism exists, then the mapping task requires only a relabeling of the virtual qubits to physical qubits, i.e., replacing each virtual qubit $v_i$ in the unmapped program by the physical qubit $\pi(v_i)$. Otherwise, we must resort to routing.
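Checking whether a given assignment $\pi$ is a valid placement is straightforward; here is a minimal sketch (a hypothetical helper using the ad-hoc QubitGraph type from the previous snippet, not the prototype's API):

```cpp
#include <vector>

// Returns true if `pi` is a valid placement, i.e., every required edge
// {v_i, v_j} in G_r maps to an edge {pi[v_i], pi[v_j]} in G_p.
// Illustrative helper only; `QubitGraph` is the ad-hoc type sketched above.
bool isValidPlacement(const QubitGraph &gr, const QubitGraph &gp,
                      const std::vector<unsigned> &pi) {
  for (auto [vi, vj] : gr.edges)
    if (!gp.hasEdge(pi[vi], pi[vj]))
      return false;
  return true;
}
```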

Routing. Given an initial placement, routing transforms a program so that all two-qubit operations act on physically adjacent qubits. Thus, our goal becomes finding a way of mapping a program onto a device architecture with low overhead, whether in the number of additional operations or in depth.

Example: Take the following Quake kernel:

func.func @foo() {
  %0 = quake.alloca : !quake.ref
  %1 = quake.alloca : !quake.ref
  %2 = quake.alloca : !quake.ref
  %3 = quake.alloca : !quake.ref

  quake.h %0 : (!quake.ref) -> ()
  quake.t %0 : (!quake.ref) -> ()
  quake.t %1 : (!quake.ref) -> ()
  quake.t %2 : (!quake.ref) -> ()
  quake.x [%1] %2 : (!quake.ref, !quake.ref) -> ()
  quake.x [%0] %1 : (!quake.ref, !quake.ref) -> ()
  quake.x [%2] %0 : (!quake.ref, !quake.ref) -> ()
  quake.t %0 : (!quake.ref) -> ()
  quake.t<adj> %1 : (!quake.ref) -> ()
  quake.x [%2] %1 : (!quake.ref, !quake.ref) -> ()
  quake.t<adj> %1 : (!quake.ref) -> ()
  quake.t<adj> %2 : (!quake.ref) -> ()
  quake.x [%0] %1 : (!quake.ref, !quake.ref) -> ()
  quake.x [%2] %0 : (!quake.ref, !quake.ref) -> ()
  quake.x [%1] %2 : (!quake.ref, !quake.ref) -> ()
  quake.h %0 : (!quake.ref) -> ()
  quake.x [%3] %0 : (!quake.ref, !quake.ref) -> ()
  quake.x [%0] %3 : (!quake.ref, !quake.ref) -> ()
  return
}

The graph that models the connectivity requirements, $G_r$, has edges $\{v_0, v_1\}$, $\{v_0, v_2\}$, $\{v_1, v_2\}$, and $\{v_0, v_3\}$: [image]

We can map this kernel to the following device (left) without the need to add new operations: [image]

However, we cannot do the same for the following one: [image]

In this case, there was no placement that satisfied all connectivity requirements, so we chose one, and now we need to modify the circuit so that the operations between qubits 1 and 2 (red edge) can be applied.

Prototype

Device

This prototype adds a Device type (cudaq/Support/Device.h) to hold information about the target device. Currently, it only has what is necessary for mapping:

  * a type to represent physical qubits, Device::Qubit, which is just an unsigned integer;
  * static methods to create mock devices: Device::path(...), Device::ring(...), Device::star(...), and Device::grid(...).
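For orientation, using a mock device might look roughly like the sketch below. The query methods shown (areConnected, getDistance) and the cudaq namespace are assumptions for illustration and may not match the actual header.

```cpp
#include "cudaq/Support/Device.h"

void deviceExample() {
  // A mock three-qubit path device: (p0) --- (p1) --- (p2).
  auto device = cudaq::Device::path(3);

  // Hypothetical queries a mapper would need (method names are assumptions):
  // whether two physical qubits are adjacent in the coupling graph, and the
  // shortest-path distance between them.
  bool adjacent = device.areConnected(0, 2); // expected: false on a path
  auto distance = device.getDistance(0, 2);  // expected: 2
  (void)adjacent;
  (void)distance;
}
```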

Mapping pass

I will use an example to explain the inner workings of the pass. Take the following Quake kernel:

func.func @foo() {
  %0 = quake.alloca : !quake.ref
  quake.h %0 : (!quake.ref) -> ()
  %1 = quake.alloca : !quake.ref
  %2 = quake.alloca : !quake.ref
  quake.x [%0] %2 : (!quake.ref, !quake.ref) -> ()
  quake.x [%0] %1 : (!quake.ref, !quake.ref) -> ()
  return
}

Suppose we want to map this kernel to a device with three qubits in a path topology, i.e., (p0) --- (p1) --- (p2), where pn denotes device (physical) qubit $p_n$. The mapper requires kernels to be represented as a directed acyclic graph (DAG), and thus only works with Quake’s value model, i.e., the model where operations use wires instead of qubit references. Hence, before mapping, we need to convert the kernel:

func.func @foo() {
  %0 = quake.null_wire
  %1 = quake.h %0 : (!quake.wire) -> !quake.wire
  %2 = quake.null_wire
  %3 = quake.null_wire
  %4:2 = quake.x [%1] %3 : (!quake.wire, !quake.wire) -> (!quake.wire, !quake.wire)
  %5:2 = quake.x [%4#0] %2 : (!quake.wire, !quake.wire) -> (!quake.wire, !quake.wire)
  return
}

The mapper starts by analyzing the kernel to make sure the preconditions for mapping are met. Currently the preconditions are:

During this analysis, the mapper also takes the opportunity to label each wire with its corresponding virtual qubit and to collect the source operations for all qubits, i.e., the quake.null_wire operations. At the end of this analysis, we will have a vector with all source operations:

  * %0 = quake.null_wire
  * %2 = quake.null_wire
  * %3 = quake.null_wire

and a mapping between each wire and its corresponding virtual qubit:

  * %0 -> v0
  * %1 -> v0
  * %2 -> v1
  * %3 -> v2
  * %4#0 -> v0
  * %4#1 -> v2
  * %5#0 -> v0
  * %5#1 -> v1

where vn denotes virtual qubit $v_n$.
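In MLIR terms, this bookkeeping amounts to a map from SSA wire values to virtual qubit indices plus a vector of source operations. A rough sketch under that assumption (not the prototype's exact code):

```cpp
#include "mlir/IR/Operation.h"
#include "mlir/IR/Value.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallVector.h"

// Rough sketch of the analysis bookkeeping (not the prototype's exact code):
// every wire value is labeled with the virtual qubit it carries, and the
// quake.null_wire operations that create the qubits are collected in order.
struct WireAnalysis {
  // Wire SSA value -> virtual qubit index (v0, v1, ...).
  llvm::DenseMap<mlir::Value, unsigned> wireToVirtualQubit;
  // Source operations, i.e., the quake.null_wire ops, one per virtual qubit.
  llvm::SmallVector<mlir::Operation *> sources;

  // Each result wire of an operation carries the same virtual qubit as the
  // operand wire it continues (e.g., %4#0 inherits v0 from %1).
  void propagate(mlir::Value operandWire, mlir::Value resultWire) {
    wireToVirtualQubit[resultWire] = wireToVirtualQubit.lookup(operandWire);
  }
};
```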

If the preconditions are met, the pass proceeds with placement and routing. This prototype focuses on the routing algorithm while implementing a simple “identity placement,” i.e., a virtual qubit $v_i$ will be placed on the physical qubit $p_i$. Hence, our initial placement is:

  * v0 <-> p0 
  * v1 <-> p1
  * v2 <-> p2

The mapper does routing by implementing the algorithm described in:

Routing starts with the source operations collected during the analysis. These operations form a layer, called the front layer, which is a set of operations that have no unmapped predecessors. In the case of these source operations, the router only needs to iterate over the front layer while visiting all users of each operation. The processing of this front layer will create a new front layer containing all operations that have been visited the same number of times as their number of wire operands.

After processing the very first front layer, the algorithm proceeds to process the newly created front layer. Once again, it processes the front layer and maps all operations that are compatible with the current placement, i.e., one-wire operations and two-wire operations using wires that correspond to qubits that are adjacently placed in the device. When an operation is successfully mapped, it is removed from the front layer and all its users are visited---those users that have no unmapped predecessors are added to the front layer. If the mapper cannot successfully map any operation in the front layer, then it adds a swap to the circuit and tries to map the front layer again. The routing process ends when the front layer is empty.
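In rough C++-flavored pseudocode, the loop described above might look as follows; the helpers (isExecutable, mapOperation, visitUsers, insertBestSwap) are hypothetical stand-ins for the prototype's actual routines:

```cpp
#include <utility>
#include <vector>

struct Op; // placeholder for an operation in the kernel's DAG

// Hypothetical helpers standing in for the prototype's actual routines.
bool isExecutable(Op *op);            // one-wire op, or wires on adjacent device qubits
void mapOperation(Op *op);            // rewrite the op in terms of physical qubits
std::vector<Op *> visitUsers(Op *op); // users whose predecessors are now all mapped
void insertBestSwap(std::vector<Op *> &frontLayer); // add a swap, update the placement

// Sketch of the routing loop: map everything executable in the front layer,
// promote newly ready users, and insert a swap whenever the layer is stuck.
void route(std::vector<Op *> frontLayer) {
  while (!frontLayer.empty()) {
    std::vector<Op *> nextLayer;
    bool mappedSomething = false;
    for (Op *op : frontLayer) {
      if (isExecutable(op)) {
        mapOperation(op);
        mappedSomething = true;
        // Users with no remaining unmapped predecessors become ready.
        for (Op *user : visitUsers(op))
          nextLayer.push_back(user);
      } else {
        // Keep unmapped operations in the layer for the next attempt.
        nextLayer.push_back(op);
      }
    }
    // If nothing could be mapped, change the placement with a swap before
    // retrying the same front layer.
    if (!mappedSomething)
      insertBestSwap(nextLayer);
    frontLayer = std::move(nextLayer);
  }
}
```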

In our example, after processing the front layer with source operations, we have an updated front layer with only quake.h. Since it is a one-wire operation, we can directly map it and visit its users---which will update the front layer to only contain the first quake.x operation:

//===-------------------------------------------===//
Mapping front layer:
  * %1 = "quake.h"(%0) {operand_segment_sizes = array<i32: 0, 0, 1>} : (!quake.wire) -> !quake.wire--> SUCCESS

//===-------------------------------------------===//
Mapping front layer:
  * %4:2 = "quake.x"(%1, %3) {operand_segment_sizes = array<i32: 0, 1, 1>} : (!quake.wire, !quake.wire) -> (!quake.wire, !quake.wire)--> FAILURE
    + virtual qubits: 0, 2
    + device qubits: 0, 2

Note that this quake.x uses two virtual qubits that are not mapped to adjacent device qubits, hence it fails to map. At this point, we need to modify the placement using swaps:

Choosing a swap:
  Involved device qubits: 0 2
  Swap candidates:
    * 0, 1 (cost = 0.000000e+00)
    * 2, 1 (cost = 5.000000e-01)

  Selected swap: 0, 1

The mapper chooses to add a swap between p0 and p1. This will modify the placement to:

  * v1 <-> p0 
  * v0 <-> p1
  * v2 <-> p2
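For reference, one plausible way to score the swap candidates shown above (an assumption for illustration, not necessarily the prototype's cost function) is the number of swaps still needed by the front layer plus a 0.5-weighted lookahead term; with these weights, the two candidates in the log score 0.0 and 0.5:

```cpp
#include <utility>
#include <vector>

// Hypothetical distance-based swap cost (lower is better). This is an
// assumption for illustration; the prototype's actual cost function may differ.
//   candidate  - pair of physical qubits to exchange
//   frontPairs - virtual-qubit pairs required by the front layer
//   nextPairs  - virtual-qubit pairs of upcoming (lookahead) operations
//   placement  - virtual qubit -> physical qubit, before the swap
//   distance   - shortest-path distance in the device coupling graph
double swapCost(std::pair<unsigned, unsigned> candidate,
                const std::vector<std::pair<unsigned, unsigned>> &frontPairs,
                const std::vector<std::pair<unsigned, unsigned>> &nextPairs,
                std::vector<unsigned> placement,
                unsigned (*distance)(unsigned, unsigned)) {
  // Apply the candidate swap to a copy of the placement.
  for (unsigned &p : placement) {
    if (p == candidate.first)
      p = candidate.second;
    else if (p == candidate.second)
      p = candidate.first;
  }
  // Sum how many swaps each pair of qubits still needs (distance - 1).
  auto swapsNeeded = [&](const std::vector<std::pair<unsigned, unsigned>> &pairs) {
    double sum = 0.0;
    for (auto [v0, v1] : pairs)
      sum += static_cast<double>(distance(placement[v0], placement[v1])) - 1.0;
    return sum;
  };
  return swapsNeeded(frontPairs) + 0.5 * swapsNeeded(nextPairs);
}
```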

With this change, it becomes possible to map the operation that is in the front layer; its users will be visited and the front layer will be updated with the addition of the second quake.x operation. This operation uses two virtual qubits that are adjacently placed, so it can be successfully mapped:

//===-------------------------------------------===//
Mapping front layer:
  * %6:2 = "quake.x"(%5#0, %3) {operand_segment_sizes = array<i32: 0, 1, 1>} : (!quake.wire, !quake.wire) -> (!quake.wire, !quake.wire)--> SUCCESS

//===-------------------------------------------===//

After this, the front layer will be empty and the mapper is done. Here is the full debug information from the pass (using debug-only=quantum-mapper):

//===-------------------------------------------===//
Mapping front layer:
  * %0 = quake.null_wire --> SUCCESS
  * %2 = quake.null_wire --> SUCCESS
  * %3 = quake.null_wire --> SUCCESS
//===-------------------------------------------===//

//===-------------------------------------------===//
Mapping front layer:
  * %1 = "quake.h"(%0) {operand_segment_sizes = array<i32: 0, 0, 1>} : (!quake.wire) -> !quake.wire --> SUCCESS

//===-------------------------------------------===//
Mapping front layer:
  * %4:2 = "quake.x"(%1, %3) {operand_segment_sizes = array<i32: 0, 1, 1>} : (!quake.wire, !quake.wire) -> (!quake.wire, !quake.wire) --> FAILURE
    + virtual qubits: 0, 2
    + device qubits: 0, 2

Choosing a swap:
  Involved device qubits: 0 2
  Swap candidates:
    * 0, 1 (cost = 0.000000e+00)
    * 2, 1 (cost = 5.000000e-01)

  Selected swap: 0, 1

//===-------------------------------------------===//
Mapping front layer:
  * %5:2 = "quake.x"(%2, %4) {operand_segment_sizes = array<i32: 0, 1, 1>} : (!quake.wire, !quake.wire) -> (!quake.wire, !quake.wire) --> SUCCESS

//===-------------------------------------------===//
Mapping front layer:
  * %6:2 = "quake.x"(%5#0, %3) {operand_segment_sizes = array<i32: 0, 1, 1>} : (!quake.wire, !quake.wire) -> (!quake.wire, !quake.wire) --> SUCCESS

//===-------------------------------------------===//

TODO

owen-oqc commented 1 year ago

As far as I understand, this feature works by introducing swap gates. Since swap gates have a fidelity, this will affect the overall error budget of the intended algorithm. While the swap gates are necessary to achieve the aims of the feature, I do not see an option to run in a verbatim mode, and the lack of one would be undesirable where close control of the errors is sought.

bmhowe23 commented 1 year ago

Hi @owen-oqc - yes, that's correct. It's worth noting that some other compiler passes (like multicontrol-decomposition) could also introduce ancilla qubits, so it may be tricky to give the user full control over the manual placement of all qubits. Would you mind opening a new issue to start a discussion there?

We may have to do a near-term solution for 0.5 and a longer-term solution, but I think it merits having its own issue to clarify the precise desires. For example, we could imagine these other related desires (below), and would like to give the community an opportunity to weigh in.

bettinaheim commented 5 months ago

Closing this as is for now. Further improvements will be covered as part of introducing a smarter qubit scheduling logic.