NVIDIA / cuda-quantum

C++ and Python support for the CUDA Quantum programming model for heterogeneous quantum-classical workflows
https://nvidia.github.io/cuda-quantum/

[RFC] Decomposition #85

Closed boschmitt closed 10 months ago

boschmitt commented 1 year ago

This is already a lot, but not everything. I'm posting it to kickstart the discussion.

Definition

In quantum compilation, the term "decomposition" is quite overloaded since many techniques and transformations can fit the idea of "breaking down" (transforming) one gate into a sequence of other gates. Hence, it will be necessary for us to pinpoint what we mean by "decomposition" in the scope of the compiler. One tentative definition is the following:

We define "decomposition" as systematically breaking down high-level gates into a series of lower-level ones. A decomposition technique builds this lower-level sequence by applying some construction rule(s). We emphasize "systematic" because it is the characteristic that differentiates decomposition from other transformation techniques. For example, compared to synthesis, decomposition techniques should be faster and produce predictable results, i.e., we know the number of operations and the number of qubits in the result in advance.

Note that this definition still has some loose ends; specifically, it needs more clarity about which gates are considered high-level and low-level. As we will see next, such clarification requires more context.

Role in compilation

While compiling a technology-independent quantum program, high-level operations must often be broken down into lower-level ones. For example, transforming a two-control Z operation into a sequence of CNOTs and Ts:

                                                                  ┌───┐
  ───●────  ──────────────●───────────────────●──────●─────────●──┤ T ├
     │                    │                   │      │         │  └───┘
     │                    │                   │    ┌─┴─┐┌───┐┌─┴─┐┌───┐
  ───●─── = ────●─────────┼─────────●─────────┼────┤ X ├┤ ┴ ├┤ X ├┤ T ├
     │          │         │         │         │    └───┘└───┘└───┘└───┘
   ┌─┴─┐      ┌─┴─┐┌───┐┌─┴─┐┌───┐┌─┴─┐┌───┐┌─┴─┐                 ┌───┐
  ─┤ z ├─   ──┤ X ├┤ ┴ ├┤ X ├┤ T ├┤ X ├┤ ┴ ├┤ X ├─────────────────┤ T ├
   └───┘      └───┘└───┘└───┘└───┘└───┘└───┘└───┘                 └───┘

(Note: ┴ denotes the adjoint of T.)

We call the above a decomposition pattern, and applying it to a circuit unambiguously breaks down high-level operations into sequences of low-level ones. (This is always the case when breaking down multi-control operations.) However, what is considered high-level (low-level) changes depending on the target device. For example, suppose we want to run a program on different devices. In one of these devices, CNOT is a primitive operation, while in another it is not, but CZ is. Hence, for the first device we need to decompose CZ but not CNOT, while for the second it is the other way around. We could use the following patterns:

CZ decomposition:

  ───●─── = ────────●────────
     │              │
   ┌─┴─┐     ┌───┐┌─┴─┐┌───┐
  ─┤ Z ├─   ─┤ H ├┤ X ├┤ H ├─
   └───┘     └───┘└───┘└───┘

CNOT decomposition:

  ───●─── = ────────●────────
     │              │
   ┌─┴─┐     ┌───┐┌─┴─┐┌───┐
  ─┤ X ├─   ─┤ H ├┤ Z ├┤ H ├─
   └───┘     └───┘└───┘└───┘
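Both patterns are instances of one conjugation identity (standard linear algebra, stated here for reference; not part of the original RFC): since $H$ is self-inverse and $HXH = Z$, sandwiching the target of a controlled gate between Hadamards swaps CNOT and CZ:

$$(I \otimes H)\,\mathrm{CNOT}\,(I \otimes H) = |0\rangle\langle 0| \otimes HH + |1\rangle\langle 1| \otimes HXH = |0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes Z = \mathrm{CZ},$$

and symmetrically $(I \otimes H)\,\mathrm{CZ}\,(I \otimes H) = \mathrm{CNOT}$.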

Ultimately, the compilation process must implement the program using only the primitive unitary operators supported by the specific quantum device. In this sense, the compiler cannot blindly decompose operations; the process must move toward these primitive operations and, hopefully, not get lost or enter a loop. For example, if we were decomposing a two-control Z operation (CCZ) while targeting the second device, we could first apply the pattern for CCZ and then use the CNOT decomposition pattern on the resulting CNOTs---meaning that we compose decomposition patterns: by applying one pattern, we create a sequence of gates that might itself be further decomposed. Note that we would enter a loop if we tried to apply both the CZ and CNOT decomposition patterns.
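To make "moving toward the primitive operations" concrete, here is a minimal sketch in C++ of a convergent decomposition driver; it is purely illustrative (gates are plain strings, wiring is ignored, and none of these names are compiler API). A pattern only fires on gates that are not native to the target, so the CZ and CNOT patterns above can never both be active for one device; termination still assumes the enabled rule set is cycle-free, a point revisited in the comments below.

#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

using Gate = std::string;
using Circuit = std::vector<Gate>;
// A pattern maps one gate to the sequence of gates that replaces it.
using Pattern = std::function<Circuit(const Gate &)>;

Circuit decompose(Circuit circuit, const std::set<Gate> &nativeGates,
                  const std::map<Gate, Pattern> &patterns) {
  bool changed = true;
  while (changed) {
    changed = false;
    Circuit next;
    for (const Gate &g : circuit) {
      auto it = patterns.find(g);
      // Only decompose gates the target does not support natively.
      if (!nativeGates.count(g) && it != patterns.end()) {
        Circuit expansion = it->second(g);
        next.insert(next.end(), expansion.begin(), expansion.end());
        changed = true;
      } else {
        next.push_back(g);
      }
    }
    circuit = std::move(next);
  }
  return circuit;
}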

Before moving on, let's further suppose that H is not a primitive operation in these devices, but Rx is. We would need to transform these Hadamard gates into Rx ones---which, however, is a 1-1 transformation and doesn't fit our decomposition definition because it doesn't "break down" an operation.

Implementation

This RFC discusses two alternative ways of implementing decomposition in our compilation flow. However, these two ways might not be mutually exclusive alternatives; they could be complementary.

Quake or QTX

First, we should address "where" decomposition must happen in the compilation flow. The compiler provides two dialects to represent quantum programs, Quake and QTX, and both have a similar level of abstraction regarding quantum operations. For example, both can have an X operation with an arbitrary number of controls. The difference lies elsewhere: Quake uses memory (reference) semantics and thus is not an SSA representation, while QTX is---in fact, QTX has a few quirks concerning arrays to keep them in an SSA form. The following example shows this difference:

Quake:

func.func @foo() {
  %q0 = quake.alloca : !quake.qref
  %q1 = quake.alloca : !quake.qref
  quake.h %q0
  quake.x [%q0] %q1
  ...
}

QTX:

qtx.circuit @foo() {
  %w0_0 = qtx.alloca : !qtx.wire
  %w1_0 = qtx.alloca : !qtx.wire
  %w0_1 = qtx.h %w0_0 : !qtx.wire
  %w1_1 = qtx.x [%w0_1] %w1_0 : !qtx.wire
  ...
}

In Quake, the quantum operations do not create new values, so both X and H act on the same qubit reference %q0. If Quake were SSA, this would imply that changing the order of X and H would not affect the result---which is not the case. In QTX, quantum operations return new values for their targets but not for their controls; this aligns with the definition of SSA and shows that the order of two operations that share control wires, but not targets, can be changed without affecting the result.

The question remains: should decomposition happen in Quake or in QTX?

Doing decomposition in Quake is more straightforward, as the discussion above demonstrates. Indeed, we would rewrite the operation using a specific decomposition pattern, and we can more or less straightforwardly rely on the existing MLIR infrastructure to do it.

Decomposition in QTX is more complicated: a simple pattern such as the "CZ" one wouldn't be problematic, since it only uses one target and produces one new value. However, the "CCZ" pattern complicates things: when using this pattern, we substitute an operation that generates one new value with a sequence of operations that creates three---the controls become targets. This means that when decomposing a CCZ, say %new_t = x [%c0, %c1] %t, we need to change any uses of %c0 and %c1 that come after this operation, but not the ones that come before. (Unfortunately, MLIR does not seem to be built to handle this sort of rewriting, so it requires a bit more work.)
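For illustration, here is a minimal sketch of that extra bookkeeping using upstream MLIR utilities (Value::replaceUsesWithIf and DominanceInfo do exist in MLIR; the helper name and the QTX setting are assumptions). Within a single block, properlyDominates reduces to "comes earlier in the block", which is exactly the "uses after this operation" condition:

#include "mlir/IR/Dominance.h"
#include "mlir/IR/Value.h"

// Rewire one control of the operation `op` being decomposed: every use of
// `oldControl` that comes after `op` must switch to `newControl`, the value
// the replacement sequence produced for that control. Uses before `op`
// (and `op`'s own operand) keep the old value.
static void rewireControl(mlir::Operation *op, mlir::Value oldControl,
                          mlir::Value newControl,
                          mlir::DominanceInfo &domInfo) {
  oldControl.replaceUsesWithIf(newControl, [&](mlir::OpOperand &use) {
    mlir::Operation *user = use.getOwner();
    return user != op && domInfo.properlyDominates(op, user);
  });
}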

Despite the complications, I think decomposition should still happen in QTX. The reasoning here is that decompositions do not occur in isolation, and we should retain high-level operations until optimization passes are run. These optimization passes are more likely to exist in QTX, since explicit data dependencies in the program structure give the dialect a great advantage for specific optimizations; it is thus the dialect used for optimization. Therefore, the more likely compilation flow looks like this:

  1. AST to Quake
  2. Quake to QTX
  3. Optimize
  4. Decompose
  5. Optimize more

If we decompose too early, we will hinder the ability to optimize high-level operations: once they are decomposed, it becomes less likely that the compiler will be able to optimize them out completely.
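As a sketch, this flow could be wired up as an MLIR pass pipeline along the following lines (mlir::PassManager is upstream MLIR; every create* factory name below is a hypothetical placeholder for whatever conversion and optimization passes the compiler actually provides):

#include "mlir/Pass/PassManager.h"

// Steps 2-5 of the list above; step 1 (AST to Quake) happens in the
// frontend before the pass manager runs.
void buildPipeline(mlir::PassManager &pm) {
  pm.addPass(createConvertQuakeToQTXPass()); // 2. Quake to QTX (hypothetical)
  pm.addPass(createQTXOptimizePass());       // 3. Optimize (hypothetical)
  pm.addPass(createQTXDecomposePass());      // 4. Decompose (hypothetical)
  pm.addPass(createQTXOptimizePass());       // 5. Optimize more (hypothetical)
}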

Alternatively, we could keep converting between Quake and QTX:

  1. AST to Quake
  2. Quake to QTX
  3. Optimize
  4. QTX to Quake
  5. Decompose
  6. Quake to QTX
  7. Optimize more

To consider this idea, we need to further investigate how costly this conversion is and whether it scales. An interesting advantage of doing decomposition in Quake is that we can keep short-circuiting the compiler when optimizations are turned off---since there would be no need to use QTX.

Compiler rewrite patterns or library function

Now we turn to whether we should implement decomposition using compiler rewrite patterns or library functions. I will use the following toy example to guide the discussion:

CUDA Quantum:

void foo() __qpu__ {
  cudaq::qubit c0, c1, t;
  x<cudaq::ctrl>(c0, c1, t);
}

QTX:

qtx.circuit @foo() {
  %c0 = qtx.alloca : !qtx.wire
  %c1 = qtx.alloca : !qtx.wire
  %t = qtx.alloca : !qtx.wire
  %t_0 = x [%c0, %c1] %t : [!qtx.wire, !qtx.wire] !qtx.wire
}

We want to decompose the CCX gate into a sequence of CXs and Ts. The first step could be decomposing CCX into CCZ and H, which would then be followed by the decomposition of CCZ.

  ───●────  ────────●────────
     │              │
     │              │
  ───●─── = ────────●────────
     │              │
   ┌─┴─┐     ┌───┐┌─┴─┐┌───┐
  ─┤ x ├─   ─┤ H ├┤ Z ├┤ H ├─
   └───┘     └───┘└───┘└───┘
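This first step is the same Hadamard-conjugation identity used for the CZ/CNOT patterns earlier, lifted to three qubits (a standard identity, stated here for reference): since $HZH = X$ on the target and the controls are untouched,

$$\mathrm{CCX} = (I \otimes I \otimes H)\,\mathrm{CCZ}\,(I \otimes I \otimes H).$$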

Let's look at how a decomposition pass would differ in both cases. For reference, here is how the rewrite pattern and library function would look:

Rewrite pattern:

struct XOpDecomposition : public OpRewritePattern<qtx::XOp> {
  using OpRewritePattern<qtx::XOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(
    qtx::XOp op, PatternRewriter &rewriter) const override {

    Location loc = op->getLoc();
    Value t = op.getTarget();
    // createOp is a helper that creates the operation and returns the new
    // value of its target wire.
    t = createOp<qtx::HOp>(rewriter, loc, t);
    t = createOp<qtx::ZOp>(rewriter, loc, op.getControls(), t);
    t = createOp<qtx::HOp>(rewriter, loc, t);
    op.getResult(0).replaceAllUsesWith(t);
    rewriter.eraseOp(op);
    return success();
  }
};

Library function:

void x_decomp(cudaq::qubit &c0,
              cudaq::qubit &c1,
              cudaq::qubit &t) {
  h(t);
  z<cudaq::ctrl>(c0, c1, t);
  h(t);
}

(Note: I am assuming that decomposition is done on QTX. I will try to enumerate the differences if it were done in Quake; for one, the rewrite patterns would be simpler.)

Bird's eye view

The result of using a compiler rewrite pattern is straightforward. The decomposition pass will walk the IR and rewrite, in place, all operations that have a registered decomposition pattern:

qtx.circuit @foo() {
  %c0 = qtx.alloca : !qtx.wire
  %c1 = qtx.alloca : !qtx.wire
  %t = qtx.alloca : !qtx.wire
  %t_0 = h %t : !qtx.wire
  %t_1 = z [%c0, %c1] %t_0 : [!qtx.wire, !qtx.wire] !qtx.wire
  %t_2 = h %t_1 : !qtx.wire
}
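A minimal sketch of such a pass using MLIR's greedy pattern driver (RewritePatternSet and applyPatternsAndFoldGreedily are upstream MLIR; qtx::CircuitOp and the pass itself are assumptions based on this RFC):

#include "mlir/Pass/Pass.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

struct DecompositionPass
    : public mlir::PassWrapper<DecompositionPass,
                               mlir::OperationPass<qtx::CircuitOp>> {
  void runOnOperation() override {
    mlir::RewritePatternSet patterns(&getContext());
    // Register every decomposition pattern, e.g. the one shown above.
    patterns.add<XOpDecomposition>(&getContext());
    if (mlir::failed(mlir::applyPatternsAndFoldGreedily(
            getOperation(), std::move(patterns))))
      signalPassFailure();
  }
};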

With library functions, the decomposition pass will walk the IR and substitute all operations that have a registered decomposition with a call to the library function that implements it:

qtx.circuit @x_decomp(%c0: !qtx.wire, %c1: !qtx.wire, %t: !qtx.wire) -> (!qtx.wire, !qtx.wire, !qtx.wire) {
  %t_0 = h %t : !qtx.wire
  %t_1 = z [%c0, %c1] %t_0 : [!qtx.wire, !qtx.wire] !qtx.wire
  %t_2 = h %t_1 : !qtx.wire
  return %c0, %c1, %t_2 : !qtx.wire, !qtx.wire, !qtx.wire
}

qtx.circuit @foo() {
  %c0 = qtx.alloca : !qtx.wire
  %c1 = qtx.alloca : !qtx.wire
  %t = qtx.alloca : !qtx.wire
  %c0_0, %c1_0, %t_0 = qtx.call @x_decomp(%c0, %c1, %t) : ...
}

In this simple example, an extra inlining step is necessary to get us to the same result as the solution based on rewrite patterns.
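A sketch of that extra step (mlir::createInlinerPass is a real upstream pass; whether it handles qtx.call/qtx.circuit out of the box is an assumption, and a dialect-specific inlining interface may be required):

#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

void addLibraryDecompositionCleanup(mlir::PassManager &pm) {
  // Inline the injected decomposition functions so the caller ends up with
  // the same flat sequence that the rewrite-pattern approach produces.
  pm.addPass(mlir::createInlinerPass());
}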

Details

The discussion until this point served the purpose of painting a broader picture of what decomposition is, its role in the compilation, and a bird's eye view of the two alternative implementations. At this point, these alternatives might seem similar. However, choosing one over the other will have nontrivial implications for the compiler and the compilation process.

I will assume that using the rewrite patterns is the baseline ("natural") way of implementing decomposition. This is because the rewrite patterns and library functions will look similar (minus the MLIR boilerplate, which we can minimize). Hence, from the point of view of someone implementing decomposition patterns, there is little difference (code-wise). Also, we already have decomposition implemented as rewrite patterns, so I presume the implications of this method are understood and will focus on the details of using library functions.

The main disadvantage of the rewrite patterns method is that adding or changing decompositions requires changing the compiler. This disadvantage primarily motivates the use of library functions.

The first couple of questions that arise for using library functions are:

We could place these functions in a header file, say cudaq/decomposition.h, that would be included in the cudaq header, guaranteeing that the decomposition functions will be available to the compiler. The downside here is that the compiler will do the work of parsing and compiling all decomposition functions in all translation units. Also, note that the compiler will need to know the names of these functions, so adding new decomposition patterns would still require changing the compiler unless we pass these names as configuration options.

Once we settle these questions, we turn to what will happen during a compilation that relies on library functions. The careful reader might have noticed that in our previous example, the rewrite pattern handles X operations with an arbitrary number of controls, while the library function is limited to operations with two controls. We can fix this by changing the library function:

void x_decomp(cudaq::qreg &controls, cudaq::qubit &t) {
  h(t);
  z<cudaq::ctrl>(controls, t);
  h(t);
}

This new function complicates decomposition in QTX because the number of controls is only bound at decomposition time. Specifically, the compiler won't be able to lower this function to QTX, so before doing decomposition, we will have something like this:

func.func @x_decomp(%controls: !quake.vec<?>, %t: !quake.qref) {
  quake.h %t
  quake.z [%controls] %t
  quake.h %t
}

qtx.circuit @foo() {
  %c0 = qtx.alloca : !qtx.wire
  %c1 = qtx.alloca : !qtx.wire
  %t = qtx.alloca : !qtx.wire
  %t_0 = x [%c0, %c1] %t : [!qtx.wire, !qtx.wire] !qtx.wire
}

Now, the decomposition pass needs to deal with the following problems:

After dealing with these issues, we are still left with the problem that likely burdens the compiler the most. When using a library function, the decomposition process does not straightforwardly generate the sequence of operations that implements the gate; instead, it injects a function that, if executed, would do so. We then rely on the compiler's optimization capabilities to recover this sequence. In our toy example, this was a simple inlining of the function; however, not all decompositions are this simple.

Summary

It seems to me that using rewrite patterns is the better solution. We already have simple decompositions implemented using them, and we understand the technical implications. Furthermore, from a compiler developer's perspective, the difference between writing a pattern and writing a decomposition function seems minimal, so there is little to gain. (This assumes all the infrastructure required to support library functions for decomposition is in place.) From a user perspective, library functions could enable an easy way to change decompositions without changing the compiler, but what percentage of users would want to do that? And since there is an extra cost for allowing this, are we willing to impose it on all users?

bettinaheim commented 1 year ago

I've got a suggestion for a revised definition of decomposition. As you describe, "higher-level" and "lower-level" are largely meaningless without a backend definition. What matters is which instructions are supported by a backend. We are also specifically talking about the instructions on the quantum accelerator rather than anything else, and I would include measurements in that definition, since these vary depending on the backend as well. The need for automatic decomposition arises because application code is written against a generic API that shouldn't (need to) be QPU specific. Hence, I believe the following definition is useful:

We define "decomposition" as the translation of the CUDA Quantum default quantum instruction set as defined here (*) into a backend specific set of quantum instructions. By quantum instructions, we specifically mean the set of instructions that are exposed by a particular quantum backend architecture (links here will also be nice in the future). Note that quantum instructions may have a destructive effect on the quantum state.

The rest seems like an implementation detail to me - in the end, what matters is that after compilation, the compiled quantum kernels only contain what is permitted according to the IR profile specification for that target.

bettinaheim commented 1 year ago

I would like to make the following distinction for the purpose of discussion:

These topics may be related but I believe it helps to consider them as separate functionalities to be handled by separate components or passes in the stack. We can discuss code reuse (e.g. of analysis passes) for multiple of these once we have a clear architectural understanding of how to approach each one.

bettinaheim commented 1 year ago

Let me also make a stab at formulating the goals for the implementation - how to achieve them is something we are pretty free on and is up for discussion here:

bettinaheim commented 1 year ago

Regarding making it easy to define custom decompositions, in principle we can choose any format for how to define these. However, I would consider/prioritize the following:

amccaskey commented 1 year ago

Thanks for writing this all up @boschmitt. There is a lot of great content here. I decided to take some time over the weekend to think on this.

Ultimately I think that there can be room for both of these approaches, and that by doing so we address the goals laid out above by @bettinaheim.

One can always contribute an MLIR RewritePattern pass for any kind of decomposition / synthesis strategy. This, in my opinion, should be the default for most cases that are implemented by core CUDA Quantum engineers. I also particularly think that this should be the approach for decomposing multi-control operations into single- and two-qubit operations.

But I do think that we should think about how one could contribute something like simple native gate set mappings (but not infinite-to-finite instruction set mappings, like multi-ctrl decompositions) as external MLIR code. My thinking here (and I think this is how @bettinaheim is thinking too) is as follows - an external developer with no knowledge of MLIR should be able to come in and define a gate set mapping via CUDA Quantum C++ kernel code, use cuda-quake to generate an MLIR file that can then be installed to a known plugin folder. It would be easy to extend the compiler to look in that location, if requested, to perform the mapping. To me this is an on-ramp for external developers and it gets more vendors involved quickly with exposing CUDA Quantum to their hardware. They will lose any optimization opportunities after their MLIR code has been injected, but they can go the RewritePattern route if that's what they want.
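As a rough sketch of the plugin mechanics (mlir::parseSourceFile is upstream MLIR; the plugin-folder convention and how the mapping is merged into the program are assumptions):

#include "mlir/IR/BuiltinOps.h"
#include "mlir/Parser/Parser.h"

// Parse a vendor-provided MLIR file (e.g. generated from CUDA Quantum kernel
// code) found in a known plugin folder; the compiler would then link or
// inline the gate-set-mapping functions into the module being compiled.
mlir::OwningOpRef<mlir::ModuleOp>
loadGateSetMapping(mlir::MLIRContext &context, llvm::StringRef path) {
  return mlir::parseSourceFile<mlir::ModuleOp>(path, &context);
}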

schweitzpgi commented 1 year ago

Note that this definition still has some loose ends; specifically, it needs more clarity about which gates are considered high-level and low-level. As we will see next, such clarification requires more context.

The QIR jargon for this low-level set of operations is "base profile", AFAIK. We could also use terms like target instruction/gate set or target primitive gates to borrow from compiler vocabulary.

In this sense, the compiler cannot blindly decompose operations; the process must move toward these primitive operations and, hopefully, not get lost or enter a loop.

To sum up in a word: convergent. If you think of the rewrites as nodes in a lattice, we want the rewrite rules to flow in a single direction. If we can constrain this a bit, we could prove (offline) whether there are cycles in the lattice. (In your example, that means determining that X -> { Z ... } and Z -> { X ... } form a cycle.)
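A sketch of that offline check (all names illustrative, not compiler API): treat each rule G -> { H1, H2, ... } as directed edges over gate kinds and run a DFS cycle check, so a rule set containing both CZ -> { H, CNOT, H } and CNOT -> { H, CZ, H } is rejected:

#include <map>
#include <set>
#include <string>
#include <vector>

using Rules = std::map<std::string, std::vector<std::string>>;

static bool hasCycleFrom(const Rules &rules, const std::string &gate,
                         std::set<std::string> &onPath,
                         std::set<std::string> &done) {
  if (done.count(gate))
    return false;                 // already proven cycle-free
  if (!onPath.insert(gate).second)
    return true;                  // back edge: a cycle exists
  auto it = rules.find(gate);
  if (it != rules.end())
    for (const std::string &produced : it->second)
      if (hasCycleFrom(rules, produced, onPath, done))
        return true;
  onPath.erase(gate);
  done.insert(gate);
  return false;
}

// Returns true if repeatedly applying the rules always terminates.
bool rulesConverge(const Rules &rules) {
  std::set<std::string> done;
  for (const auto &entry : rules) {
    std::set<std::string> onPath;
    if (hasCycleFrom(rules, entry.first, onPath, done))
      return false;
  }
  return true;
}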

Some additional thoughts on the solution space: