NVIDIA / cuda-quantum

C++ and Python support for the CUDA Quantum programming model for heterogeneous quantum-classical workflows
https://nvidia.github.io/cuda-quantum/

[RFC] On the need for expressing the language in Python #639

Closed amccaskey closed 10 months ago

amccaskey commented 11 months ago

Executive Summary

Our current model of expressing quantum code in Python via the kernel_builder does not scale and is now beginning to hurt the performance of our library and simulation work. We should start work on embedding the CUDA Quantum language specification in Python itself (note that this has been the plan; I'm proposing we do it sooner rather than later).

Background

We have devoted the majority of our resources to extending C++ for quantum computing as specified in the CUDA Quantum language model. This is a major task and has required visitor extensions to the Clang compiler framework, MLIR dialects and passes, and an associated runtime library for algorithms and simulation. We decided early on that an easy way to get CUDA Quantum to Python programmers sooner rather than later was to simply expose the cudaq::kernel_builder type, allowing programmers to build Quake code internally and JIT compile it before execution via the runtime library. This has come at the cost of not fully exposing the CUDA Quantum language model to Python users, but our goal has been to embed the language in Python in the future. The language model is needed to efficiently express quantum-classical code in CUDA Quantum kernels with full generality in control flow.

This JIT model has come at a price: in the absence of JIT caching (which we do not do yet, see #517), the JIT compilation step for large circuits degrades performance, even when targeting our GPU simulation backends (making the entire workflow appear slow even though the simulation itself runs on GPUs). There are many workloads where JIT caching will not help, so it should not be considered a near-term workaround. When the kernel_builder is used in an interpreted environment like Python, all generated Quake code is fully unrolled and lacks any non-trivial control flow. Hence, the code the internal mlir::ExecutionEngine has to JIT compile is very large, and that compilation time dominates the workflow.
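To make the unrolling concrete, here is a minimal sketch using the existing builder API (the circuit size is illustrative only); printing the kernel shows that every loop iteration becomes a separate operation in the generated Quake:

import cudaq

# Minimal illustration of the unrolling problem: each iteration of the Python
# loop appends another gate to the Quake module, so the IR that must be JIT
# compiled grows linearly with the circuit and retains no loop structure.
kernel = cudaq.make_kernel()
qubits = kernel.qalloc(20)
kernel.h(qubits[0])
for i in range(19):
    kernel.cx(qubits[i], qubits[i + 1])
print(kernel)  # the printed MLIR contains one quake operation per iteration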

Solution - Begin Language Implementation

The proposed solution to this problem is to start on the CUDA Quantum language embedding in Python. Of course, that raises the question: what does that mean? The full goal would be some kind of Python ASTBridge, just like we have in C++. I think that is too lofty for the project at this point; we should instead step towards it incrementally. A first step would be to embed library mode in Python, which means exposing the qubit_qis.h functions and types to Python and allowing expressions like the following:

@cudaq.kernel
def simple(numQubits):
    qubits = cudaq.qvector(numQubits)
    h(qubits.front())
    for i, qubit in enumerate(qubits.front(numQubits-1)):
        x.ctrl(qubit, qubits[i+1])

counts = cudaq.sample(simple, 10)

By embedding the language in this way we perfectly emulate the C++ library mode (the default for simulations). Each interpreted statement leads to an equivalent function call in the cudaq.so library, which directly drives simulation through the nvqir::CircuitSimulator API. There is no mapping to Quake here. In the future I believe we can adapt this model to map directly to Quake and pick up our compiler optimization / transformation workflow.
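As a rough illustration of what library mode means here (a sketch under my own naming assumptions, not the actual cudaq implementation), the decorator in this mode can be little more than a thin wrapper, since every gate call in the body already dispatches directly into the runtime library:

import functools

# Hypothetical library-mode decorator; names and structure are illustrative
# only and do not reflect the actual cudaq implementation.
def kernel(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # No Quake/MLIR is generated here: each h/x/ry call inside `func`
        # goes straight to the simulator through the C++ runtime bindings.
        return func(*args, **kwargs)
    return wrapper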

Let's take a quick look at the benefits of this approach. Consider a hardware-efficient ansatz parameterized on the number of entangling layers. In the current model, this would look like

# numQubits and numLayers are assumed to be defined earlier in the host program
kernel, thetas = cudaq.make_kernel(list)
qubits = kernel.qalloc(numQubits)
t = 0
for i in range(numQubits):
    kernel.ry(thetas[t], qubits[i])
    kernel.rz(thetas[t+1], qubits[i])
    t += 2

for l in range(numLayers):
    for i in range(numQubits-1):
        kernel.cx(qubits[i], qubits[i+1])
    for i in range(numQubits):
        kernel.ry(thetas[t], qubits[i])
        kernel.rz(thetas[t+1], qubits[i])
        t += 2

In the runtime model, this would look like

@cudaq.kernel
def hwe(p):
    q = cudaq.qvector(numQubits)
    t = 0
    for i in range(numQubits):
        ry(p[t], q[i])
        rz(p[t+1], q[i])
        t += 2

    for l in range(numLayers):
        for i in range(numQubits-1):
            x.ctrl(q[i], q[i+1])
        for i in range(numQubits):
            ry(p[t], q[i])
            rz(p[t+1], q[i])
            t += 2

If we time kernel.jit_code() versus simply calling the hwe runtime kernel, we see where the bottleneck really is (see the attached jitTimes plot). Note that the kernel_builder JIT time does not even include the time it would then take to execute the kernel, whereas the hwe runtime kernel directly drives simulation on the backend and completes in a few milliseconds.
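For anyone who wants to reproduce this kind of comparison, a rough timing sketch of the builder path is shown below (the sizes and measurement harness are my own and not the exact benchmark behind the plot):

import time
import cudaq

# Illustrative sizes only; the real benchmark sweeps qubit counts and layers.
numQubits, numLayers = 12, 4

start = time.perf_counter()
kernel, thetas = cudaq.make_kernel(list)
qubits = kernel.qalloc(numQubits)
t = 0
for i in range(numQubits):
    kernel.ry(thetas[t], qubits[i])
    kernel.rz(thetas[t + 1], qubits[i])
    t += 2
for l in range(numLayers):
    for i in range(numQubits - 1):
        kernel.cx(qubits[i], qubits[i + 1])
    for i in range(numQubits):
        kernel.ry(thetas[t], qubits[i])
        kernel.rz(thetas[t + 1], qubits[i])
        t += 2
params = [0.1] * (2 * numQubits * (numLayers + 1))
cudaq.sample(kernel, params)  # includes the JIT compile of the unrolled Quake
print("builder build + JIT + execute:", time.perf_counter() - start, "s")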

In the future, when we have a better handle on mapping Python kernels to Quake, I think we can better leverage JIT caching (with control flow in the MLIR code, I suspect more efficient kernels can be created that are applicable across many iterations of execution in a typical application workflow). But in the absence of control flow from Python to Quake, programmers are likely to incur this JIT compilation penalty for every circuit created in a large hybrid application.

Addressing remote QPUs

Our current remote QPU execution model relies on the Quake representation. A purely runtime-library approach will therefore not work for this model. This can be circumvented via a new ExecutionManager extension that does not delegate quantum instruction invocations to the NVQIR library, but instead builds the Quake representation.

schweitzpgi commented 11 months ago

Can you explain how the different syntax does (or does not) change the conceptual model? (It doesn't seem like it does.) On the other hand, there is a difference in that the current approach builds a kernel at the MLIR level, while the proposal uses decorators to modify a "default" Python object at the Python level. If I understand correctly, that Python object skips MLIR and JIT compilation entirely. All this is a roundabout way of asking whether the decorator pattern is critical to the performance gain, or whether the original "kernel builder" syntax could be kept but the MLIR and JIT steps omitted to make it more performant.

amccaskey commented 11 months ago

From a strictly performance perspective, you could definitely provide a "kernel-builder-like" API that skips building Quake/CC code and directly drives execution on the backend simulators. However, I'm thinking about more than performance with the introduction of this decorator-based language extension. In my opinion, we should be moving away from these builder-like APIs toward a language embedding, so that we pick up the host language's classical control flow just like we do in C++. I want to move us away from

kernel = cudaq.make_kernel()
...
kernel.for_loop(...)
...
kernel.c_if(...)

to the more natural and flexible

@cudaq.kernel 
def kernel():
    ...
    for i in range(numQubits):
        for k in range(10):
            ...
    ...
    if mz(q[0]):
        x.ctrl(q[2], q[3])

and this latter syntax is mapped to MLIR at runtime via the decorator, in a more compact way that retains CC control flow. The decorator is required because it gives us a hook we can grab and map to MLIR.
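To illustrate why the decorator is the natural hook (an illustrative sketch only, not the actual cudaq implementation), note that at decoration time we can recover the function's Python AST and walk it, which is exactly the entry point needed to emit MLIR with structured control flow preserved:

import ast
import inspect

# Hypothetical sketch: grab the decorated function's source, parse it to an
# AST, and visit it. A real implementation would emit Quake/CC operations
# here, keeping for-loops and if-statements as structured control flow.
def kernel(func):
    tree = ast.parse(inspect.getsource(func))
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.If)):
            print("found control flow:", type(node).__name__)
    return func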